Message-ID: <CAEH94Ljpih875VOEdCY9Z9CdMF_1DesW7dsz_k7Zm=xt165w5A@mail.gmail.com>
Date: Mon, 29 Jul 2013 09:38:11 +0800
From: Zhi Yong Wu <zwu.kernel@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: xfstests <xfs@....sgi.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
linux-kernel mlist <linux-kernel@...r.kernel.org>,
Zhi Yong Wu <wuzhy@...ux.vnet.ibm.com>
Subject: Re: [PATCH] xfs: introduce object readahead to log recovery
On Fri, Jul 26, 2013 at 7:35 PM, Dave Chinner <david@...morbit.com> wrote:
> On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
>> Dave,
>>
>> All of your comments look good to me and will be applied in the next version. Thanks a lot.
>>
>> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@...morbit.com> wrote:
>> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@...il.com wrote:
>> >> From: Zhi Yong Wu <wuzhy@...ux.vnet.ibm.com>
>> >>
>> >> Log recovery can take a long time because it is single threaded and
>> >> bound by read latency. Most of that time is spent waiting for read
>> >> IO to complete, so introducing object readahead to log recovery
>> >> will reduce the log recovery time.
>> >>
>> >> In dirty log case as below:
>> >> data device: 0xfd10
>> >> log device: 0xfd10 daddr: 20480032 length: 20480
>> >>
>> >> log tail: 7941 head: 11077 state: <DIRTY>
>> >
>> > That's only a small log (10MB). As I've said on irc, readahead won't
>> Yeah, it is a 10MB log, but how do you calculate that from the above info?
>
> length = 20480 blocks. 20480 * 512 = 10MB....
Thanks.
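As a quick check of the arithmetic above, here is a minimal C sketch that
converts a log length reported in 512-byte blocks into bytes and whole MiB.
The helper names (log_bytes, log_mib) and the BLOCK_SIZE constant are made up
for illustration; only the 512-byte multiplier comes from the reply above.

```c
#include <stdint.h>

/* Bytes per block, taken from the "20480 * 512" calculation in the thread. */
#define BLOCK_SIZE 512u

/* Hypothetical helper: log size in bytes for a length given in blocks. */
static uint64_t log_bytes(uint64_t length_blocks)
{
	return length_blocks * BLOCK_SIZE;
}

/* Hypothetical helper: log size in whole MiB, rounded down. */
static uint64_t log_mib(uint64_t length_blocks)
{
	return log_bytes(length_blocks) >> 20;
}
```

With these helpers, log_mib(20480) evaluates to 10, and log_mib(4173824)
evaluates to 2038, which matches the "almost 2GB" annotation on the larger
log quoted further down.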
>
>> > And the recovery time from this is between 15-17s:
>> >
>> > ....
>> > log device: 0xfd20 daddr: 107374182032 length: 4173824
>> >                                                ^^^^^^^ almost 2GB
>> > log tail: 19288 head: 264809 state: <DIRTY>
>> > ....
>> > real 0m17.913s
>> > user 0m0.000s
>> > sys 0m2.381s
>> >
>> > And runs at 3-4000 read IOPs for most of that time. It's largely IO
>> > bound, even on SSDs.
>> >
>> > With your patch:
>> >
>> > log tail: 35871 head: 308393 state: <DIRTY>
>> > real 0m12.715s
>> > user 0m0.000s
>> > sys 0m2.247s
>> >
>> > And it peaked at ~5000 read IOPS.
>> How do you know that the read IOPS peaked at ~5000?
>
> Other monitoring. iostat can tell you this, though I use PCP...
thanks.
>
>> > Ok, so you've based the readahead on the transaction item list
>> > having a next pointer. What I think you should do is turn this into
>> > a readahead queue by moving objects to a new list. i.e.
>> >
>> >     list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
>> >
>> >         case XLOG_RECOVER_PASS2:
>> >             if (ra_qdepth++ >= MAX_QDEPTH) {
>> >                 recover_items(log, trans, &buffer_list, &ra_item_list);
>> >                 ra_qdepth = 0;
>> >             } else {
>> >                 xlog_recover_item_readahead(log, item);
>> >                 list_move_tail(&item->ri_list, &ra_item_list);
>> >             }
>> >             break;
>> >         ...
>> >         }
>> >     }
>> >     if (!list_empty(&ra_item_list))
>> >         recover_items(log, trans, &buffer_list, &ra_item_list);
>> >
>> > I'd suggest that a queue depth somewhere between 10 and 100 will
>> > be necessary to keep enough IO in flight to keep the pipeline full
>> > and prevent recovery from having to wait on IO...
>> Good suggestion, will apply it to next version, thanks.
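To make the queueing idea above concrete, here is a small userspace C sketch,
not kernel code. It assumes a plain array in place of the transaction item
list, and item_readahead(), recover_batch() and recover_with_readahead() are
hypothetical stand-ins that only count what they are asked to do. Unlike the
pseudocode above, it queues every item before checking the depth, so the item
that fills the queue is replayed with its batch. MAX_QDEPTH is set to 4 only
to keep the example small; the thread suggests somewhere between 10 and 100.

```c
#include <stddef.h>

#define MAX_QDEPTH 4	/* illustrative only; the thread suggests 10-100 */

struct ra_stats {
	size_t readaheads;	/* readahead requests issued */
	size_t recovered;	/* items replayed */
	size_t batches;		/* number of batch replays */
};

/* Stand-in for issuing asynchronous readahead on one item. */
static void item_readahead(int item, struct ra_stats *st)
{
	(void)item;
	st->readaheads++;
}

/* Stand-in for replaying a batch whose reads were already issued. */
static void recover_batch(const int *queue, size_t n, struct ra_stats *st)
{
	(void)queue;
	st->recovered += n;
	st->batches++;
}

/* Issue readahead for each item, replaying in batches of MAX_QDEPTH. */
static struct ra_stats recover_with_readahead(const int *items, size_t nitems)
{
	struct ra_stats st = {0, 0, 0};
	int queue[MAX_QDEPTH];
	size_t depth = 0;

	for (size_t i = 0; i < nitems; i++) {
		item_readahead(items[i], &st);
		queue[depth++] = items[i];
		if (depth == MAX_QDEPTH) {
			recover_batch(queue, depth, &st);
			depth = 0;
		}
	}
	/* Replay whatever is still queued after the last full batch. */
	if (depth > 0)
		recover_batch(queue, depth, &st);
	return st;
}
```

For 10 items and a depth of 4, this issues 10 readaheads and replays the
items in three batches (4, 4 and 2 items).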
>
> FWIW, I hacked a quick test of this into your patch here and a depth
> of 100 brought the recovery time down to under 8s. For other
> workloads which have nothing but dirty inodes (like fsmark) a depth
> of 100 drops the recovery time from ~100s to ~25s, and the iop rate
> is peaking at well over 15,000 IOPS. So we definitely want to queue
> up more than a single readahead...
Exciting, I will try it.
By the way, how do you test a workload which has nothing but dirty
dquot objects?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@...morbit.com
--
Regards,
Zhi Yong Wu