Message-ID: <CAEH94Lh-UCCEs7hQi_t5v+X+ER1DH9dCtjr6e9GVNX5KJ-f1hQ@mail.gmail.com>
Date: Fri, 26 Jul 2013 14:36:15 +0800
From: Zhi Yong Wu <zwu.kernel@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: xfstests <xfs@....sgi.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
linux-kernel mlist <linux-kernel@...r.kernel.org>,
Zhi Yong Wu <wuzhy@...ux.vnet.ibm.com>
Subject: Re: [PATCH] xfs: introduce object readahead to log recovery
Dave,
All comments look good to me and will be applied in the next version, thanks a lot.
On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@...morbit.com> wrote:
> On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@...il.com wrote:
>> From: Zhi Yong Wu <wuzhy@...ux.vnet.ibm.com>
>>
>> Log recovery can take a long time because it is single threaded and
>> bound by read latency. Most of that time is spent waiting for read IO
>> to complete, so introducing object readahead into log recovery should
>> noticeably reduce the recovery time.
>>
>> In the dirty log case below:
>> data device: 0xfd10
>> log device: 0xfd10 daddr: 20480032 length: 20480
>>
>> log tail: 7941 head: 11077 state: <DIRTY>
>
> That's only a small log (10MB). As I've said on irc, readahead won't
Yeah, it is a 10MB log, but how do you calculate that from the info above?
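(Is it that the length is printed in 512-byte basic blocks, i.e.
20480 * 512 bytes = 10MB?)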
> show any real difference on this and you need to be testing with
> large logs.
>
>>
>> The dirty ratio is about 15%. I am trying to run tests on a larger
>> and dirtier filesystem.
>>
>> Log recovery time stat:
>>
>>                 w/o this patch    w/ this patch
>> real                 0m1.051s         0m0.965s
>> sys                  0m0.033s         0m0.035s
>> iowait               0m1.018s         0m0.930s
>
> Well, it's not made much of a difference there. That's probably
> within the noise of repeated log recovery runs.
>
> My simple test is:
>
> $ cat recovery_time.sh
> #!/bin/bash
>
> cd /home/dave/src/compilebench-0.6/
>
> mkfs.xfs -f /dev/vdc;
> mount /dev/vdc /mnt/scratch
> chmod 777 /mnt/scratch;
> ./compilebench --no-sync -D /mnt/scratch &
> sleep 55
> /home/dave/src/xfstests-dev/src/godown /mnt/scratch
> umount /mnt/scratch
> xfs_logprint -t /dev/vdc |head -20
> time mount /dev/vdc /mnt/scratch
> umount /mnt/scratch
> $
>
> And the recovery time from this is between 15-17s:
>
> ....
> log device: 0xfd20 daddr: 107374182032 length: 4173824
> ^^^^^^^ almost 2GB
> log tail: 19288 head: 264809 state: <DIRTY>
> ....
> real 0m17.913s
> user 0m0.000s
> sys 0m2.381s
>
> And it runs at 3000-4000 read IOPS for most of that time. It's largely IO
> bound, even on SSDs.
>
> With your patch:
>
> log tail: 35871 head: 308393 state: <DIRTY>
> real 0m12.715s
> user 0m0.000s
> sys 0m2.247s
>
> And it peaked at ~5000 read IOPS.
How do you know the read IOPS peaked at ~5000?
>
> It's definitely an improvement, but there's a lot of dead time still
> spent waiting for IO that we should be able to remove. Let's have a
> look at the code...
>
>
>> +STATIC void
>> +xlog_recover_inode_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item *item)
>> +{
>> + xfs_inode_log_format_t *in_f;
>> + xfs_mount_t *mp = log->l_mp;
>> + int error;
>> + int need_free = 0;
>> +
>> + if (item->ri_buf[0].i_len == sizeof(xfs_inode_log_format_t)) {
>> + in_f = item->ri_buf[0].i_addr;
>> + } else {
>> + in_f = kmem_alloc(sizeof(xfs_inode_log_format_t), KM_SLEEP);
>> + need_free = 1;
>> + error = xfs_inode_item_format_convert(&item->ri_buf[0], in_f);
>> + if (error)
>> + goto error;
>> + }
>
> I'd just put the conversion buffer on stack and avoid the need to
> alloc/free memory here. There's plenty of stack space available
> during recovery here, so let's use it to keep the overhead of
> readahead down.
OK, I will use a buffer on the stack.
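Something like this, perhaps (a rough sketch only; I am reusing the
field names from the patch, and the readahead call at the end mirrors
the buffer read done by the non-readahead inode recovery path):

STATIC void
xlog_recover_inode_ra_pass2(
	struct xlog			*log,
	struct xlog_recover_item	*item)
{
	xfs_inode_log_format_t		ilf_buf;	/* conversion buffer on stack */
	xfs_inode_log_format_t		*in_f;
	xfs_mount_t			*mp = log->l_mp;

	if (item->ri_buf[0].i_len == sizeof(xfs_inode_log_format_t)) {
		in_f = item->ri_buf[0].i_addr;
	} else {
		/* old-format item: convert into the on-stack buffer */
		in_f = &ilf_buf;
		memset(in_f, 0, sizeof(*in_f));
		if (xfs_inode_item_format_convert(&item->ri_buf[0], in_f))
			return;
	}

	/* async readahead of the inode buffer; no alloc, no blocking read */
	xfs_buf_readahead(mp->m_ddev_targp, in_f->ilf_blkno,
			  in_f->ilf_len, &xfs_inode_buf_ops);
}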
>
>> +STATIC void
>> +xlog_recover_dquot_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item *item)
>> +{
>> + xfs_mount_t *mp = log->l_mp;
>> + xfs_buf_t *bp;
>> + struct xfs_disk_dquot *recddq;
>> + int error;
>> + xfs_dq_logformat_t *dq_f;
>> + uint type;
>> +
>> +
>> + if (mp->m_qflags == 0)
>> + return;
>> +
>> + recddq = item->ri_buf[1].i_addr;
>> + if (recddq == NULL)
>> + return;
>> + if (item->ri_buf[1].i_len < sizeof(xfs_disk_dquot_t))
>> + return;
>> +
>> + type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP);
>> + ASSERT(type);
>> + if (log->l_quotaoffs_flag & type)
>> + return;
>> +
>> + dq_f = item->ri_buf[0].i_addr;
>> + ASSERT(dq_f);
>> + error = xfs_qm_dqcheck(mp, recddq, dq_f->qlf_id, 0, XFS_QMOPT_DOWARN,
>> + "xlog_recover_dquot_ra_pass2 (log copy)");
>> + if (error)
>> + return;
>> + ASSERT(dq_f->qlf_len == 1);
>> +
>> + error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dq_f->qlf_blkno,
>> + XFS_FSB_TO_BB(mp, dq_f->qlf_len), 0, &bp,
>> + NULL);
>> + if (!error)
>> + xfs_buf_relse(bp);
>> +}
>
> That's not doing readahead - that's an integrity check followed by a
> blocking read. Shouldn't it just be calling xfs_buf_readahead() on
> dq_f->qlf_blkno/dq_f->qlf_len, after checking whether it had been
> cancelled?
Ah, I missed that it is a blocking read; will fix it.
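The tail would become something like this, maybe (sketch only; I would
drop the xfs_qm_dqcheck() call since recovery proper re-reads and
verifies the buffer anyway, and pass a NULL verifier as nothing is
wired up for the quota case here yet):

	dq_f = item->ri_buf[0].i_addr;
	ASSERT(dq_f);
	ASSERT(dq_f->qlf_len == 1);

	/* async readahead instead of the blocking xfs_trans_read_buf() */
	xfs_buf_readahead(mp->m_ddev_targp, dq_f->qlf_blkno,
			  XFS_FSB_TO_BB(mp, dq_f->qlf_len), NULL);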
>
>> STATIC int
>> xlog_recover_commit_pass1(
>> struct xlog *log,
>> @@ -3140,10 +3255,14 @@ xlog_recover_commit_pass2(
>> struct xlog *log,
>> struct xlog_recover *trans,
>> struct list_head *buffer_list,
>> - struct xlog_recover_item *item)
>> + struct xlog_recover_item *item,
>> + struct xlog_recover_item *next_item)
>> {
>> trace_xfs_log_recover_item_recover(log, trans, item, XLOG_RECOVER_PASS2);
>>
>> + if (next_item)
>> + xlog_recover_ra_pass2(log, next_item);
>> +
>
> Ah, that explains the limited improvement - it's only doing
> readahead of a single item at a time. IOWs, we are completing
> recovery of the current object before IO completion of the next
> object has occurred. This generally means the read-ahead queue is
Yes.
> not deep enough...
>
>> switch (ITEM_TYPE(item)) {
>> case XFS_LI_BUF:
>> return xlog_recover_buffer_pass2(log, buffer_list, item);
>> @@ -3181,7 +3300,7 @@ xlog_recover_commit_trans(
>> int pass)
>> {
>> int error = 0, error2;
>> - xlog_recover_item_t *item;
>> + xlog_recover_item_t *item, *next_item, *temp_item;
>> LIST_HEAD (buffer_list);
>>
>> hlist_del(&trans->r_list);
>> @@ -3190,14 +3309,18 @@ xlog_recover_commit_trans(
>> if (error)
>> return error;
>>
>> - list_for_each_entry(item, &trans->r_itemq, ri_list) {
>> + list_for_each_entry_safe(item, temp_item, &trans->r_itemq, ri_list) {
>> switch (pass) {
>> case XLOG_RECOVER_PASS1:
>> error = xlog_recover_commit_pass1(log, trans, item);
>> break;
>> case XLOG_RECOVER_PASS2:
>> + if (&temp_item->ri_list != &trans->r_itemq)
>> + next_item = temp_item;
>> + else
>> + next_item = NULL;
>> error = xlog_recover_commit_pass2(log, trans,
>> - &buffer_list, item);
>> + &buffer_list, item, next_item);
>
> Ok, so you've based the readahead on the transaction item list
> having a next pointer. What I think you should do is turn this into
> a readahead queue by moving objects to a new list. i.e.
>
> list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
>
> case XLOG_RECOVER_PASS2:
> if (ra_qdepth++ >= MAX_QDEPTH) {
> recover_items(log, trans, &buffer_list, &ra_item_list);
> ra_qdepth = 0;
> } else {
> xlog_recover_item_readahead(log, item);
> list_move_tail(&item->ri_list, &ra_item_list);
> }
> break;
> ...
> }
> }
> if (!list_empty(&ra_item_list))
> recover_items(log, trans, &buffer_list, &ra_item_list);
>
> I'd suggest that a queue depth somewhere between 10 and 100 will
> be necessary to keep enough IO in flight to keep the pipeline full
> and prevent recovery from having to wait on IO...
Good suggestion; I will apply it in the next version, thanks.
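So the pass 2 arm would end up roughly like this, if I follow (sketch
only; recover_items() and MAX_QDEPTH are placeholders from your
pseudocode, and the actual queue depth still needs to be picked):

	LIST_HEAD		(ra_item_list);
	int			ra_qdepth = 0;

	list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
		switch (pass) {
		...
		case XLOG_RECOVER_PASS2:
			if (ra_qdepth++ >= MAX_QDEPTH) {
				/* queue full: recover all queued items */
				error = recover_items(log, trans,
						&buffer_list, &ra_item_list);
				ra_qdepth = 0;
			} else {
				/* issue readahead, park item on RA queue */
				xlog_recover_ra_pass2(log, item);
				list_move_tail(&item->ri_list,
						&ra_item_list);
			}
			break;
		}
	}
	if (!list_empty(&ra_item_list))
		error = recover_items(log, trans, &buffer_list,
				      &ra_item_list);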
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@...morbit.com
--
Regards,
Zhi Yong Wu