Message-Id: <1262593090.29897.14.camel@localhost>
Date: Mon, 04 Jan 2010 16:18:10 +0800
From: "Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>
To: Corrado Zoccolo <czoccolo@...il.com>
Cc: Jens Axboe <jens.axboe@...cle.com>,
Shaohua Li <shaohua.li@...el.com>,
"jmoyer@...hat.com" <jmoyer@...hat.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote:
> Hi
> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin
> <yanmin_zhang@...ux.intel.com> wrote:
> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote:
> >> Hi Yanmin,
> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin
> >> <yanmin_zhang@...ux.intel.com> wrote:
> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
> >> >> Hi Yanmin,
> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
> >> >> <yanmin_zhang@...ux.intel.com> wrote:
> >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> >> >> > 2.6.33-rc1.
> >> >>
> >> > Thanks for your timely reply. Some comments inlined below.
> >> >
> >> >> Can you compare the performance also with 2.6.31?
> >> > We did. We run the Linux Kernel Performance Tracking project and run many benchmarks when an RC kernel
> >> > is released.
> >> >
> >> > The result of 2.6.31 is quite similar to that of 2.6.32, but the 2.6.30 result is about
> >> > 8% better than the 2.6.31 one.
> >> >
> >> >> I think I understand what causes your problem.
> >> >> 2.6.32, with default settings, handled even random readers as
> >> >> sequential ones to provide fairness. This has benefits on single disks
> >> >> and JBODs, but causes harm on raids.
> >> > I didn't test RAID as the machine with the hardware RAID HBA has crashed. But when we turn on
> >> > hardware RAID in the HBA, we mostly use the noop io scheduler.
> >> I think you should start testing cfq with them, too. From 2.6.33, we
> >> have some big improvements in this area.
> > Great! I once compared cfq and noop on non-raid and raid0 setups. One interesting finding
> > from sequential read testing is that when there are fewer processes reading files on the raid0
> > JBOD, noop on raid0 is pretty good, but when there are lots of processes doing so on a non-raid
> > JBOD, cfq is clearly better. I planned to investigate it, but was too busy with other issues.
> >
> >> >
> >> >> For 2.6.33, we changed the way in which this is handled, restoring the
> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31:
> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
> >> >> struct cfq_queue *cfqq,
> >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> >> >>
> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
> >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> >> >> enable_idle = 0;
> >> >> (compare with 2.6.31:
> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> >> >> (cfqd->hw_tag && CIC_SEEKY(cic)))
> >> >> enable_idle = 0;
> >> >> excluding the sample_valid check, it should be equivalent for you (I
> >> >> assume you have NCQ disks))
> >> >> and we provide fairness for them by servicing all seeky queues
> >> >> together, and then idling before switching to other ones.
> >> > As for function cfq_update_idle_window, you are right. But since
> >> > 2.6.32, CFQ has merged many patches, and the patches impact each other.
> >> >
> >> >>
> >> >> The mmap 64k randreader will have a large seek_mean, resulting in
> >> >> being marked seeky, but will send 16 * 4k sequential requests one
> >> >> after the other, so alternating between those seeky queues will cause
> >> >> harm.
> >> >>
> >> >> I'm working on a new way to compute seekiness of queues, that should
> >> >> fix your issue, correctly identifying those queues as non-seeky (for
> >> >> me, a queue should be considered seeky only if it submits more than 1
> >> >> seeky request per 8 sequential ones).
> >> >>
> >> >> >
> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> >> >> > per partition. There are 8*24 processes in total. randread block size is 64K.
> >> >> >
> >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has
> >> >> > 6GB.
> >> >> >
> >> >> > Bisecting is very unstable. Several patches are involved instead of just one.
> >> >> >
> >> >> >
> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> >> >> > Author: Corrado Zoccolo <czoccolo@...il.com>
> >> >> > Date: Thu Nov 26 10:02:58 2009 +0100
> >> >> >
> >> >> > cfq-iosched: fix corner cases in idling logic
> >> >> >
> >> >> >
> >> >> > This patch introduces a bit less than 20% regression. I reverted the section below
> >> >> > and that part of the regression disappeared. It shows this regression is stable and not
> >> >> > impacted by other patches.
> >> >> >
> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >> >> > return;
> >> >> >
> >> >> > /*
> >> >> > - * still requests with the driver, don't idle
> >> >> > + * still active requests from this queue, don't idle
> >> >> > */
> >> >> > - if (rq_in_driver(cfqd))
> >> >> > + if (cfqq->dispatched)
> >> >> > return;
> >> > Although 5 patches are related to the regression, the line above is quite
> >> > independent. Reverting it consistently improves the result by about
> >> > 20%.
> >> I've looked at your fio script, and it is quite complex,
> > As we have about 40 fio sub-cases, we have a script to create the fio job files from
> > a specific parameter list. So there are some superfluous parameters.
> >
> My point is that there are so many things going on that it is more
> difficult to analyse the issues.
> I prefer looking at one problem at a time, so (initially) removing the
> possibility of queue merging, that Shaohua already investigated, can
> help in spotting the still not-well-understood problem.
Sounds reasonable.
> Could you generate the same script, but with each process accessing
> only one of the files, instead of choosing it at random?
Ok. The new testing starts 8 processes per partition, and every process works
on just one file.
>
> > Another point is we need stable result.
> >
> >> with lot of
> >> things going on.
> >> Let's keep this for last.
> > Ok. But changes like yours mostly reduce the regression.
> >
> >> I've created a smaller test, that already shows some regression:
> >> [global]
> >> direct=0
> >> ioengine=mmap
> >> size=8G
> >> bs=64k
> >> numjobs=1
> >> loops=5
> >> runtime=60
> >> #group_reporting
> >> invalidate=0
> >> directory=/media/hd/cfq-tests
> >>
> >> [job0]
> >> startdelay=0
> >> rw=randread
> >> filename=testfile1
> >>
> >> [job1]
> >> startdelay=0
> >> rw=randread
> >> filename=testfile2
> >>
> >> [job2]
> >> startdelay=0
> >> rw=randread
> >> filename=testfile3
> >>
> >> [job3]
> >> startdelay=0
> >> rw=randread
> >> filename=testfile4
> >>
> >> The attached patches, in particular 0005 (which applies on top of the
> >> for-linus branch of Jens' tree
> >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this
> >> simplified workload.
> > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The
> > regression isn't resolved.
> Can you quantify if there is an improvement, though?
Ok. Because of company policy, I can only post percentages instead of real numbers.
> Please, also include Shaohua's patches.
> I'd like to see the comparison between (always with low_latency set to 0):
> plain 2.6.33
> plain 2.6.33 + shaohua's
> plain 2.6.33 + shaohua's + my patch
> plain 2.6.33 + shaohua's + my patch + rq_in_driver vs dispatched patch.
1) low_latency=0
2.6.32 kernel                                0
2.6.33-rc1                               -0.33
2.6.33-rc1+shaohua                       -0.33
2.6.33-rc1+corrado                        0.03
2.6.33-rc1+corrado+shaohua                0.02
2.6.33-rc1+corrado+shaohua+rq_in_driver   0.01
2) low_latency=1
2.6.32 kernel                                0
2.6.33-rc1                               -0.45
2.6.33-rc1+corrado                       -0.24
2.6.33-rc1+corrado+shaohua               -0.23
2.6.33-rc1+corrado+shaohua+rq_in_driver  -0.23
When low_latency=1, we get the biggest number with kernel 2.6.32.
Compared with the low_latency=0 result, it is about 4% better.
>
> >>
> >> >
> >> >> >
> >> >> This shouldn't affect you if all queues are marked as idle.
> >> > Do you mean using the ionice command to mark it as idle class? I didn't try it.
> >> No. I meant forcing enable_idle = 1, as you were almost doing with
> >> your patch, when cfq_latency was set.
> >> With my above patch, this should not be needed any more, since the
> >> queues should be seen as sequential.
> >>
> >> >
> >> >> Does just
> >> >> your patch:
> >> >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> >> >> > - && CFQQ_SEEKY(cfqq)))
> >> >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> >> >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> >> >> fix most of the regression without touching arm_slice_timer?
> >> > No. To fix the regression completely, I need to apply the above patch plus
> >> > a debug patch. The debug patch just works around the 3 patches reported in
> >> > Shaohua's tiobench regression report. Without the debug patch, the regression
> >> > isn't resolved.
> >>
> >> Jens already merged one of Shaohua's patches, that may fix the problem
> >> with queue combining.
> > I did another test. Applying my debug patch + the low_latency patch, but with
> > Shaohua's 2 patches (improve merge and split), the regression disappears.
> >
> >>
> >> > Below is the debug patch.
> >> > diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c
> >> > --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800
> >> > +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800
> >> > @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd
> >> > cfqq->slice_start = jiffies;
> >> > cfqq->slice_end = jiffies + slice;
> >> > cfqq->allocated_slice = slice;
> >> > +/*YMZHANG*/
> >> > + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
> >> > +
> >> This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0
> >> > cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
> >> > }
> >> >
> >> > @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c
> >> > /*
> >> > * still active requests from this queue, don't idle
> >> > */
> >> > - if (cfqq->dispatched)
> >> > + //if (cfqq->dispatched)
> >> > + if (rq_in_driver(cfqd))
> >> > return;
> >> >
> >> > /*
> >> > @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q
> >> > new_cfqq = __cfqq;
> >> > }
> >> >
> >> > + /* YMZHANG debug */
> >> > + return;
> >> > +
> >> This should be partially addressed by Shaohua's patch merged in Jens' tree.
> >> But note that your 8 processes, can randomly start doing I/O on the
> >> same file, so merging those queues is sometimes reasonable.
> > Another reason is that I start 8 processes per partition and every disk has 2 partitions,
> > so there are 16 processes per disk. With another JBOD, I use one partition per disk,
> > and the regression is only 8%.
> With half of the processes, time slices are higher, and the disk cache
> can do a better job when servicing interleaved sequential requests.
> >
> > >From this point, can CFQ do not merge request queues which access different partitions?
> (puzzled: I didn't write this, and can't find a message in the thread
> with this question.)
My email client is Evolution and sometimes it adds > unexpectedly.
> > As you know, it's unusual for a process to access files across partitions. The io scheduler
> > is at a low layer which doesn't know about partitions.
> CFQ bases decision on distance between requests, and requests going to
> different partitions will have much higher distance. So the associated
> queues will be more likely marked as seeky.
Right. Thanks for your explanation.
> >
> >
> >> The patch to split them quickly was still not merged, though, so you
> >> will still see some regression due to this. In my simplified job file,
> >> I removed the randomness to make sure this cannot happen.
> >>
> >> > process_refs = cfqq_process_refs(cfqq);
> >> > /*
> >> > * If the process for the cfqq has gone away, there is no
> >> >
> >> >
> >> >>
> >> >> I guess
> >> >> > 5db5d64277bf390056b1a87d0bb288c8b8553f96.
> >> >> will still introduce a 10% regression, but this is needed to improve
> >> >> latency, and you can just disable low_latency to avoid it.
> >> > You are right. I did a quick test. With my patch + reverting 2 patches and keeping
> >> > 5db5d64, the regression is about 20%.
> >> >
> >> > But low_latency=0 doesn't work as we imagined. With the patch + reverting 2 patches
> >> > and keeping 5db5d64 while setting low_latency=0, the regression is still there. One
> >> > reason is my patch doesn't work when low_latency=0.
> >> Right. You can try with my patch, instead, that doesn't depend on
> >> low_latency, and set it to 0 to remove this performance degradation.
> >> My results:
> >> 2.6.32.2:
> >> READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s,
> >> mint=60019msec, maxt=60067msec
> >>
> >> 2.6.33 - jens:
> >> READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s,
> >> mint=60004msec, maxt=60032msec
> >>
> >> 2.6.33 - jens + my patches :
> >> READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s,
> >> mint=60003msec, maxt=60072msec
> >>
> >> 2.6.33 - jens + my patches + low_lat = 0:
> >> READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s,
> >> mint=60027msec, maxt=60087msec