Date:	Fri, 01 Jan 2010 18:12:10 +0800
From:	"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>
To:	Corrado Zoccolo <czoccolo@...il.com>
Cc:	Jens Axboe <jens.axboe@...cle.com>,
	Shaohua Li <shaohua.li@...el.com>,
	"jmoyer@...hat.com" <jmoyer@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
> Hi Yanmin,
> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
> <yanmin_zhang@...ux.intel.com> wrote:
> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> > 2.6.33-rc1.
> 
Thanks for your timely reply. Some comments are inline below.

> Can you compare the performance also with 2.6.31?
We did. We run a Linux kernel performance tracking project and run many benchmarks whenever
an RC kernel is released.

The 2.6.31 result is quite similar to the 2.6.32 one, but 2.6.30 is about
8% better than 2.6.31.

> I think I understand what causes your problem.
> 2.6.32, with default settings, handled even random readers as
> sequential ones to provide fairness. This has benefits on single disks
> and JBODs, but causes harm on raids.
I didn't test RAID, as the machine with the hardware RAID HBA has crashed. When we do enable
hardware RAID on the HBA, we mostly use the noop I/O scheduler.

> For 2.6.33, we changed the way in which this is handled, restoring the
> enable_idle = 0 for seeky queues as it was in 2.6.31:
> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
> struct cfq_queue *cfqq,
>        enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> 
>        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -           (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
> +           (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
>                enable_idle = 0;
> (compare with 2.6.31:
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>             (cfqd->hw_tag && CIC_SEEKY(cic)))
>                 enable_idle = 0;
> excluding the sample_valid check, it should be equivalent for you (I
> assume you have NCQ disks))
> and we provide fairness for them by servicing all seeky queues
> together, and then idling before switching to other ones.
As for function cfq_update_idle_window, you are right. But since
2.6.32, CFQ has merged many patches, and those patches interact with each other.

> 
> The mmap 64k randreader will have a large seek_mean, resulting in
> being marked seeky, but will send 16 * 4k sequential requests one
> after the other, so alternating between those seeky queues will cause
> harm.
> 
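A minimal sketch of that access pattern, for reference (file name, file size and loop
count are made up; this mirrors what fio's mmap engine does with bs=64k):

/*
 * Each iteration picks a random 64KB-aligned offset and copies 64KB
 * out of the mapping; the copy faults in 16 consecutive 4KB pages,
 * so the device sees short sequential bursts at random offsets.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t fsz = 1UL << 30;		/* 1GB test file */
	const size_t bsz = 64 * 1024;		/* 64KB "block size" */
	static char buf[64 * 1024];
	int i, fd = open("testfile", O_RDONLY);
	char *map;

	if (fd < 0)
		return 1;
	map = mmap(NULL, fsz, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	for (i = 0; i < 100000; i++) {
		size_t off = (random() % (fsz / bsz)) * bsz;
		memcpy(buf, map + off, bsz);	/* 16 x 4KB page faults */
	}

	munmap(map, fsz);
	close(fd);
	return 0;
}

With 8*24 such processes running at once, CFQ sees many queues that look seeky by
seek_mean but are locally sequential.
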
> I'm working on a new way to compute seekiness of queues, that should
> fix your issue, correctly identifying those queues as non-seeky (for
> me, a queue should be considered seeky only if it submits more than 1
> seeky requests for 8 sequential ones).
> 
> >
> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> > per partitions. There are 8*24 processes totally. randread block size is 64K.
> >
> > We found the regression on 2 machines. One machine has 8GB memory and the other has
> > 6GB.
> >
> > Bisect is very unstable. The related patches are many instead of just one.
> >
> >
> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> > Author: Corrado Zoccolo <czoccolo@...il.com>
> > Date:   Thu Nov 26 10:02:58 2009 +0100
> >
> >    cfq-iosched: fix corner cases in idling logic
> >
> >
> > This patch introduces about less than 20% regression. I just reverted below section
> > and this part regression disappear. It shows this regression is stable and not impacted
> > by other patches.
> >
> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >                return;
> >
> >        /*
> > -        * still requests with the driver, don't idle
> > +        * still active requests from this queue, don't idle
> >         */
> > -       if (rq_in_driver(cfqd))
> > +       if (cfqq->dispatched)
> >                return;
Although 5 patches are related to the regression, the above change is quite
independent. Reverting it consistently improves the result by about 20%.

> >
> This shouldn't affect you if all queues are marked as idle.
Do you mean using the ionice command to put them in the idle class? I didn't try it.
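(For reference, putting a process into the idle class by hand would be
"ionice -c 3 -p <pid>", or the ioprio_set() syscall it wraps; a minimal sketch,
with the constants copied from the kernel's ioprio.h since glibc has no wrapper:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_CLASS_SHIFT	13

int main(void)
{
	/* who == 0 means the calling process */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT)) {
		perror("ioprio_set");
		return 1;
	}
	return 0;
}
)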

>  Does just
> your patch:
> > -           (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> > -            && CFQQ_SEEKY(cfqq)))
> > +           (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> > +               sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> fix most of the regression without touching arm_slice_timer?
No. To fix the regression completely, I need to apply the above patch plus
a debug patch. The debug patch just works around the 3 patches reported in
Shaohua's tiobench regression report. Without the debug patch, the regression
isn't resolved.
Below is the debug patch. It restores the priority-based slice length in
cfq_set_prio_slice, switches cfq_arm_slice_timer back to checking rq_in_driver(),
and disables queue merging in cfq_setup_merge.
diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c	2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c	2009-12-30 17:12:28.000000000 +0800
@@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd
 	cfqq->slice_start = jiffies;
 	cfqq->slice_end = jiffies + slice;
 	cfqq->allocated_slice = slice;
+/*YMZHANG*/
+	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
 
@@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c
 	/*
 	 * still active requests from this queue, don't idle
 	 */
-	if (cfqq->dispatched)
+	//if (cfqq->dispatched)
+	if (rq_in_driver(cfqd))
 		return;
 
 	/*
@@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q
 		new_cfqq = __cfqq;
 	}
 
+	/* YMZHANG debug */
+	return;
+
 	process_refs = cfqq_process_refs(cfqq);
 	/*
 	 * If the process for the cfqq has gone away, there is no


> 
> I guess
> > 5db5d64277bf390056b1a87d0bb288c8b8553f96.
> will still introduce a 10% regression, but this is needed to improve
> latency, and you can just disable low_latency to avoid it.
You are right. I did a quick test. With my patch applied, the 2 patches reverted, and
5db5d64 kept, the regression is about 20%.

But low_latency=0 doesn't work the way we imagined. With my patch applied, the 2 patches
reverted, and 5db5d64 kept, setting low_latency=0 still leaves the regression there. One
reason is that my patch has no effect when low_latency=0.

> 
> Thanks,
> Corrado
I've attached the fio job file for your reference.
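(This isn't the attached file itself, but a job along these lines matches the scenario
described above; the paths are made up, and the real file would have one such section
per test file across all 24 partitions:

[global]
ioengine=mmap
rw=randread
bs=64k
size=1g

[p1-f1]
filename=/mnt/p1/f1

[p1-f2]
filename=/mnt/p1/f2

; ... 8 files per partition, 24 partitions, 8*24 jobs in total
)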

I got a cold and will continue to work on it next week.

Yanmin



View attachment "fio_randread_job_file" of type "text/plain" (24675 bytes)
