linux-kernel - Re: CFQ read performance regression

Message-ID: <s2q4e5e476b1004241336lbd071624ke489350093ca6e1a@mail.gmail.com>
Date:	Sat, 24 Apr 2010 22:36:48 +0200
From:	Corrado Zoccolo <czoccolo@...il.com>
To:	Miklos Szeredi <mszeredi@...e.cz>, Vivek Goyal <vgoyal@...hat.com>
Cc:	Jens Axboe <jens.axboe@...cle.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Jan Kara <jack@...e.cz>, Suresh Jayaraman <sjayaraman@...e.de>
Subject: Re: CFQ read performance regression

On Fri, Apr 23, 2010 at 12:57 PM, Miklos Szeredi <mszeredi@...e.cz> wrote:
> On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
>> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
>> > Hi Miklos,
>> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@...e.cz> wrote:
>> > > Jens, Corrado,
>> > >
>> > > Here's a graph showing the number of issued but not yet completed
>> > > requests versus time for CFQ and NOOP schedulers running the tiobench
>> > > benchmark with 8 threads:
>> > >
>> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
>> > >
>> > > It shows pretty clearly that the performance problem is because CFQ is
>> > > not issuing enough requests to fill the bandwidth.
>> > >
>> > > Is this the correct behavior of CFQ or is this a bug?
>> > This is the expected behavior from CFQ, even if it is not optimal,
>> > since we aren't able to identify multi-spindle disks yet.
>>
>> In the past we were of the opinion that for sequential workload multi spindle
>> disks will not matter much as readahead logic (in OS and possibly in
>> hardware also) will help. For random workload we anyway don't idle on the
>> single cfqq so it is fine. But my tests now seem to be telling a different
>> story.
>>
>> I also have one FC link to one of the HP EVAs and I am running an
>> increasing number of sequential readers to see if throughput goes up as
>> the number of readers goes up. The results are with noop and cfq. I do
>> flush OS caches across the runs but I have no control over caching on the
>> HP EVA.
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=cfq     Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   135366         59024          0              0
>> bsr       1   2   124256         126808         0              0
>> bsr       1   4   132921         341436         0              0
>> bsr       1   8   129807         392904         0              0
>> bsr       1   16  129988         773991         0              0
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=noop    Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   126187         95272          0              0
>> bsr       1   2   185154         72908          0              0
>> bsr       1   4   224622         88037          0              0
>> bsr       1   8   285416         115592         0              0
>> bsr       1   16  348564         156846         0              0
>>
>
> These numbers are very similar to what I got.
>
>> So in the case of NOOP, throughput shot up to 348MB/s but CFQ remains more
>> or less constant, at about 130MB/s.
>>
>> So at least in this case, a single sequential CFQ queue is not keeping the
>> disk busy enough.
>>
>> I am wondering why my testing results were different in the past. Maybe
>> it was a different piece of hardware and behavior varies across hardware?
>
> Probably.  I haven't seen this type of behavior on other hardware.
>
>> Anyway, if that's the case, then we probably need to allow IO from
>> multiple sequential readers and keep a watch on throughput. If throughput
>> drops then reduce the number of parallel sequential readers. Not sure how
>> much code that is, but with multiple cfqqs going in parallel, the ioprio
>> logic will more or less stop working in CFQ (on multi-spindle hardware).
Hi Vivek,
I tried to implement exactly what you are proposing, see the attached patches.
I leverage the queue merging features to let multiple cfqqs share the
disk in the same timeslice.
I changed the queue split code to trigger on throughput drop instead
of on seeky pattern, so diverging queues can remain merged if they
have good throughput. Moreover, I measure the max bandwidth reached by
single queues and merged queues (you can see the values in the
bandwidth sysfs file).
If merged queues can outperform non-merged ones, the queue merging
code will try to opportunistically merge together queues that cannot
submit enough requests to fill half of the NCQ slots. I'd like to know
if you can see any improvement from this on your hardware. There
are some magic numbers in the code; you may want to try tuning them.
Note that, since the opportunistic queue merging will start happening
only after merged queues have been shown to reach higher bandwidth than
non-merged queues, you should use the disk for a while before trying
the test (and you can check sysfs), or the merging will not happen.
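
The decision logic described above could be sketched roughly as follows.
This is only an illustration: every name here (bw_stats, bw_record,
should_merge, should_split) and the split_pct threshold are hypothetical
and are not taken from the attached patches, which may differ in detail.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping: best bandwidth observed so far for each kind
 * of queue, in whatever unit the caller measures (e.g. KB/s). */
struct bw_stats {
	unsigned long max_single_bw;	/* non-merged queues */
	unsigned long max_merged_bw;	/* merged queues */
};

/* Record one bandwidth sample, keeping only the maximum per kind. */
static void bw_record(struct bw_stats *s, unsigned long bw, bool merged)
{
	unsigned long *slot = merged ? &s->max_merged_bw : &s->max_single_bw;

	if (bw > *slot)
		*slot = bw;
}

/* Merge opportunistically only if merged queues have already been seen
 * to outperform single queues, and this queue cannot fill half of the
 * NCQ slots on its own. */
static bool should_merge(const struct bw_stats *s, unsigned int queued,
			 unsigned int ncq_slots)
{
	if (s->max_merged_bw <= s->max_single_bw)
		return false;
	return queued < ncq_slots / 2;
}

/* Split on a throughput drop rather than on a seeky pattern. split_pct
 * is an assumed tunable: split when the current bandwidth falls below
 * that percentage of the best merged bandwidth. */
static bool should_split(const struct bw_stats *s, unsigned long cur_bw,
			 unsigned int split_pct)
{
	return cur_bw * 100 < s->max_merged_bw * split_pct;
}
```

With this shape, should_merge() keeps returning false until at least one
merged sample has beaten the best single-queue sample, which is why the
disk needs to be used for a while before the test.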

>
> Have you tested on older kernels?  Around 2.6.16 it seemed to allow more
> parallel reads, but that might have been just accidental (due to I/O
> being submitted in a different pattern).
Is the BW for a single reader also better on 2.6.16, or is the
improvement only seen with more concurrent readers?

Thanks,
Corrado
>
> Thanks,
> Miklos
>
>

Download attachment "0001-cfq-iosched-introduce-bandwidth-measurement.patch" of type "application/octet-stream" (3522 bytes)

Download attachment "0002-cfq-iosched-optimistic-queue-merging.patch" of type "application/octet-stream" (3158 bytes)