lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 08 Apr 2015 11:38:14 -0600
From:	Jens Axboe <axboe@...nel.dk>
To:	Jeff Moyer <jmoyer@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowells of the dirct I/O code.
> The reason is that there are blk_plug's instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as little as 1 I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.
>
> Now, for the case where there is an elevator involved, this doesn't
> really matter too much.  The elevator will keep the I/O around long
> enough for it to be merged.  However, in cases where there is no
> elevator (like blk-mq), I/Os are simply dispatched immediately.
>
> Try this, for example (note I'm using a virtio-blk device, so it's
> using the blk-mq single queue path, though I've also reproduced this
> with the micron p320h):
>
> fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
>
> If you run that on a current kernel, you will get zero merges.  Zero!
> After this patch, you will get many merges (the actual number depends
> on how fast your storage is, obviously), and much better throughput.
> Here are results from my test rig:
>
> Unpatched kernel:
> Read B/W:    283,638 KB/s
> Read Merges: 0
>
> Patched kernel:
> Read B/W:    873,224 KB/s
> Read Merges: 2,046K
>
> I considered several approaches to solving the problem:
> 1)  get rid of the inner-most plugs
> 2)  handle nesting by using only one on-stack plug
> 2a) #2, except use a per-cpu blk_plug struct, which may clean up the
>      code a bit at the expense of memory footprint
>
> Option 1 will be tricky or impossible to do, since inner most plug
> lists are sometimes the only plug lists, depending on the call path.
> Option 2 is what this patch implements.  Option 2a is perhaps a better
> idea, but since I already implemented option 2, I figured I'd post it
> for comments and opinions before rewriting it.
>
> Much of the patch involves modifying call sites to blk_start_plug,
> since its signature is changed.  The meat of the patch is actually
> pretty simple and constrained to block/blk-core.c and
> include/linux/blkdev.h.  The only tricky bits were places where plugs
> were finished and then restarted to flush out I/O.  There, I went
> ahead and exported blk_flush_plug_list and called that directly.
>
> Comments would be greatly appreciated.

It's hard to argue with the increased merging for your case. The task 
plugs did originally work like you changed them to, not flushing until 
the outermost plug was flushed. Unfortunately I don't quite remember why 
I changed them, will have to do a bit of digging to refresh my memory.

For cases where we don't do any merging (like nvme), we always want to 
flush. Well almost, if we start do utilize the batched submission, then 
the plug would still potentially help (just for other reasons than merging).

And agree with Ming, this can be cleaned up substantially. I'd also like 
to see some test results from the other end of the spectrum. Your posted 
cased is clearly based case (we missed tons of merging, now we don't), 
I'd like to see a normal case and a worst case result as well so we have 
an idea of what this would do to latencies.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ