linux-kernel - Re: [PATCH v3 0/2] io_uring: add support for IORING_OP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <de7e79a4-ebbd-4536-ea54-365e26646587@kernel.dk>
Date:   Sun, 21 Feb 2021 14:12:56 -0700
From:   Jens Axboe <axboe@...nel.dk>
To:     David Laight <David.Laight@...LAB.COM>,
        'Lennert Buytenhek' <buytenh@...tstofly.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "io-uring@...r.kernel.org" <io-uring@...r.kernel.org>
Cc:     Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 0/2] io_uring: add support for IORING_OP_GETDENTS

On 2/21/21 12:38 PM, David Laight wrote:
> From: Jens Axboe
>> Sent: 20 February 2021 18:29
>>
>> On 2/20/21 10:44 AM, David Laight wrote:
>>> From: Lennert Buytenhek
>>>> Sent: 18 February 2021 12:27
>>>>
>>>> These patches add support for IORING_OP_GETDENTS, which is a new io_uring
>>>> opcode that more or less does an lseek(sqe->fd, sqe->off, SEEK_SET)
>>>> followed by a getdents64(sqe->fd, (void *)sqe->addr, sqe->len).
>>>>
>>>> A dumb test program for IORING_OP_GETDENTS is available here:
>>>>
>>>> 	https://krautbox.wantstofly.org/~buytenh/uringfind-v2.c
>>>>
>>>> This test program does something along the lines of what find(1) does:
>>>> it scans recursively through a directory tree and prints the names of
>>>> all directories and files it encounters along the way -- but then using
>>>> io_uring.  (The io_uring version prints the names of encountered files and
>>>> directories in an order that's determined by SQE completion order, which
>>>> is somewhat nondeterministic and likely to differ between runs.)
>>>>
>>>> On a directory tree with 14-odd million files in it that's on a
>>>> six-drive (spinning disk) btrfs raid, find(1) takes:
>>>>
>>>> 	# echo 3 > /proc/sys/vm/drop_caches
>>>> 	# time find /mnt/repo > /dev/null
>>>>
>>>> 	real    24m7.815s
>>>> 	user    0m15.015s
>>>> 	sys     0m48.340s
>>>> 	#
>>>>
>>>> And the io_uring version takes:
>>>>
>>>> 	# echo 3 > /proc/sys/vm/drop_caches
>>>> 	# time ./uringfind /mnt/repo > /dev/null
>>>>
>>>> 	real    10m29.064s
>>>> 	user    0m4.347s
>>>> 	sys     0m1.677s
>>>> 	#
>>>
>>> While there may be uses for IORING_OP_GETDENTS are you sure your
>>> test is comparing like with like?
>>> The underlying work has to be done in either case, so you are
>>> swapping system calls for code complexity.
>>
>> What complexity?
> 
> Evan adding commands to a list to execute later is 'complexity'.
> As in adding more cpu cycles.

That's a pretty blanket statement. If the list is heavily shared, and
hence contended, yes that's generally true. But it isn't.

>>> I suspect that find is actually doing a stat() call on every
>>> directory entry and that your io_uring example is just believing
>>> the 'directory' flag returned in the directory entry for most
>>> modern filesystems.
>>
>> While that may be true (find doing stat as well), the runtime is
>> clearly dominated by IO. Adding a stat on top would be an extra
>> copy, but no extra IO.
> 
> I'd expect stat() to require the disk inode be read into memory.
> getdents() only requires the data of the directory be read.
> So calling stat() requires a lot more IO.

I actually went and checked instead of guessing, and find isn't doing a
stat by default on the files.

> The other thing I just realises is that the 'system time'
> output from time is completely meaningless for the io_uring case.
> All that processing is done by a kernel thread and I doubt
> is re-attributed to the user process.

For sure, you can't directly compare the sys times. But the actual
runtime is of course directly comparable. The actual example is btrfs,
which heavily offloads to threads as well. So the find case doesn't show
you the full picture either. Note that once the reworked worker scheme
is adopted, sys time will be directly attributed to the original task
and it will be included (for io_uring, not talking about btrfs).

I'm going to ignore the rest of this email as it isn't productive going
down that path. Suffice it to say that ideally _no_ operations will be
using the async offload, that's really only for regular file IO where
you cannot poll for readiness. And even those cases are gradually being
improved to not rely on it, like for async buffered reads. We still
need to handle writes better, and ideally things like this as well. But
that's a bit further out.

-- 
Jens Axboe