linux-ext4 - Re: [PATCH 0/8 v2] Non-blocking AIO

Message-ID: <334fb384-438b-9929-d92f-af1e598d4650@scylladb.com>
Date:   Mon, 6 Mar 2017 20:50:37 +0200
From:   Avi Kivity <avi@...lladb.com>
To:     Jens Axboe <axboe@...nel.dk>, Jan Kara <jack@...e.cz>
Cc:     Goldwyn Rodrigues <rgoldwyn@...e.de>, jack@...e.com,
        hch@...radead.org, linux-fsdevel@...r.kernel.org,
        linux-block@...r.kernel.org, linux-btrfs@...r.kernel.org,
        linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org
Subject: Re: [PATCH 0/8 v2] Non-blocking AIO

On 03/06/2017 08:27 PM, Jens Axboe wrote:
> On 03/06/2017 11:17 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>>>> this information to the workqueue.
>>>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>>>> pending work items (and IOs).
>>>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>>>> resources needed.
>>>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>>>> the caller having flagged support for that change of behavior would
>>>>>>> be problematic.
>>>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>>>> maxevents?
>>>>> It's a setup thing. We check these limits when someone creates an IO
>>>>> context, and carve out the specified entries from our global pool. Then
>>>>> we free those "resources" when the io context is freed.
>>>>>
>>>>> Right now I can set up an IO context with 1000 entries on it, yet that
>>>>> number has NO bearing on when io_submit() would potentially block or
>>>>> return EAGAIN.
>>>>>
>>>>> We can have a huge gap on the intent signaled by io context setup, and
>>>>> the reality imposed by what actually happens on the IO submission side.
>>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>>> EAGAIN?
>>>>
>>>> Just tested it, and maxevents is not respected for this:
>>>>
>>>> io_setup(1, [0x7fc64537f000])           = 0
>>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>>
>>>> which is unexpected, to me.
>>> ioctx_alloc()
>>> {
>>>           [...]
>>>
>>>           /*
>>>            * We keep track of the number of available ringbuffer slots, to prevent
>>>            * overflow (reqs_available), and we also use percpu counters for this.
>>>            *
>>>            * So since up to half the slots might be on other cpu's percpu counters
>>>            * and unavailable, double nr_events so userspace sees what they
>>>            * expected: additionally, we move req_batch slots to/from percpu
>>>            * counters at a time, so make sure that isn't 0:
>>>            */
>>>           nr_events = max(nr_events, num_possible_cpus() * 4);
>>>           nr_events *= 2;
>>> }
>> On a 4-lcore desktop:
>>
>> io_setup(1, [0x7fc210041000])           = 0
>> io_submit(0x7fc210041000, 10000, [big array]) = 126
>> io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource
>> temporarily unavailable)
>>
>> so, the user should already expect EAGAIN from io_submit() due to
>> resource limits.  I'm sure the check could be tightened so that if we do
>> have to use a workqueue, we respect the user's limit rather than some
>> inflated number.
> This is why I previously said that the 1000 requests you potentially
> ask for when setting up your IO context have NOTHING to do with when you
> will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
> limit that it itself has set is existing behavior and it certainly makes
> sense. And it's an easily handled condition, since the app can just
> back off and wait/reap completion events.

Every time I used aio, I considered maxevents to be the maximum number 
of in-flight requests for that queue, and observed this limit 
religiously.  It's possible others don't.
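
A minimal sketch of what observing the limit can look like in user space, assuming the application keeps its own count of in-flight iocbs. The struct and helper names below are hypothetical; they are not part of libaio or the kernel ABI. The caller would pass the clamped count as nr to io_submit() and report back what io_submit() and io_getevents() actually returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space bookkeeping; not part of libaio or the kernel ABI. */
struct aio_budget {
    size_t maxevents;  /* value the application passed to io_setup() */
    size_t in_flight;  /* iocbs accepted by io_submit() but not yet reaped */
};

/* How many more iocbs may be submitted without exceeding maxevents. */
static size_t aio_budget_room(const struct aio_budget *b)
{
    return b->maxevents - b->in_flight;
}

/* Clamp a wanted batch size to the remaining room. */
static size_t aio_budget_clamp(const struct aio_budget *b, size_t wanted)
{
    size_t room = aio_budget_room(b);
    return wanted < room ? wanted : room;
}

/* Record that io_submit() accepted `accepted` iocbs. */
static void aio_budget_submitted(struct aio_budget *b, size_t accepted)
{
    assert(accepted <= aio_budget_room(b));
    b->in_flight += accepted;
}

/* Record that io_getevents() returned `reaped` completions. */
static void aio_budget_reaped(struct aio_budget *b, size_t reaped)
{
    assert(reaped <= b->in_flight);
    b->in_flight -= reaped;
}
```

With this kind of accounting the application never relies on the kernel's inflated limit, so it never sees the resource-limit EAGAIN at all.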

> But if we allow EAGAIN to bubble up from block request submission, then
> that's a change in behavior. This can happen without the app having any
> pending IO against that IO context, hence we can return EAGAIN to the
> app that then has no reasonable way to handle that condition.
>

For sure (and it's a different EAGAIN -- it's tied to the iocb, not 
request submission).  But we do have an upper bound for the number of 
concurrent requests, even if inflated, so having the kernel convert a 
blocking iocb into a workqueue item does not allow userspace to exploit 
the kernel.
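
To illustrate how inflated the bound can get, here is a sketch of the arithmetic. The first helper follows the ioctx_alloc() lines quoted above. The second is an assumption that is not taken from this thread: that the ring has a 32-byte header and 32-byte events, gets two extra entries, is rounded up to whole 4 KiB pages, and keeps one slot unused. Under those assumptions the numbers match the 126 seen in the trace.

```c
#include <assert.h>

/* Assumed sizes; not stated anywhere in this thread. */
enum {
    PAGE_BYTES = 4096,
    RING_HEADER_BYTES = 32,
    IO_EVENT_BYTES = 32,
};

/* The max-then-double step from the quoted ioctx_alloc() snippet. */
static unsigned inflate_nr_events(unsigned nr_events, unsigned possible_cpus)
{
    unsigned floor = possible_cpus * 4;

    assert(possible_cpus > 0);
    if (nr_events < floor)
        nr_events = floor;
    return nr_events * 2;
}

/*
 * Assumed ring sizing: two spare entries, round up to whole pages,
 * fill those pages with events, and keep one slot unused.
 */
static unsigned usable_slots(unsigned nr_events)
{
    unsigned bytes = RING_HEADER_BYTES + (nr_events + 2) * IO_EVENT_BYTES;
    unsigned pages = (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
    unsigned ring_events =
        (pages * PAGE_BYTES - RING_HEADER_BYTES) / IO_EVENT_BYTES;

    return ring_events - 1;
}
```

For io_setup(1, ...) with 4 possible CPUs, inflate_nr_events(1, 4) gives 32, and usable_slots(32) gives 126 under the stated assumptions.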

We could limit the number of workqueue submissions to maxevents, and 
queue anything between maxevents and (maxevents * inflation_factor) 
using a regular queue.  So the intent of maxevents is respected, and 
applications that overflow it are not regressed.

   if iocb would overflow inflated maxevents:
       io_submit returns EAGAIN
   elseif iocb can be submitted asynchronously:
       do that
   elseif number of iocbs running in workqueues < maxevents:
       push to a workqueue
   else
       queue somewhere; when a work_item completes it can pick up an
       iocb from the queue

This enables aio for all filesystems, and requires neither lots of idle
thread pools when the filesystem can submit asynchronously nor useless
syscalls when it can't.  It's a lot more work for the kernel, but results
in a tighter and simpler interface.
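
The decision order in the pseudocode above can be written out as a small C function. This is only a sketch of the policy, not kernel code: the enum, the struct and its field names are hypothetical, and inflated_max stands for maxevents * inflation_factor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a newly submitted iocb ends up (hypothetical names). */
enum iocb_route {
    ROUTE_EAGAIN,     /* io_submit returns EAGAIN */
    ROUTE_ASYNC,      /* submitted asynchronously, as today */
    ROUTE_WORKQUEUE,  /* pushed to a workqueue */
    ROUTE_BACKLOG,    /* queued until a work item completes */
};

/* Per-context counters the policy needs (hypothetical layout). */
struct ctx_state {
    size_t maxevents;     /* what the user asked for in io_setup() */
    size_t inflated_max;  /* maxevents * inflation_factor */
    size_t in_flight;     /* all accepted, uncompleted iocbs */
    size_t in_workqueue;  /* iocbs currently running in workqueues */
};

/* Apply the checks in the same order as the pseudocode. */
static enum iocb_route route_iocb(const struct ctx_state *s,
                                  bool can_submit_async)
{
    assert(s->maxevents <= s->inflated_max);

    if (s->in_flight >= s->inflated_max)
        return ROUTE_EAGAIN;
    if (can_submit_async)
        return ROUTE_ASYNC;
    if (s->in_workqueue < s->maxevents)
        return ROUTE_WORKQUEUE;
    return ROUTE_BACKLOG;
}
```

The EAGAIN check comes first, so an application that stays within maxevents never sees it, and at most maxevents iocbs occupy workqueue threads at any time.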