Message-ID: <9b4a780e-72af-125e-d104-4179a859d581@scylladb.com>
Date: Tue, 17 Sep 2019 18:12:50 +0300
From: Avi Kivity <avi@...lladb.com>
To: Jens Axboe <axboe@...nel.dk>
Cc: linux-kernel@...r.kernel.org, linux-block@...r.kernel.org
Subject: Re: [PATCH v1] io_uring: reserve word at cqring tail+4 for the user
On 17/09/2019 17.54, Jens Axboe wrote:
> On 9/17/19 3:13 AM, Avi Kivity wrote:
>> In some applications, a thread waits for I/O events generated by
>> the kernel, and also events generated by other threads in the same
>> application. Typically events from other threads are passed using
>> in-memory queues that are not known to the kernel. As long as the
>> thread is active, it polls for both kernel completions and
>> inter-thread completions; when it is idle, it tells the other threads
>> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
>> then enters the kernel, waiting for such an event or an ordinary
>> I/O completion.
>>
>> When such a thread goes idle, it typically spins for a while to
>> avoid the kernel entry/exit cost in case an event is forthcoming
>> shortly. While it spins it polls both I/O completions and
>> inter-thread queues.
>>
>> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
>> line to be written to. This can be used with io_uring to wait for a
>> wakeup without spinning (and wasting power and slowing down the other
>> hyperthread). Other threads can also wake up the waiter by doing a
>> safe write to the tail word (which triggers the wakeup), but safe
>> writes are slow as they require an atomic instruction. To speed up
>> those wakeups, reserve a word after the tail for user writes.
>>
>> A thread consuming an io_uring completion queue can then use the
>> following sequences:
>>
>> - while busy:
>>   - pick up work from the completion queue and from other threads,
>>     and process it
>>
>> - while idle:
>>   - use UMONITOR/UMWAIT to wait on completions and notifications
>>     from other threads for a short period
>>   - if no work is picked up, let other threads know you will need
>>     a kernel wakeup, and use io_uring_enter to wait indefinitely
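To make the short idle wait concrete, here is a rough sketch of what the waiter and the waker could look like in userspace. The helper names wait_short() and notify() are made up for this example, the tail pointer is assumed to come from the mmap'ed CQ ring, and the non-WAITPKG fallback is plain bounded polling. The final io_uring_enter step is not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifdef __WAITPKG__
#include <x86intrin.h> /* _umonitor, _umwait, __rdtsc; build with -mwaitpkg */
#endif

/*
 * Hypothetical helper: wait briefly for either the CQ tail or the user word
 * after it to change from the values last seen.  `budget` is TSC cycles on
 * the UMWAIT path and poll iterations on the fallback path.
 * Returns 1 if something changed, 0 on timeout.
 */
static int wait_short(_Atomic uint32_t *tail, uint32_t seen_tail,
		      uint32_t seen_user, unsigned long long budget)
{
	_Atomic uint32_t *user = tail + 1; /* word reserved by the patch */

#ifdef __WAITPKG__
	_umonitor((void *)tail); /* arm monitoring of tail's cache line */
	/* Re-check after arming so a write that raced with us is not lost. */
	if (atomic_load_explicit(tail, memory_order_acquire) == seen_tail &&
	    atomic_load_explicit(user, memory_order_acquire) == seen_user)
		_umwait(0, __rdtsc() + budget); /* 0 requests the C0.2 state */
#else
	/* Fallback for CPUs without WAITPKG: bounded polling. */
	while (budget-- &&
	       atomic_load_explicit(tail, memory_order_acquire) == seen_tail &&
	       atomic_load_explicit(user, memory_order_acquire) == seen_user)
		;
#endif
	return atomic_load_explicit(tail, memory_order_acquire) != seen_tail ||
	       atomic_load_explicit(user, memory_order_acquire) != seen_user;
}

/* Waker side: a plain release store to the user word, no locked instruction. */
static void notify(_Atomic uint32_t *tail, uint32_t value)
{
	atomic_store_explicit(tail + 1, value, memory_order_release);
}
```

The waker never touches the tail itself, so it cannot corrupt the kernel's view of the ring; it only dirties the cache line that the waiter is monitoring.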
> This is cool, I like it. A few comments:
>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cfb48bd088e1..4bd7905cee1d 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -77,12 +77,13 @@
>>
>> #define IORING_MAX_ENTRIES 4096
>> #define IORING_MAX_FIXED_FILES 1024
>>
>> struct io_uring {
>> - u32 head ____cacheline_aligned_in_smp;
>> - u32 tail ____cacheline_aligned_in_smp;
>> + u32 head ____cacheline_aligned;
>> + u32 tail ____cacheline_aligned;
>> + u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>> };
> Since we have that full cacheline, maybe name this one a bit more
> appropriately as we can add others if we need it. Not a big deal.
You mean, name it for its intended purpose of serving as a write target
for umonitor/umwait wakes?
Note that the user won't see the name, and that it's only accurate for
an io_uring that's used for completions.
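For reference, here is a small userspace sketch of the layout property the patch relies on. The struct name ring_sketch, the helpers user_word() and same_cacheline(), and the 64-byte line size are assumptions for the example, not kernel definitions; real code would locate the tail through io_cqring_offsets.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumption: 64-byte cache lines, as on current x86 */

/* Hypothetical userspace mirror of the patched struct io_uring. */
struct ring_sketch {
	alignas(CACHELINE) uint32_t head;
	alignas(CACHELINE) uint32_t tail;
	uint32_t reserved_for_user; /* shares tail's cache line */
};

/* The reserved word is the u32 immediately after tail. */
static inline uint32_t *user_word(uint32_t *tail)
{
	return tail + 1;
}

/* True if both addresses fall in the same cache line. */
static inline int same_cacheline(const void *a, const void *b)
{
	return (uintptr_t)a / CACHELINE == (uintptr_t)b / CACHELINE;
}

static_assert(offsetof(struct ring_sketch, tail) % CACHELINE == 0,
	      "tail must start a cache line");
static_assert(offsetof(struct ring_sketch, reserved_for_user) ==
	      offsetof(struct ring_sketch, tail) + sizeof(uint32_t),
	      "user word must directly follow tail");
```

Because the user word sits in the tail's cache line, a single UMONITOR on the tail address covers both kernel completions and inter-thread wakeups.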
> But definitely use /* */ style comments :-)
Sorry, in C++-land for a while. You're lucky I didn't turn the whole
thing into a virtual template something.
>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 1e1652f25cc1..1a6a826a66f3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>> */
>> #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
>>
>> struct io_cqring_offsets {
>> __u32 head;
>> + // tail is guaranteed to be aligned on a cache line, and to have the
>> + // following __u32 free for user use. This allows using e.g.
>> // UMONITOR/UMWAIT to wait on both writes to tail and writes from
>> + // other threads to the following word.
>> __u32 tail;
>> __u32 ring_mask;
>> __u32 ring_entries;
>> __u32 overflow;
>> __u32 cqes;
> Ditto on the comments here.
Sure.
> Would be ideal if we could pair this with an example for liburing, a basic
> test case would be fine. Something that shows how to use it, and verifies
> that it works.
I'll have to look for a machine with waitpkg for that.
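A test could probe for the feature at runtime and skip itself when it is missing. A minimal sketch, assuming GCC or Clang's <cpuid.h>; the function name has_waitpkg() is made up here. WAITPKG is reported in CPUID leaf 7, subleaf 0, ECX bit 5.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper for the CPUID instruction */
#endif

#define WAITPKG_ECX_BIT 5 /* CPUID.(EAX=7,ECX=0):ECX bit 5 */

/* Returns 1 if the CPU reports WAITPKG, 0 otherwise (or on non-x86). */
static int has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0; /* leaf 7 not supported */
	return (ecx >> WAITPKG_ECX_BIT) & 1;
#else
	return 0;
#endif
}
```

On Linux the same information shows up as the waitpkg flag in /proc/cpuinfo.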
> Also, this patch is against master, it should be against for-5.4/io_uring as
> it won't apply there right now.
Sure, will rebase.