Message-Id: <20190917091358.3652-1-avi@scylladb.com>
Date:   Tue, 17 Sep 2019 12:13:58 +0300
From:   Avi Kivity <avi@...lladb.com>
To:     Jens Axboe <axboe@...nel.dk>
Cc:     linux-kernel@...r.kernel.org, linux-block@...r.kernel.org
Subject: [PATCH v1] io_uring: reserve word at cqring tail+4 for the user

In some applications, a thread waits for I/O events generated by
the kernel, and also events generated by other threads in the same
application. Typically events from other threads are passed using
in-memory queues that are not known to the kernel. As long as the
thread is active, it polls for both kernel completions and
inter-thread completions; when it is idle, it tells the other threads
to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
then enters the kernel, waiting for such an event or an ordinary
I/O completion.

When such a thread goes idle, it typically spins for a while to
avoid the kernel entry/exit cost in case an event is forthcoming
shortly. While it spins it polls both I/O completions and
inter-thread queues.

The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
line to be written to. This can be used with io_uring to wait for a
wakeup without spinning (and wasting power and slowing down the other
hyperthread). Other threads can also wake up the waiter by doing a
safe write to the tail word (which triggers the wakeup), but safe
writes are slow as they require an atomic instruction. To speed up
those wakeups, reserve a word after the tail for user writes.

A thread consuming an io_uring completion queue can then use the
following sequences:

  - while busy:
    - pick up work from the completion queue and from other threads,
      and process it

  - while idle:
    - use UMONITOR/UMWAIT to wait on completions and notifications
      from other threads for a short period
    - if no work is picked up, let other threads know you will need
      a kernel wakeup, and use io_uring_enter to wait indefinitely
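
The idle-phase short wait can be sketched roughly as follows. This is
only an illustration, not part of the patch: struct cq_tail_line,
wait_short() and notify_peer() are hypothetical names, the 64-byte line
size and the 100000-cycle deadline are assumptions, and the
UMONITOR/UMWAIT path is compiled only when the compiler defines
__WAITPKG__ (e.g. with -mwaitpkg); otherwise the loop degrades to
plain polling.

```c
#include <stdatomic.h>
#include <stdint.h>

#if defined(__WAITPKG__)
#include <x86intrin.h>   /* _umonitor, _umwait, __rdtsc */
#endif

/* Hypothetical user-space view of the cache line holding the CQ tail. */
struct cq_tail_line {
	_Alignas(64) _Atomic uint32_t tail;  /* advanced by the kernel */
	_Atomic uint32_t user_word;          /* the word reserved at tail+4 */
};

enum wake_reason { WAKE_NONE, WAKE_COMPLETION, WAKE_PEER };

/* Other threads: an ordinary store is enough, no atomic RMW is needed. */
static void notify_peer(struct cq_tail_line *l)
{
	atomic_store_explicit(&l->user_word, 1, memory_order_release);
}

/* Report why the waiter should stop waiting, consuming a peer flag if set. */
static enum wake_reason check(struct cq_tail_line *l, uint32_t seen_tail)
{
	if (atomic_load_explicit(&l->tail, memory_order_acquire) != seen_tail)
		return WAKE_COMPLETION;
	if (atomic_load_explicit(&l->user_word, memory_order_acquire) != 0) {
		atomic_exchange_explicit(&l->user_word, 0, memory_order_acquire);
		return WAKE_PEER;
	}
	return WAKE_NONE;
}

/* Wait for a bounded number of rounds on either word of the line. */
static enum wake_reason wait_short(struct cq_tail_line *l,
				   uint32_t seen_tail, unsigned rounds)
{
	for (unsigned i = 0; i < rounds; i++) {
#if defined(__WAITPKG__)
		_umonitor(l);                    /* arm the monitor first */
#endif
		enum wake_reason r = check(l, seen_tail);
		if (r != WAKE_NONE)
			return r;
#if defined(__WAITPKG__)
		_umwait(0, __rdtsc() + 100000);  /* assumed short deadline */
#endif
	}
	return WAKE_NONE;
}
```

If wait_short() returns WAKE_NONE, the caller would fall back to the
last step above: tell the other threads that a kernel wakeup is needed
and block in io_uring_enter.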

Signed-off-by: Avi Kivity <avi@...lladb.com>
---
 fs/io_uring.c                 | 5 +++--
 include/uapi/linux/io_uring.h | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index cfb48bd088e1..4bd7905cee1d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -77,12 +77,13 @@
 
 #define IORING_MAX_ENTRIES	4096
 #define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
-	u32 head ____cacheline_aligned_in_smp;
-	u32 tail ____cacheline_aligned_in_smp;
+	u32 head ____cacheline_aligned;
+	u32 tail ____cacheline_aligned;
+	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
 };
 
 /*
  * This data is shared with the application through the mmap at offset
  * IORING_OFF_SQ_RING.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1e1652f25cc1..1a6a826a66f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -103,10 +103,14 @@ struct io_sqring_offsets {
  */
 #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
 
 struct io_cqring_offsets {
 	__u32 head;
+	// tail is guaranteed to be aligned on a cache line, and to have the
+	// following __u32 free for user use. This allows using e.g.
+	// UMONITOR/UMWAIT to wait on both writes to tail and writes from
+	// other threads to the following word.
 	__u32 tail;
 	__u32 ring_mask;
 	__u32 ring_entries;
 	__u32 overflow;
 	__u32 cqes;
-- 
2.21.0
