Message-ID: <20250128133927.3989681-1-max.kellermann@ionos.com>
Date: Tue, 28 Jan 2025 14:39:19 +0100
From: Max Kellermann <max.kellermann@...os.com>
To: axboe@...nel.dk,
	asml.silence@...il.com,
	io-uring@...r.kernel.org,
	linux-kernel@...r.kernel.org
Cc: Max Kellermann <max.kellermann@...os.com>
Subject: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)

While optimizing my io_uring-based web server, I found that the kernel
spends 35% of its CPU time waiting for `io_wq_acct.lock`.  This patch
set reduces contention on that lock, though I believe much more could
be done to allow more worker concurrency.
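
To sketch the main structural change (the third patch below):
conceptually, the worker lists and the lock protecting them move from
the shared `struct io_wq` into each `struct io_wq_acct`, so the bound
and unbound worker classes stop serializing on a single lock.
Roughly (simplified, not the literal diff):

    /* rough before/after shape; simplified, not the literal diff */
    struct io_wq_acct {
        raw_spinlock_t lock;                /* now also covers the lists below */
        struct hlist_nulls_head free_list;  /* idle workers of this class */
        struct list_head all_list;          /* all workers of this class */
        /* ... */
    };

    struct io_wq {
        /* free_list/all_list used to live here, shared by both
         * accounting classes under one wq->lock; after the move,
         * each class works under its own acct->lock */
        struct io_wq_acct acct[2];          /* bound + unbound */
        /* ... */
    };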

I measured these patches with my HTTP server (serving static files and
running a tiny PHP script) and with a micro-benchmark that submits
millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
offloading the operation to a worker, so this offload overhead can be
measured).
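
For reference, a minimal liburing sketch of such a NOP benchmark could
look like the following; the batch size and iteration count are
arbitrary, and this is not my actual benchmark code:

    /* minimal sketch, not my actual benchmark; link with -luring */
    #include <liburing.h>

    #define BATCH 256                     /* arbitrary batch size */

    int main(void)
    {
        struct io_uring ring;
        unsigned long done = 0, total = 10000000UL;  /* arbitrary */

        if (io_uring_queue_init(BATCH, &ring, 0) < 0)
            return 1;

        while (done < total) {
            struct io_uring_cqe *cqe;
            unsigned i;

            /* queue one batch of NOPs, each forced into io-wq */
            for (i = 0; i < BATCH; ++i) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                if (sqe == NULL)
                    return 1; /* cannot happen: SQ was just drained */

                io_uring_prep_nop(sqe);
                /* IOSQE_ASYNC forces offload to an io-wq worker,
                 * so the offload overhead itself gets exercised */
                sqe->flags |= IOSQE_ASYNC;
            }

            io_uring_submit(&ring);

            /* reap the whole batch before refilling the SQ ring */
            for (i = 0; i < BATCH; ++i) {
                if (io_uring_wait_cqe(&ring, &cqe) < 0)
                    return 1;
                io_uring_cqe_seen(&ring, cqe);
            }

            done += BATCH;
        }

        io_uring_queue_exit(&ring);
        return 0;
    }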

Some of the optimizations eliminate memory accesses, e.g. by passing
values that are already known to (inlined) functions and by caching
values in local variables.  These are useful optimizations, but their
effect is too small to measure in a benchmark (too much noise).
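
As a made-up illustration of the pattern (not actual patch content):
read the field once into a local variable and hand the cached value
to the inlined helpers, instead of letting each helper dereference
the struct again:

    /* made-up illustration of the pattern, not actual patch content */
    struct job {
        unsigned long flags;
        /* ... */
    };

    static inline int job_is_hashed(unsigned long flags)
    {
        return flags & 1;
    }

    static inline int job_is_bound(unsigned long flags)
    {
        return flags & 2;
    }

    static void handle_job(struct job *job)
    {
        /* read job->flags once; pass the cached value to the inlined
         * helpers instead of letting each helper load it again */
        const unsigned long flags = job->flags;

        if (job_is_hashed(flags)) {
            /* ... */
        }
        if (job_is_bound(flags)) {
            /* ... */
        }
    }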

Some of the patches have a measurable effect; those contain benchmark
numbers that I could reproduce across repeated runs, despite the
noise.

I'm not confident about the correctness of the last patch ("io_uring:
skip redundant poll wakeups").  It seemed like low-hanging fruit, so
low that it made me suspicious.  If this is a useful optimization, the
idea could probably be ported to other wait_queue users, or even into
the wait_queue library itself.  What I'm unsure about is whether the
optimization is valid, or whether it can miss wakeups and lead to
stalls.  Please advise!
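
For discussion, here is a userspace sketch of the general idea, with
made-up names; whether the kernel equivalent is race-free is exactly
what I'm unsure about:

    /* userspace sketch of the idea; all names are made up */
    #include <stdatomic.h>
    #include <stdbool.h>

    struct waitpoint {
        atomic_bool wakeup_pending;
        /* plus the real wait queue, omitted here */
    };

    static void waker(struct waitpoint *wp)
    {
        /* only the waker that flips false->true issues the
         * (expensive) wakeup; concurrent wakers see "true" and
         * skip it */
        if (atomic_exchange(&wp->wakeup_pending, true))
            return;

        /* wake_up(&wp->wq) would go here */
    }

    static void waiter_on_wakeup(struct waitpoint *wp)
    {
        /* the waiter must clear the flag *before* re-checking its
         * condition; clearing it after the check is the kind of
         * window that could lose a wakeup and stall */
        atomic_store(&wp->wakeup_pending, false);

        /* ... re-check condition, possibly sleep again ... */
    }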

Total "perf diff" for `IORING_OP_NOP`:

    42.25%     -9.24%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     4.79%     +2.83%  [kernel.kallsyms]     [k] io_worker_handle_work
     7.23%     -1.41%  [kernel.kallsyms]     [k] io_wq_submit_work
     6.80%     +1.23%  [kernel.kallsyms]     [k] io_wq_free_work
     3.19%     +1.10%  [kernel.kallsyms]     [k] io_req_task_complete
     2.45%     +0.94%  [kernel.kallsyms]     [k] try_to_wake_up
               +0.81%  [kernel.kallsyms]     [k] io_acct_activate_free_worker
     0.79%     +0.64%  [kernel.kallsyms]     [k] __schedule

Serving static files with HTTP (send+receive on local+TCP, splice
file->pipe->TCP):

    42.92%     -7.84%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     1.53%     -1.51%  [kernel.kallsyms]     [k] ep_poll_callback
     1.18%     +1.49%  [kernel.kallsyms]     [k] io_wq_free_work
     0.61%     +0.60%  [kernel.kallsyms]     [k] try_to_wake_up
     0.76%     -0.43%  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
     2.22%     -0.33%  [kernel.kallsyms]     [k] io_wq_submit_work

Running PHP script (send+receive on local+TCP, splice pipe->TCP):

    33.01%     -4.13%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     1.57%     -1.56%  [kernel.kallsyms]     [k] ep_poll_callback
     1.36%     +1.19%  [kernel.kallsyms]     [k] io_wq_free_work
     0.94%     -0.61%  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
     2.56%     -0.36%  [kernel.kallsyms]     [k] io_wq_submit_work
     2.06%     +0.36%  [kernel.kallsyms]     [k] io_worker_handle_work
     1.00%     +0.35%  [kernel.kallsyms]     [k] try_to_wake_up

(The `IORING_OP_NOP` benchmark finishes after a hardcoded number of
operations; the two HTTP benchmarks finish after a certain wallclock
duration, and therefore more HTTP requests were handled.)

Max Kellermann (8):
  io_uring/io-wq: eliminate redundant io_work_get_acct() calls
  io_uring/io-wq: add io_worker.acct pointer
  io_uring/io-wq: move worker lists to struct io_wq_acct
  io_uring/io-wq: cache work->flags in variable
  io_uring/io-wq: do not use bogus hash value
  io_uring/io-wq: pass io_wq to io_get_next_work()
  io_uring: cache io_kiocb->flags in variable
  io_uring: skip redundant poll wakeups

 include/linux/io_uring_types.h |  10 ++
 io_uring/io-wq.c               | 230 +++++++++++++++++++--------------
 io_uring/io-wq.h               |   7 +-
 io_uring/io_uring.c            |  63 +++++----
 io_uring/io_uring.h            |   2 +-
 5 files changed, 187 insertions(+), 125 deletions(-)

-- 
2.45.2

