Message-ID: <a5d8d039-f2d7-4adb-afd7-693b3be41e45@gmail.com>
Date: Wed, 29 Jan 2025 19:30:43 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Max Kellermann <max.kellermann@...os.com>, Jens Axboe <axboe@...nel.dk>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock
 contention)

On 1/29/25 17:39, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@...nel.dk> wrote:
>> The other patches look pretty straightforward to me. The only thing
>> that has me puzzled a bit is why you have so much io-wq activity with
>> your application; in general I'd expect zero activity there. But then
>> I saw the forced ASYNC flag, and it makes sense. In general, forcing
>> that isn't a great idea, but for benchmarking io-wq it certainly
>> makes sense.
> 
> I was experimenting with io_uring and wanted to see how much
> performance I could squeeze out of my web server running
> single-threaded. The overhead of io_uring_submit() grew very large,
> because the "send" operation would do a lot of synchronous work in
> the kernel. I tried SQPOLL, but it was actually a big performance
> regression; it just shifted my CPU usage to epoll_wait(). Forcing
> ASYNC gave me large throughput improvements (moving the submission
> overhead to iowq), but then the iowq lock contention became the next
> limit, hence this patch series.
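
For anyone reproducing the experiment: forcing a request to io-wq
amounts to setting IOSQE_ASYNC on the SQE before submitting. A minimal
liburing sketch (send_async, fd and buf are placeholder names, not
taken from the series):

#include <errno.h>
#include <liburing.h>

/* Hypothetical helper: force one send onto io-wq instead of letting
 * it run inline in io_uring_submit(). */
static int send_async(struct io_uring *ring, int fd,
                      const void *buf, size_t len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EBUSY;
        io_uring_prep_send(sqe, fd, buf, len, 0);
        /* IOSQE_ASYNC: skip the inline issue attempt and hand the
         * request straight to the io-wq worker pool. */
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
        return io_uring_submit(ring);
}

Without the flag, io_uring first tries to issue the send inline from
the submitting task, which is exactly the synchronous work described
above.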

It's great to see iowq getting some optimisations, but note that
it wouldn't be fair to compare it to single-threaded peers when
you have a lot of iowq activity, as it might be occupying multiple
CPUs. A curious open question is whether it'd be more performant
to have several user threads, each with its own private ring.
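
A rough skeleton of that variant, assuming one ring per thread with
nothing shared (worker and the queue depth are arbitrary, and error
handling is elided):

#include <liburing.h>

/* Each thread runs worker(), e.g. started with pthread_create().
 * Nothing is shared between threads, so the submission path takes
 * no cross-thread locks. */
static void *worker(void *arg)
{
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        (void)arg;
        if (io_uring_queue_init(256, &ring, 0) < 0)
                return NULL;
        for (;;) {
                /* ... queue this thread's own SQEs here ... */
                io_uring_submit_and_wait(&ring, 1);
                while (io_uring_peek_cqe(&ring, &cqe) == 0) {
                        /* ... handle the completion ... */
                        io_uring_cqe_seen(&ring, cqe);
                }
        }
        io_uring_queue_exit(&ring);  /* unreachable in this sketch */
        return NULL;
}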

> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.

It's wasteful unless you keep it saturated close to 100%, and even
then SQPOLL usually runs on a separate CPU from the user task
submitting requests, so it incurs some cache bouncing. It's not a
silver bullet.
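
FWIW, the placement is controllable: IORING_SETUP_SQ_AFF pins the
SQPOLL kthread to a chosen CPU, and sq_thread_idle bounds how long it
spins while idle. A setup sketch (the helper name, CPU number and
timeout are arbitrary choices, not recommendations):

#include <string.h>
#include <liburing.h>

/* Sketch: dedicate a specific CPU to the SQPOLL kthread so it doesn't
 * share a core (and bounce cachelines) with the submitting task. */
static int setup_sqpoll_ring(struct io_uring *ring)
{
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
        p.sq_thread_cpu = 3;     /* arbitrary: keep the poller off the submitter's CPU */
        p.sq_thread_idle = 2000; /* ms of idle spinning before the poller sleeps */
        return io_uring_queue_init_params(256, ring, &p);
}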

-- 
Pavel Begunkov

