Date: Wed, 19 Jun 2024 09:51:29 -0600
From: Jens Axboe <axboe@...nel.dk>
To: hexue <xue01.he@...sung.com>
Cc: asml.silence@...il.com, io-uring@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5] io_uring: releasing CPU resources when polling

On 6/19/24 1:18 AM, hexue wrote:
> io_uring's polling mode can improve IO performance, but it spends 100%
> of the CPU doing the polling.
> 
> This adds a setup flag, "IORING_SETUP_HY_POLL", giving applications an
> interface to enable a new hybrid polling mode at the io_uring level.
> 
> A new hybrid poll is implemented in the io_uring layer. Once an IO is
> issued, it is not polled for immediately; the task blocks first and is
> woken up shortly before the IO completes, then polls to reap it. This
> keeps most of the performance of polling while freeing up some CPU
> resources.
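
For illustration, here is a minimal liburing sketch of how an
application could opt in. The IORING_SETUP_HY_POLL name comes from the
patch; combining it with IORING_SETUP_IOPOLL and building against the
patched kernel headers are assumptions, not something the description
above spells out:

#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring_params p = { 0 };
	struct io_uring ring;
	int ret;

	/*
	 * IORING_SETUP_IOPOLL selects polled completions as usual; the
	 * patch's IORING_SETUP_HY_POLL (assumed here to be set alongside
	 * it) asks for the hybrid sleep-then-poll behaviour instead of
	 * busy-polling from submission onwards. Needs the patched kernel
	 * and headers to build and run.
	 */
	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_HY_POLL;

	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue init: %s\n", strerror(-ret));
		return 1;
	}

	/*
	 * ... open files with O_DIRECT, prep and submit read/write
	 * SQEs; io_uring_wait_cqe() then reaps completions via the
	 * polled path.
	 */

	io_uring_queue_exit(&ring);
	return 0;
}
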
> 
> We considered complex situations such as high concurrency and
> different processing speeds across multiple disks.
> 
> Test results:
> set 8 poll queues, fio-3.35, Gen5 SSD, 8 CPU VM
> 
> per CPU utilization:
>     read(128k, QD64, 1Job)     53%   write(128k, QD64, 1Job)     45%
>     randread(4k, QD64, 16Job)  70%   randwrite(4k, QD64, 16Job)  16%
> performance reduction:
>     read  0.92%     write  0.92%    randread  1.61%    randwrite  0%

Haven't tried this on slower storage yet, but my usual 122M IOPS polled
test case (24 drives, each using a single thread to load up a drive)
yields the following with hybrid polling enabled:

IOPS=57.08M, BW=27.87GiB/s, IOS/call=32/31
IOPS=56.91M, BW=27.79GiB/s, IOS/call=32/32
IOPS=57.93M, BW=28.29GiB/s, IOS/call=31/31
IOPS=57.82M, BW=28.23GiB/s, IOS/call=32/32

which is even slower than IRQ driven.

It does use less CPU, about 1900% compared to the 2400% before with
regular polling. And obviously this is not the best case for this
scenario, as these devices have low latencies. Like I predicted in
earlier replies, most of the added overhead here is TSC reading, beyond
the obvious cost of now having wakeups and context switches (about
1M/sec of the latter for this test).
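
To make the source of those TSC reads concrete: a hybrid poll has to
timestamp each IO and sleep for an estimated interval before it starts
spinning, roughly along the lines below. This is an illustrative sketch
only, not the patch's actual code; the real clock source, sleep
primitive and bookkeeping may differ:

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical per-IO bookkeeping, not the patch's struct. */
struct hybrid_poll_state {
	u64 issue_ns;		/* clock read #1, taken at submission */
	u64 est_latency_ns;	/* running per-device completion estimate */
};

static void hybrid_poll_sleep(struct hybrid_poll_state *hp)
{
	u64 now = ktime_get_ns();	/* clock read #2 */
	s64 sleep_ns = hp->issue_ns + hp->est_latency_ns - now;

	if (sleep_ns > 0) {
		ktime_t kt = ns_to_ktime(sleep_ns);

		/*
		 * Blocking here is where the CPU saving comes from, but
		 * it is also the source of the wakeups and context
		 * switches (the ~1M/sec seen in this test).
		 */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_hrtimeout(&kt, HRTIMER_MODE_REL);
	}
	/*
	 * After waking, fall back to the normal iopoll loop to reap the
	 * CQE, with yet another clock read to update est_latency_ns.
	 * At tens of millions of IOPS, these per-IO clock reads (the
	 * TSC on x86) add up to the overhead described above.
	 */
}
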

If we move to regular flash, here's another box I have with 32 flash
drives in it. For a similar test, we get:

IOPS=104.01M, BW=50.78GiB/s, IOS/call=31/31
IOPS=103.92M, BW=50.74GiB/s, IOS/call=31/31
IOPS=103.99M, BW=50.78GiB/s, IOS/call=31/31
IOPS=103.97M, BW=50.77GiB/s, IOS/call=31/31
IOPS=104.01M, BW=50.79GiB/s, IOS/call=31/31
IOPS=104.02M, BW=50.79GiB/s, IOS/call=31/31
IOPS=103.62M, BW=50.59GiB/s, IOS/call=31/31

using 3200% CPU (32 drives, 32 threads polling) with regular polling,
and enabling hybrid polling:

IOPS=53.62M, BW=26.18GiB/s, IOS/call=32/32
IOPS=53.37M, BW=26.06GiB/s, IOS/call=31/31
IOPS=53.45M, BW=26.10GiB/s, IOS/call=32/31
IOPS=53.43M, BW=26.09GiB/s, IOS/call=32/32
IOPS=53.11M, BW=25.93GiB/s, IOS/call=32/32

and again there's a lot of TSC overhead (> 10%), plus overhead from
your extra allocations (8%).

If we just do a single flash drive, it'll do 3.25M IOPS at 100% CPU
with normal polling, and 2.0M IOPS at 50% CPU usage with hybrid polling.

While I do suspect there are cases where hybrid polling will be more
efficient, I'm not sure there are many of them. And you're most likely
better off just doing IRQ driven IO at that point? Particularly with the
fairly substantial overhead of maintaining the data you need, and of
querying the time.

-- 
Jens Axboe

