Message-Id: <20240812125717.413108-1-jdamato@fastly.com>
Date: Mon, 12 Aug 2024 12:57:03 +0000
From: Joe Damato <jdamato@fastly.com>
To: netdev@vger.kernel.org
Cc: mkarsten@uwaterloo.ca,
    amritha.nambiar@intel.com,
    sridhar.samudrala@intel.com,
    sdf@fomichev.me,
    Joe Damato <jdamato@fastly.com>,
    Alexander Lobakin <aleksander.lobakin@intel.com>,
    Alexander Viro <viro@zeniv.linux.org.uk>,
    Breno Leitao <leitao@debian.org>,
    Christian Brauner <brauner@kernel.org>,
    Daniel Borkmann <daniel@iogearbox.net>,
    "David S. Miller" <davem@davemloft.net>,
    Eric Dumazet <edumazet@google.com>,
    Jakub Kicinski <kuba@kernel.org>,
    Jan Kara <jack@suse.cz>,
    Jiri Pirko <jiri@resnulli.us>,
    Johannes Berg <johannes.berg@intel.com>,
    Jonathan Corbet <corbet@lwn.net>,
    linux-doc@vger.kernel.org (open list:DOCUMENTATION),
    linux-fsdevel@vger.kernel.org (open list:FILESYSTEMS (VFS and infrastructure)),
    linux-kernel@vger.kernel.org (open list),
    Lorenzo Bianconi <lorenzo@kernel.org>,
    Paolo Abeni <pabeni@redhat.com>,
    Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Greetings:

Martin Karsten (CC'd) and I have been collaborating on some ideas about
ways of reducing tail latency when using epoll-based busy poll, and we'd
love to get feedback from the list on the code in this series. This is
the idea I mentioned at netdev conf, for those who were there. Barring
any major issues, we hope to submit this officially shortly after the
RFC period.

The basic idea for suspending IRQs in this manner was described in an
earlier paper presented at Sigmetrics 2024 [1].

Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for
epoll_params") introduced the ability to enable or disable preferred
busy poll mode on a specific epoll context using an ioctl
(EPIOCSPARAMS).
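
For reference, a minimal sketch of enabling preferred busy poll via
this ioctl (the parameter values are illustrative; the struct layout
and ioctl number mirror the uapi <linux/eventpoll.h> added by the
commit above, copied here in case libc headers do not expose them):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  /* Mirrors struct epoll_params and EPIOCSPARAMS from the kernel's
   * uapi <linux/eventpoll.h> (kernel >= 6.9). */
  struct epoll_params {
          uint32_t busy_poll_usecs;
          uint16_t busy_poll_budget;
          uint8_t prefer_busy_poll;
          uint8_t __pad;
  };
  #define EPOLL_IOC_TYPE 0x8A
  #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)

  int main(void)
  {
          int epfd = epoll_create1(0);
          struct epoll_params params = {
                  .busy_poll_usecs = 64,  /* illustrative values */
                  .busy_poll_budget = 64,
                  .prefer_busy_poll = 1,
          };

          if (epfd < 0 || ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
                  perror("enable prefer_busy_poll");
                  return 1;
          }
          return 0;
  }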

This series extends preferred busy poll mode by adding a per-netdev
sysfs parameter, irq_suspend_timeout, which, when used in combination
with preferred busy poll, suspends device IRQs for up to
irq_suspend_timeout nanoseconds.
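
As a rough illustration, the knobs could be set from a small C helper
like the sketch below (eth0 and the specific values are assumptions
for the example; napi_defer_hard_irqs and gro_flush_timeout are the
existing per-netdev sysfs attributes, and the irq_suspend_timeout
attribute is assumed to live alongside them, as added by patch 1):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Write a string value to a sysfs attribute. */
  static int write_sysfs(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);
          ssize_t n;

          if (fd < 0) {
                  perror(path);
                  return -1;
          }
          n = write(fd, val, strlen(val));
          close(fd);
          return n < 0 ? -1 : 0;
  }

  int main(void)
  {
          /* Values from the suspend20 config below; gro_flush_timeout
           * and irq_suspend_timeout are in nanoseconds. */
          write_sysfs("/sys/class/net/eth0/napi_defer_hard_irqs", "100");
          write_sysfs("/sys/class/net/eth0/gro_flush_timeout", "20000");
          write_sysfs("/sys/class/net/eth0/irq_suspend_timeout",
                      "20000000");
          return 0;
  }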

Important call outs:

- Enabling per epoll-context preferred busy poll will now effectively
  lead to a nonblocking iteration through napi_busy_loop, even when
  busy_poll_usecs is 0. See patch 4.

- Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa:
  microchip: ksz9477: split half-duplex monitoring function"), but may
  need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate
  data-race of busy_poll_usecs"), which has been picked up by the vfs
  folks, makes its way into net-next.

- In the future, time permitting, I hope to enable support for
  napi_defer_hard_irqs, gro_flush_timeout (introduced in commit
  6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and
  irq_suspend_timeout (introduced in this series) on a per-NAPI basis
  (presumably via netdev-genl).

~ Description of the changes

The overall idea is that IRQ suspension is introduced via a sysfs
parameter which controls the maximum time that IRQs can be suspended.

Here's how it is intended to work (a sketch in C follows the list
below):

- An administrator sets the existing sysfs parameters for
  defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.

- An administrator sets the new sysfs parameter irq_suspend_timeout
  to a value larger than gro_flush_timeout to enable IRQ suspension.

- The user application issues the existing epoll ioctl to set the
  prefer_busy_poll flag on the epoll context.

- The user application then calls epoll_wait to busy poll for network
  events, as it normally would.

- If epoll_wait returns events to userland, IRQs are suspended for the
  duration of irq_suspend_timeout.

- If epoll_wait finds no events and the thread is about to go to
  sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is
  resumed.
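
Below is a hedged sketch of the application side of this flow (it
assumes epfd was configured with EPIOCSPARAMS as shown earlier and
that sockets were registered via epoll_ctl; the comments paraphrase
the kernel behavior described in the list above):

  #include <stdio.h>
  #include <sys/epoll.h>

  void event_loop(int epfd)
  {
          struct epoll_event evs[64];
          int i, n;

          for (;;) {
                  /* With prefer_busy_poll set, this busy polls the
                   * NAPI instance associated with the registered
                   * sockets, even when busy_poll_usecs is 0 (patch 4). */
                  n = epoll_wait(epfd, evs, 64, -1);
                  if (n < 0) {
                          perror("epoll_wait");
                          break;
                  }

                  /* Events were returned: device IRQs stay suspended
                   * for up to irq_suspend_timeout ns, and each
                   * successful call extends the suspension. When no
                   * events are found and this thread is about to
                   * sleep, normal deferral via gro_flush_timeout and
                   * defer_hard_irqs resumes. */
                  for (i = 0; i < n; i++) {
                          /* handle evs[i] ... */
                  }
          }
  }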

As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. Unless IRQ
suspension is extended by subsequent calls to epoll_wait, it ends
automatically when the irq_suspend_timeout timer expires.

When network traffic decreases, the busy poll loop in the kernel will
eventually retrieve no data. When this occurs, regular deferral using
gro_flush_timeout for the polled NAPI is immediately re-enabled.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.

~ Benchmark configs & descriptions

These changes were benchmarked with memcached [2] using the
benchmarking tool mutilate [3].

To facilitate benchmarking, a small patch [4] was applied to
memcached 1.6.29 (the latest memcached release as of this RFC) to
allow setting per-epoll context preferred busy poll and other settings
via environment variables.

Multiple scenarios were benchmarked, as described below; the scripts
used to produce these results can be found on GitHub [5].

(note: all scenarios use NAPI-based traffic splitting via
SO_INCOMING_NAPI_ID by passing -N to memcached; see the sketch of this
sockopt after the scenario list):

- base: Other than NAPI-based traffic splitting, no other options are
  enabled.

- busy:
  - set defer_hard_irqs to 100
  - set gro_flush_timeout to 200,000
  - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
    busy_poll_budget = 64, prefer_busy_poll = true)

- deferX:
  - set defer_hard_irqs to 100
  - set gro_flush_timeout to X,000

- suspendX:
  - set defer_hard_irqs to 100
  - set gro_flush_timeout to X,000
  - set irq_suspend_timeout to 20,000,000
  - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
    busy_poll_budget = 64, prefer_busy_poll = true)
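
For context, the NAPI-based traffic splitting above relies on the
SO_INCOMING_NAPI_ID socket option; a minimal sketch of querying it on
a connected socket (the grouping strategy in the comment is how
memcached's -N mode uses it, per [2]):

  #include <stdio.h>
  #include <sys/socket.h>

  #ifndef SO_INCOMING_NAPI_ID
  #define SO_INCOMING_NAPI_ID 56  /* <asm-generic/socket.h> value */
  #endif

  /* Return the NAPI ID of the RX queue this socket's traffic arrives
   * on (0 if unknown); a server can group connections by NAPI ID and
   * give each group a dedicated worker thread and epoll context. */
  unsigned int socket_napi_id(int fd)
  {
          unsigned int napi_id = 0;
          socklen_t len = sizeof(napi_id);

          if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                         &napi_id, &len) < 0)
                  perror("getsockopt(SO_INCOMING_NAPI_ID)");
          return napi_id;
  }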

~ Benchmark results

Tested on:

Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100Gbps Mellanox Technologies ConnectX-5 Ex EN NIC

The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface under test. memcached binds to the IPv4
address on the configured interface.

The NIC's interrupt coalescing configuration is left at boot-time
defaults.

The overall takeaway from the results below is that the new mechanism
(suspend20, see below) reduces 99th percentile latency and increases
QPS in the MAX QPS case compared to the other cases, and it reduces
latency in the lower QPS cases with CPU usage comparable to the base
case (and lower than the busy case).

base
 load      qps  avglat  95%lat  99%lat  cpu
 200K   199982     109     225     385   30
 400K   400054     138     262     676   44
 600K   599968     165     396     737   64
 800K   800002     353    1136    2098   83
1000K   964960    3202    5556    7003   98
  MAX   957274    4255    5526    6843  100

busy
 load      qps  avglat  95%lat  99%lat  cpu
 200K   199936     101     239     287   57
 400K   399795      81     230     302   83
 600K   599797      65     169     264   95
 800K   799789      67     145     221   99
1000K  1000135      97     186     287  100
  MAX  1079228    3752    7481   12634   98

defer20
 load      qps  avglat  95%lat  99%lat  cpu
 200K   200052      60     130     156   28
 400K   399797      67     140     176   49
 600K   600049      94     189     483   68
 800K   800106     246     959    2201   88
1000K   857377    4377    5674    5830  100
  MAX   974672    4162    5454    5815  100

defer200
 load      qps  avglat  95%lat  99%lat  cpu
 200K   200029     165     258     316   18
 400K   399978     183     280     340   32
 600K   599818     205     310     367   46
 800K   799869     265     439     829   73
1000K   995961    2307    5163    7027   98
  MAX  1050680    3837    5020    5596  100

suspend20
 load      qps  avglat  95%lat  99%lat  cpu
 200K   199968      58     128     161   31
 400K   400191      61     135     175   51
 600K   599872      67     142     196   66
 800K   800050      78     153     220   82
1000K   999638     101     194     292   91
  MAX  1144308    3596    3961    4155  100

suspend200
 load      qps  avglat  95%lat  99%lat  cpu
 200K   199973     149     251     313   20
 400K   399957     154     270     331   35
 600K   599878     157     284     351   51
 800K   800091     158     293     359   65
1000K  1000399     173     311     393   85
  MAX  1128033    3636    4210    4381  100
[1]: https://doi.org/10.1145/3626780
[2]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[3]: https://github.com/leverich/mutilate
[4]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
[5]: https://github.com/martinkarsten/irqsuspend
Thanks,
Martin and Joe
Martin Karsten (5):
net: Add sysfs parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
 Documentation/networking/napi.rst |  3 ++
 fs/eventpoll.c                    | 26 +++++++++++++--
 include/linux/netdevice.h         |  2 ++
 include/net/busy_poll.h           |  3 ++
 net/core/dev.c                    | 55 +++++++++++++++++++++++++++----
 net/core/net-sysfs.c              | 18 ++++++++++
 6 files changed, 98 insertions(+), 9 deletions(-)
--
2.25.1