[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57a0e1e7-1079-4055-8072-d9105b70103f@redhat.com>
Date: Tue, 29 Oct 2024 11:25:18 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Joe Damato <jdamato@...tly.com>, netdev@...r.kernel.org
Cc: namangulati@...gle.com, edumazet@...gle.com, amritha.nambiar@...el.com,
sridhar.samudrala@...el.com, sdf@...ichev.me, peter@...eblog.net,
m2shafiei@...terloo.ca, bjorn@...osinc.com, hch@...radead.org,
willy@...radead.org, willemdebruijn.kernel@...il.com, skhawaja@...gle.com,
kuba@...nel.org, Alexander Lobakin <aleksander.lobakin@...el.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
"open list:BPF [MISC] :Keyword:(?:b|_)bpf(?:b|_)" <bpf@...r.kernel.org>,
Christian Brauner <brauner@...nel.org>, "David S. Miller"
<davem@...emloft.net>, Donald Hunter <donald.hunter@...il.com>,
Jan Kara <jack@...e.cz>, Jesper Dangaard Brouer <hawk@...nel.org>,
Jiri Pirko <jiri@...nulli.us>, Johannes Berg <johannes.berg@...el.com>,
Jonathan Corbet <corbet@....net>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)"
<linux-fsdevel@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>,
Lorenzo Bianconi <lorenzo@...nel.org>, Martin Karsten
<mkarsten@...terloo.ca>, Mina Almasry <almasrymina@...gle.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH net-next v2 0/6] Suspend IRQs during application busy
periods
On 10/21/24 03:52, Joe Damato wrote:
> Greetings:
>
> Welcome to v2, see changelog below.
>
> This series introduces a new mechanism, IRQ suspension, which allows
> network applications using epoll to mask IRQs during periods of high
> traffic while also reducing tail latency (compared to existing
> mechanisms, see below) during periods of low traffic. In doing so, this
> balances CPU consumption with network processing efficiency.
>
> Martin Karsten (CC'd) and I have been collaborating on this series for
> several months and have appreciated the feedback from the community on
> our RFC [1]. We've updated the cover letter and kernel documentation in
> an attempt to more clearly explain how this mechanism works, how
> applications can use it, and how it compares to existing mechanisms in
> the kernel. We've added an additional test case, 'fullbusy', achieved by
> modifying libevent for comparison. See below for a detailed description,
> link to the patch, and test results.
>
> I briefly mentioned this idea at netdev conf 2024 (for those who were
> there) and Martin described this idea in an earlier paper presented at
> Sigmetrics 2024 [2].
>
> ~ The short explanation (TL;DR)
>
> We propose adding a new napi config parameter: irq_suspend_timeout to
> help balance CPU usage and network processing efficiency when using IRQ
> deferral and napi busy poll.
>
> If this parameter is set to a non-zero value *and* a user application
> has enabled preferred busy poll on a busy poll context (via the
> EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
> epoll ioctl for epoll_params")), then application calls to epoll_wait
> for that context will cause device IRQs and softirq processing to be
> suspended as long as epoll_wait successfully retrieves data from the
> NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
>
> If/when network traffic subsides and epoll_wait returns no data, IRQ
> suspension is immediately reverted back to the existing
> napi_defer_hard_irqs and gro_flush_timeout mechanism which was
> introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
> feature")).
>
> The irq_suspend_timeout serves as a safety mechanism. If userland takes
> a long time processing data, irq_suspend_timeout will fire and restart
> normal NAPI processing.
>
> For a more in depth explanation, please continue reading.
>
> ~ Comparison with existing mechanisms
>
> Interrupt mitigation can be accomplished in napi software, by setting
> napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
> in the NIC. This can be quite efficient, but in both cases, a fixed
> timeout (or packet count) needs to be configured. However, a fixed
> timeout cannot effectively support both low- and high-load situations:
>
> At low load, an application typically processes a few requests and then
> waits to receive more input data. In this scenario, a large timeout will
> cause unnecessary latency.
>
> At high load, an application typically processes many requests before
> being ready to receive more input data. In this case, a small timeout
> will likely fire prematurely and trigger irq/softirq processing, which
> interferes with the application's execution. This causes overhead, most
> likely due to cache contention.
>
> While NICs attempt to provide adaptive interrupt coalescing schemes,
> these cannot properly take into account application-level processing.
>
> An alternative packet delivery mechanism is busy-polling, which results
> in perfect alignment of application processing and network polling. It
> delivers optimal performance (throughput and latency), but results in
> 100% cpu utilization and is thus inefficient for below-capacity
> workloads.
>
> We propose to add a new packet delivery mode that properly alternates
> between busy polling and interrupt-based delivery depending on busy and
> idle periods of the application. During a busy period, the system
> operates in busy-polling mode, which avoids interference. During an idle
> period, the system falls back to interrupt deferral, but with a small
> timeout to avoid excessive latencies. This delivery mode can also be
> viewed as an extension of basic interrupt deferral, but alternating
> between a small and a very large timeout.
>
> This delivery mode is efficient, because it avoids softirq execution
> interfering with application processing during busy periods. It can be
> used with blocking epoll_wait to conserve cpu cycles during idle
> periods. The effect of alternating between busy and idle periods is that
> performance (throughput and latency) is very close to full busy polling,
> while cpu utilization is lower and very close to interrupt mitigation.
>
> ~ Usage details
>
> IRQ suspension is introduced via a per-NAPI configuration parameter that
> controls the maximum time that IRQs can be suspended.
>
> Here's how it is intended to work:
> - The user application (or system administrator) uses the netdev-genl
> netlink interface to set the pre-existing napi_defer_hard_irqs and
> gro_flush_timeout NAPI config parameters to enable IRQ deferral.
>
> - The user application (or system administrator) sets the proposed
> irq_suspend_timeout parameter via the netdev-genl netlink interface
> to a larger value than gro_flush_timeout to enable IRQ suspension.
>
> - The user application issues the existing epoll ioctl to set the
> prefer_busy_poll flag on the epoll context.
>
> - The user application then calls epoll_wait to busy poll for network
> events, as it normally would.
>
> - If epoll_wait returns events to userland, IRQs are suspended for the
> duration of irq_suspend_timeout.
>
> - If epoll_wait finds no events and the thread is about to go to
> sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
> is resumed.
>
> As long as epoll_wait is retrieving events, IRQs (and softirq
> processing) for the NAPI being polled remain disabled. When network
> traffic reduces, eventually a busy poll loop in the kernel will retrieve
> no data. When this occurs, regular IRQ deferral using gro_flush_timeout
> for the polled NAPI is re-enabled.
>
> Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
> automatically times out after the irq_suspend_timeout timer expires.
> Regular deferral is also immediately re-enabled when the epoll context
> is destroyed.
>
> ~ Usage scenario
>
> The target scenario for IRQ suspension as packet delivery mode is a
> system that runs a dominant application with substantial network I/O.
> The target application can be configured to receive input data up to a
> certain batch size (via epoll_wait maxevents parameter) and this batch
> size determines the worst-case latency that application requests might
> experience. Because packet delivery is suspended during the target
> application's processing, the batch size also determines the worst-case
> latency of concurrent applications using the same RX queue(s).
>
> gro_flush_timeout should be set as small as possible, but large enough to
> make sure that a single request is likely not being interfered with.
>
> irq_suspend_timeout is largely a safety mechanism against misbehaving
> applications. It should be set large enough to cover the processing of an
> entire application batch, i.e., the factor between gro_flush_timeout and
> irq_suspend_timeout should roughly correspond to the maximum batch size
> that the target application would process in one go.
>
> ~ Design rationale
>
> The implementation of the IRQ suspension mechanism very nicely dovetails
> with the existing mechanism for IRQ deferral when preferred busy poll is
> enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
> busy-polling"), see that commit message for more details).
>
> While it would be possible to inject the suspend timeout via
> the existing epoll ioctl, it is more natural to avoid this path for one
> main reason:
>
> An epoll context is linked to NAPI IDs as file descriptors are added;
> this means any epoll context might suddenly be associated with a
> different net_device if the application were to replace all existing
> fds with fds from a different device. In this case, the scope of the
> suspend timeout becomes unclear and many edge cases for both the user
> application and the kernel are introduced
>
> Only a single iteration through napi busy polling is needed for this
> mechanism to work effectively. Since an important objective for this
> mechanism is preserving cpu cycles, exactly one iteration of the napi
> busy loop is invoked when busy_poll_usecs is set to 0.
>
> ~ Important call outs in the implementation
>
> - Enabling per epoll-context preferred busy poll will now effectively
> lead to a nonblocking iteration through napi_busy_loop, even when
> busy_poll_usecs is 0. See patch 4.
>
> - Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
> the document for vxlan_snoop()").
>
> ~ Benchmark configs & descriptions
>
> The changes were benchmarked with memcached [3] using the benchmarking
> tool mutilate [4].
>
> To facilitate benchmarking, a small patch [5] was applied to memcached
> 1.6.29 to allow setting per-epoll context preferred busy poll and other
> settings via environment variables. Another small patch [6] was applied
> to libevent to enable full busy-polling.
>
> Multiple scenarios were benchmarked as described below and the scripts
> used for producing these results can be found on github [7] (note: all
> scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
> -N to memcached):
>
> - base:
> - no other options enabled
> - deferX:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to X,000
> - napibusy:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to 200,000
> - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
> busy_poll_budget = 64, prefer_busy_poll = true)
> - fullbusy:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to 5,000,000
> - enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
> busy_poll_budget = 64, prefer_busy_poll = true)
> - change memcached's nonblocking epoll_wait invocation (via
> libevent) to using a 1 ms timeout
> - suspendX:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to X,000
> - set irq_suspend_timeout to 20,000,000
> - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> busy_poll_budget = 64, prefer_busy_poll = true)
>
> ~ Benchmark results
>
> Tested on:
>
> Single socket AMD EPYC 7662 64-Core Processor
> Hyperthreading disabled
> 4 NUMA Zones (NPS=4)
> 16 CPUs per NUMA zone (64 cores total)
> 2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
>
> The test machine is configured such that a single interface has 8 RX
> queues. The queues' IRQs and memcached are pinned to CPUs that are
> NUMA-local to the interface which is under test. The NIC's interrupt
> coalescing configuration is left at boot-time defaults.
>
> Results:
>
> Results are shown below. The mechanism added by this series is
> represented by the 'suspend' cases. Data presented shows a summary over
> at least 10 runs of each test case [8] using the scripts on github [7].
> For latency, the median is shown. For throughput and CPU utilization,
> the average is shown.
>
> The results also include cycles-per-query (cpq) and
> instruction-per-query (ipq) metrics, following the methodology proposed
> in [2], to augment the CPU utilization numbers, which could be skewed
> due to frequency scaling. We find that this does not appear to be the
> case as CPU utilization and low-level metrics show similar trends.
>
> These results were captured using the scripts on github [7] to
> illustrate how this approach compares with other pre-existing
> mechanisms. This data is not to be interpreted as scientific data
> captured in a fully isolated lab setting, but instead as best effort,
> illustrative information comparing and contrasting tradeoffs.
>
> The absolute QPS results are higher than our previous submission, but
> the relative differences between variants are equivalent. Because the
> patches have been rebased on 6.12, several factors have likely
> influenced the overall performance. Most importantly, we had to switch
> to a new set of basic kernel options, which has likely altered the
> baseline performance. Because the overall comparison of variants still
> holds, we have not attempted to recreate the exact set of kernel options
> from the previous submission.
>
> Compare:
> - Throughput (MAX) and latencies of base vs suspend.
> - CPU usage of napibusy and fullbusy during lower load (200K, 400K for
> example) vs suspend.
> - Latency of the defer variants vs suspend as timeout and load
> increases.
>
> The overall takeaway is that the suspend variants provide a superior
> combination of high throughput, low latency, and low cpu utilization
> compared to all other variants. Each of the suspend variants works very
> well, but some fine-tuning between latency and cpu utilization is still
> possible by tuning the small timeout (gro_flush_timeout).
>
> Note: we've reorganized the results to make comparison among testcases
> with the same load easier.
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base 200K 200024 127 254 458 25 12748 11289
> defer10 200K 199991 64 128 166 27 18763 16574
> defer20 200K 199986 72 135 178 25 15405 14173
> defer50 200K 200025 91 149 198 23 12275 12203
> defer200 200K 199996 182 266 326 18 8595 9183
> fullbusy 200K 200040 58 123 167 100 43641 23145
> napibusy 200K 200009 115 244 299 56 24797 24693
> suspend10 200K 200005 63 128 167 32 19559 17240
> suspend20 200K 199952 69 132 170 29 16324 14838
> suspend50 200K 200019 84 144 189 26 13106 12516
> suspend200 200K 199978 168 264 326 20 9331 9643
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base 400K 400017 157 292 762 39 9287 9325
> defer10 400K 400033 71 141 204 53 13950 12943
> defer20 400K 399935 79 150 212 47 12027 11673
> defer50 400K 399888 101 171 231 39 9556 9921
> defer200 400K 399993 200 287 358 32 7428 8576
> fullbusy 400K 400018 63 132 203 100 21827 16062
> napibusy 400K 399970 89 230 292 83 18156 16508
> suspend10 400K 400061 69 139 202 54 13576 13057
> suspend20 400K 399988 73 144 206 49 11930 11773
> suspend50 400K 399975 88 161 218 42 9996 10270
> suspend200 400K 399954 172 276 353 34 7847 8713
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base 600K 600031 166 289 631 61 9188 8787
> defer10 600K 599967 85 167 262 75 11833 10947
> defer20 600K 599888 89 165 243 66 10513 10362
> defer50 600K 600072 109 185 253 55 8664 9190
> defer200 600K 599951 222 315 393 45 6892 8213
> fullbusy 600K 600041 69 145 227 100 14549 13936
> napibusy 600K 599980 79 188 280 96 13927 14155
> suspend10 600K 600028 78 159 267 69 10877 11032
> suspend20 600K 600026 81 159 254 64 9922 10320
> suspend50 600K 600007 96 178 258 57 8681 9331
> suspend200 600K 599964 177 295 369 47 7115 8366
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base 800K 800034 198 329 698 84 9366 8338
> defer10 800K 799718 243 642 1457 95 10532 9007
> defer20 800K 800009 132 245 399 89 9956 8979
> defer50 800K 800024 136 228 378 80 9002 8598
> defer200 800K 799965 255 362 473 66 7481 8147
> fullbusy 800K 799927 78 157 253 100 10915 12533
> napibusy 800K 799870 81 173 273 99 10826 12532
> suspend10 800K 799991 84 167 269 83 9380 9802
> suspend20 800K 799979 90 172 290 78 8765 9404
> suspend50 800K 800031 106 191 307 71 7945 8805
> suspend200 800K 799905 182 307 411 62 6985 8242
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base 1000K 919543 3805 6390 14229 98 9324 7978
> defer10 1000K 850751 4574 7382 15370 99 10218 8470
> defer20 1000K 890296 4736 6862 14858 99 9708 8277
> defer50 1000K 932694 3463 6180 13251 97 9148 8053
> defer200 1000K 951311 3524 6052 13599 96 8875 7845
> fullbusy 1000K 1000011 90 181 278 100 8731 10686
> napibusy 1000K 1000050 93 184 280 100 8721 10547
> suspend10 1000K 999962 101 193 306 92 8138 8980
> suspend20 1000K 1000030 103 191 324 88 7844 8763
> suspend50 1000K 1000001 114 202 320 83 7396 8431
> suspend200 1000K 999965 185 314 428 76 6733 8072
>
> testcase load qps avglat 95%lat 99%lat cpu cpq ipq
> base MAX 1005592 4651 6594 14979 100 8679 7918
> defer10 MAX 928204 5106 7286 15199 100 9398 8380
> defer20 MAX 984663 4774 6518 14920 100 8861 8063
> defer50 MAX 1044099 4431 6368 14652 100 8350 7948
> defer200 MAX 1040451 4423 6610 14674 100 8380 7931
> fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
> napibusy MAX 1077516 4345 10155 15957 100 8080 7842
> suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
> suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
> suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
> suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
>
> ~ FAQ
>
> - Can the new timeout value be threaded through the new epoll ioctl ?
>
> Only with difficulty. The epoll ioctl sets options on an epoll
> context and the NAPI ID associated with an epoll context can change
> based on what file descriptors a user app adds to the epoll context.
> This would introduce complexity in the API from the user perspective
> and also complexity in the kernel.
>
> - Can irq suspend be built by combining NIC coalescing and
> gro_flush_timeout ?
>
> No. The problem is that the long timeout must engage if and only if
> prefer-busy is active.
>
> When using NIC coalescing for the short timeout (without
> napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
> period will trigger softirq, which will run napi polling. At this
> point, prefer-busy is not active, so NIC interrupts would be
> re-enabled. Then it is not possible for the longer timeout to
> interject to switch control back to polling. In other words, only by
> using the software timer for the short timeout, it is possible to
> extend the timeout without having to reprogram the NIC timer or
> reach down directly and disable interrupts.
>
> Using gro_flush_timeout for the long timeout also has problems, for
> the same underlying reason. In the current napi implementation,
> gro_flush_timeout is not tied to prefer-busy. We'd either have to
> change that and in the process modify the existing deferral
> mechanism, or introduce a state variable to determine whether
> gro_flush_timeout is used as long timeout for irq suspend or whether
> it is used for its default purpose. In an earlier version, we did
> try something similar to the latter and made it work, but it ends up
> being a lot more convoluted than our current proposal.
>
> - Isn't it already possible to combine busy looping with irq deferral?
>
> Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
> gro_flush_timeout is a precondition for prefer_busy_poll to have an
> effect. If the application also uses a tight busy loop with
> essentially nonblocking epoll_wait (accomplished with a very short
> timeout parameter), this is the fullbusy case shown in the results.
> An application using blocking epoll_wait is shown as the napibusy
> case in the result. It's a hybrid approach that provides limited
> latency benefits compared to the base case and plain irq deferral,
> but not as good as fullbusy or suspend.
>
> ~ Special thanks
>
> Several people were involved in earlier stages of the development of this
> mechanism whom we'd like to thank:
>
> - Peter Cai (CC'd), for the initial kernel patch and his contributions
> to the paper.
>
> - Mohammadamin Shafie (CC'd), for testing various versions of the kernel
> patch and providing helpful feedback.
>
> Thanks,
> Martin and Joe
>
> [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
> [2]: https://doi.org/10.1145/3626780
> [3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
> [4]: https://github.com/leverich/mutilate
> [5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
> [6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/libevent.patch
> [7]: https://github.com/martinkarsten/irqsuspend
> [8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
>
> v2:
> - Cover letter updated, including a re-run of test data.
> - Patch 1 rewritten to use netdev-genl instead of sysfs.
> - Patch 3 updated with a comment added to napi_resume_irqs.
> - Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
> Annotate data-race of busy_poll_usecs") has been picked up from VFS.
> - Patch 6 updated the kernel documentation.
>
> rfc -> v1: https://lore.kernel.org/netdev/20240823173103.94978-1-jdamato@fastly.com/
> - Cover letter updated to include more details.
> - Patch 1 updated to remove the documentation added. This was moved to
> patch 6 with the rest of the docs (see below).
> - Patch 5 updated to fix an error uncovered by the kernel build robot.
> See patch 5's changelog for more details.
> - Patch 6 added which updates kernel documentation.
The changes makes sense to me, and I could not find any obvious issue in
the patches.
I think this deserve some - even basic - self-tests coverage. Note that
you can enable GRO on veth devices to make NAPI instances avail there.
Possibly you could opt for a drivers/net defaulting to veth usage and
allowing the user to select real H/W via env variables.
Thanks,
Paolo
Powered by blists - more mailing lists