netdev - Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Message-ID: <5e52b556-fe49-4fe0-8bd3-543b3afd89fa@uwaterloo.ca>
Date: Mon, 12 Aug 2024 22:35:23 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Stanislav Fomichev <sdf@...ichev.me>
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
 amritha.nambiar@...el.com, sridhar.samudrala@...el.com,
 Alexander Lobakin <aleksander.lobakin@...el.com>,
 Alexander Viro <viro@...iv.linux.org.uk>, Breno Leitao <leitao@...ian.org>,
 Christian Brauner <brauner@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
 Jiri Pirko <jiri@...nulli.us>, Johannes Berg <johannes.berg@...el.com>,
 Jonathan Corbet <corbet@....net>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 "open list:FILESYSTEMS (VFS and infrastructure)"
 <linux-fsdevel@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>,
 Lorenzo Bianconi <lorenzo@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

On 2024-08-12 21:54, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>> On 08/12, Martin Karsten wrote:
>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>> On 08/12, Joe Damato wrote:
>>>>>> Greetings:
>>>>>>
>>>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>>>>>> ways of reducing tail latency when using epoll-based busy poll and we'd
>>>>>> love to get feedback from the list on the code in this series. This is
>>>>>> the idea I mentioned at netdev conf, for those who were there. Barring
>>>>>> any major issues, we hope to submit this officially shortly after RFC.
>>>>>>
>>>>>> The basic idea for suspending IRQs in this manner was described in an
>>>>>> earlier paper presented at Sigmetrics 2024 [1].
>>>>>
>>>>> Let me explicitly call out the paper. Very nice analysis!
>>>>
>>>> Thank you!
>>>>
>>>> [snip]
>>>>
>>>>>> Here's how it is intended to work:
>>>>>>      - An administrator sets the existing sysfs parameters for
>>>>>>        defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>>>
>>>>>>      - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>>>        to a larger value than gro-timeout to enable IRQ suspension.
>>>>>
>>>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
>>>>> prefer_busy_poll case (why?)? Because looking at the first two patches,
>>>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>>>> and I don't see how it helps.
>>>>
>>>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
>>>> deferral mechanism and as such, always active when configured. Its static
>>>> periodic softirq processing leads to a situation where:
>>>>
>>>> - A long gro-flush-timeout causes high latencies when load is sufficiently
>>>> below capacity, or
>>>>
>>>> - a short gro-flush-timeout causes overhead when softirq execution
>>>> asynchronously competes with application processing at high load.
>>>>
>>>> The shortcomings of this are documented (to some extent) by our experiments.
>>>> See defer20 working well at low load, but having problems at high load,
>>>> while defer200 has higher latency at low load.
>>>>
>>>> irq-suspend-timeout is only active when an application uses
>>>> prefer-busy-polling and in that case, produces a nice alternating pattern of
>>>> application processing and networking processing (similar to what we
>>>> describe in the paper). This then works well with both low and high load.
>>>
>>> So you only want it for the prefer-busy-polling case, makes sense. I was
>>> a bit confused by the difference between defer200 and suspend200,
>>> but now I see that defer200 does not enable busypoll.
>>>
>>> I'm assuming that if you enable busypoll in defer200 case, the numbers
>>> should be similar to suspend200 (ignoring potentially affecting
>>> non-busypolling queues due to higher gro_flush_timeout).
>>
>> defer200 + napi busy poll is essentially what we labelled "busy" and it does
>> not perform as well, since it still suffers interference between application
>> and softirq processing.
> 
> With all your patches applied? Why? Userspace not keeping up?

Note that our "busy" case does not use our patches.

As illustrated by our performance numbers, it performs better than the 
base case, but at the cost of higher CPU utilization, and it is still 
not as good as suspend20.

Explanation (conjecture):

It boils down to having to set a particular static value for 
gro-flush-timeout that is then always active.

If busy-poll + application processing takes longer than this timeout, 
the next softirq runs while the application is still active, which 
causes interference.

Once a softirq runs, the irq-loop (Loop 2) takes control. When the 
application thread comes back to epoll_wait, it already finds data, so 
ep_poll does not run napi_busy_poll at all and the irq-loop stays in 
control.

This continues until by chance the application finds no readily 
available data when calling epoll_wait and ep_poll runs another 
napi_busy_poll. Then the system switches back to busy-polling mode.

So essentially the system non-deterministically alternates between 
busy-polling and irq deferral. irq deferral determines the high-order 
tail latencies, but there is still enough interference to make a 
difference. It's not as bad as in the base case, but not as good as 
properly controlled irq suspension.
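
To make the conjecture concrete, here is a toy model of the switching 
described above. It is only a sketch: the enum, the function name and 
the parameters are made up for illustration and do not correspond to 
kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Which loop currently drives NAPI processing (illustrative names). */
enum rx_mode {
	RX_BUSY_POLL,    /* Loop 3: epoll -> busy-poll -> napi poll */
	RX_IRQ_DEFERRAL, /* Loop 2: timer -> softirq -> napi poll */
};

/*
 * cycle_ns:             duration of one busy-poll + application
 *                       processing cycle
 * gro_flush_timeout_ns: the static deferral timer
 * data_ready:           epoll_wait finds events already queued
 */
static enum rx_mode next_mode(enum rx_mode cur, uint64_t cycle_ns,
			      uint64_t gro_flush_timeout_ns, bool data_ready)
{
	assert(gro_flush_timeout_ns > 0);

	if (cur == RX_BUSY_POLL) {
		/* Timer fires mid-cycle: softirq runs alongside the app. */
		if (cycle_ns > gro_flush_timeout_ns)
			return RX_IRQ_DEFERRAL;
		return RX_BUSY_POLL;
	}

	/* Data already queued: ep_poll skips busy polling, Loop 2 stays. */
	if (data_ready)
		return RX_IRQ_DEFERRAL;

	/* Nothing queued: ep_poll busy-polls again, Loop 3 takes over. */
	return RX_BUSY_POLL;
}
```

In this model, whether the system leaves irq deferral depends only on 
data_ready, which is a matter of arrival timing. That is the 
non-determinism referred to above.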

>>>>> Maybe expand more on what code paths are we trying to improve? Existing
>>>>> busy polling code is not super readable, so would be nice to simplify
>>>>> it a bit in the process (if possible) instead of adding one more tunable.
>>>>
>>>> There are essentially three possible loops for network processing:
>>>>
>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>>>
>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
>>>> with the shortcomings described above
>>>>
>>>> 3) epoll -> busy-poll -> napi poll
>>>>
>>>> If a system is configured for 1), not much can be done, as it is difficult
>>>> to interject anything into this loop without adding state and side effects.
>>>> This is what we tried for the paper, but it ended up being a hack.
>>>>
>>>> If however the system is configured for irq deferral, Loops 2) and 3)
>>>> "wrestle" with each other for control. Injecting the larger
>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>>>> of Loop 3) and creates the nice pattern described above.
>>>
>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
>>> isn't fast enough to keep up with the timer, presumably? I wonder
>>> if we need to use this opportunity and do a proper API as Joe hints in the
>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
>>> this queue / napi_id and with this timeout". And then we can essentially make
>>> gro_flush_timeout per queue (and avoid
>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
>>> too hacky already :-(
>>
>> If someone would implement the necessary changes to make these parameters
>> per-napi, this would improve things further, but note that the current
>> proposal gives strong performance across a range of workloads, which is
>> otherwise difficult to impossible to achieve.
> 
> Let's see what other people have to say. But we tried to do a similar
> setup at Google recently and getting all these parameters right
> was not trivial. Joe's recent patch series to push some of these into
> epoll context is a step in the right direction. It would be nice to
> have a more explicit interface to express busy polling preference for
> the users vs chasing a bunch of global tunables and fighting against softirq
> wakeups.

One of the goals of this patch set is to reduce parameter tuning and 
make the parameter setting independent of workload dynamics, so it 
should make things easier. This is of course notwithstanding that 
per-napi settings would be even better.

If you are able to share more details of your previous experiments (here 
or off-list), I would be very interested.

>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>> an individual queue or application to make sure that IRQ suspension is
>> enabled/disabled right away when the state of the system changes from busy
>> to idle and back.
> 
> Can we not handle everything in napi_busy_loop? If we can mark some napi
> contexts as "explicitly polled by userspace with a larger defer timeout",
> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> which is more like "this particular napi_poll call is user busy polling".

Then either the application needs to be polling all the time (wasting 
cpu cycles) or latencies will be determined by the timeout.

Only when switching back and forth between polling and interrupts is it 
possible to get low latencies across a large spectrum of offered loads 
without burning cpu cycles at 100%.
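
As a rough illustration of that switching, the sketch below models 
which timeout the deferral timer would use. The struct and function 
names are hypothetical, not the code from the patches. It assumes that 
"busy" means the application's busy poll found events and "idle" means 
it found none, and that both timeouts are in nanoseconds, as 
gro_flush_timeout is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-NAPI state for this model only. */
struct napi_model {
	uint64_t gro_flush_timeout_ns;   /* short: normal irq deferral */
	uint64_t irq_suspend_timeout_ns; /* long: safety net while suspended */
	bool suspended;
};

/* Busy (events found): suspend. Idle (nothing found): resume right away. */
static void on_busy_poll_result(struct napi_model *n, int events_found)
{
	assert(events_found >= 0);
	n->suspended = events_found > 0;
}

/* Timeout that the 'timer' in Loop 2 would be armed with. */
static uint64_t timer_timeout_ns(const struct napi_model *n)
{
	return n->suspended ? n->irq_suspend_timeout_ns
			    : n->gro_flush_timeout_ns;
}
```

With the suspend20 values (20,000 and 20,000,000), the timer only gets 
in the way if the application stalls for 20 ms, while an idle system 
falls back to the short deferral timeout and then to interrupts.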

>>>> [snip]
>>>>
>>>>>>      - suspendX:
>>>>>>        - set defer_hard_irqs to 100
>>>>>>        - set gro_flush_timeout to X,000
>>>>>>        - set irq_suspend_timeout to 20,000,000
>>>>>>        - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>>>>>          busy_poll_budget = 64, prefer_busy_poll = true)
>>>>>
>>>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
>>>>> to busy_poll sysctl value?
>>>>
>>>> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
>>>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
>>>> busy-polling even when the application does not actually need or want it.
>>>> Only one iteration through the busy loop is needed to make the new scheme
>>>> work. Additional napi busy polling over and above is optional.
>>>
>>> Ack, thanks, was trying to understand why not stay with
>>> busy_poll_usecs=64 for consistency. But I guess you were just
>>> trying to show that patch 4/5 works.
>>
>> Right, and we would potentially be wasting CPU cycles by adding more
>> busy-looping.
> 
> Or potentially improving the latency more if you happen to get more packets
> during busy_poll_usecs duration? I'd imagine some applications might
> prefer to 100% busy poll without ever going to sleep (that would probably
> require getting rid of napi_id tracking in epoll, but that's a different story).

Yes, one could do full application-to-napi busy polling. The performance 
would be slightly better than irq suspension, but it would be quite 
wasteful during low load. One premise for our work is that saving cycles 
is a meaningful objective.
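
For reference, the epoll settings quoted above for suspendX 
(busy_poll_usecs = 0, busy_poll_budget = 64, prefer_busy_poll = true) 
could be applied roughly as follows. This is a minimal sketch: the two 
helper names are made up, and the fallback definitions are only there 
for systems whose headers do not yet provide EPIOCSPARAMS.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Fallback for older headers; mirrors include/uapi/linux/eventpoll.h. */
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t prefer_busy_poll;
	uint8_t __pad; /* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

static_assert(sizeof(struct epoll_params) == 8, "unexpected struct layout");

/* Hypothetical helper: the suspendX epoll settings from the cover letter. */
static struct epoll_params suspend_params(void)
{
	struct epoll_params p = { 0 };

	p.busy_poll_usecs = 0;   /* no extra busy looping beyond one pass */
	p.busy_poll_budget = 64;
	p.prefer_busy_poll = 1;
	return p;
}

/* Hypothetical helper: returns 0 on success, -errno on failure. */
static int apply_suspend_params(int epfd)
{
	struct epoll_params p = suspend_params();

	if (ioctl(epfd, EPIOCSPARAMS, &p) < 0)
		return -errno;
	return 0;
}
```

An application would call apply_suspend_params() once on the fd 
returned by epoll_create1(), before entering its epoll_wait loop. The 
sysfs parameters still have to be set separately by an administrator.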

Thanks,
Martin