netdev - Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6f40b6df-4452-48f6-b552-0eceaa1f0bbc@uwaterloo.ca>
Date: Tue, 13 Aug 2024 09:18:13 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Stanislav Fomichev <sdf@...ichev.me>
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
 amritha.nambiar@...el.com, sridhar.samudrala@...el.com,
 Alexander Lobakin <aleksander.lobakin@...el.com>,
 Alexander Viro <viro@...iv.linux.org.uk>, Breno Leitao <leitao@...ian.org>,
 Christian Brauner <brauner@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
 Jiri Pirko <jiri@...nulli.us>, Johannes Berg <johannes.berg@...el.com>,
 Jonathan Corbet <corbet@....net>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 "open list:FILESYSTEMS (VFS and infrastructure)"
 <linux-fsdevel@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>,
 Lorenzo Bianconi <lorenzo@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

On 2024-08-13 00:07, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>> On 08/12, Martin Karsten wrote:
>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>> On 08/12, Martin Karsten wrote:
>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>> Greetings:

[snip]

>>>>>>> Maybe expand more on what code paths are we trying to improve? Existing
>>>>>>> busy polling code is not super readable, so would be nice to simplify
>>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
>>>>>>
>>>>>> There are essentially three possible loops for network processing:
>>>>>>
>>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>>>>>
>>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
>>>>>> with the shortcomings described above
>>>>>>
>>>>>> 3) epoll -> busy-poll -> napi poll
>>>>>>
>>>>>> If a system is configured for 1), not much can be done, as it is difficult
>>>>>> to interject anything into this loop without adding state and side effects.
>>>>>> This is what we tried for the paper, but it ended up being a hack.
>>>>>>
>>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
>>>>>> "wrestle" with each other for control. Injecting the larger
>>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>>>>>> of Loop 3) and creates the nice pattern describe above.
>>>>>
>>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
>>>>> isn't fast enough to keep up with the timer, presumably? I wonder
>>>>> if need to use this opportunity and do proper API as Joe hints in the
>>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
>>>>> this queue / napi_id and with this timeout". And then we can essentially make
>>>>> gro_flush_timeout per queue (and avoid
>>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
>>>>> too hacky already :-(
>>>>
>>>> If someone would implement the necessary changes to make these parameters
>>>> per-napi, this would improve things further, but note that the current
>>>> proposal gives strong performance across a range of workloads, which is
>>>> otherwise difficult to impossible to achieve.
>>>
>>> Let's see what other people have to say. But we tried to do a similar
>>> setup at Google recently and getting all these parameters right
>>> was not trivial. Joe's recent patch series to push some of these into
>>> epoll context are a step in the right direction. It would be nice to
>>> have more explicit interface to express busy poling preference for
>>> the users vs chasing a bunch of global tunables and fighting against softirq
>>> wakups.
>>
>> One of the goals of this patch set is to reduce parameter tuning and make
>> the parameter setting independent of workload dynamics, so it should make
>> things easier. This is of course notwithstanding that per-napi settings
>> would be even better.
>>
>> If you are able to share more details of your previous experiments (here or
>> off-list), I would be very interested.
> 
> We went through a similar exercise of trying to get the tail latencies down.
> Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> we went with a hard-coded napi_id argument instead of tracking) and trying to
> get a workable set of budget/timeout/gro_flush. We were fine with burning all
> cpu capacity we had and no sleep at all, so we ended up having a bunch
> of special cases in epoll loop to avoid the sleep.
> 
> But we were trying to make a different model work (the one you mention in the
> paper as well) where the userspace busy-pollers are just running napi_poll
> on one cpu and the actual work is consumed by the userspace on a different cpu.
> (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> and another epoll fd with actual sockets for signaling).
> 
> This mode has a different set of challenges with socket lock, socket rx
> queue and the backlog processing :-(

I agree. That model has challenges and is extremely difficult to tune right.

>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>> an individual queue or application to make sure that IRQ suspension is
>>>> enabled/disabled right away when the state of the system changes from busy
>>>> to idle and back.
>>>
>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>> which is more like "this particular napi_poll call is user busy polling".
>>
>> Then either the application needs to be polling all the time (wasting cpu
>> cycles) or latencies will be determined by the timeout.
>>
>> Only when switching back and forth between polling and interrupts is it
>> possible to get low latencies across a large spectrum of offered loads
>> without burning cpu cycles at 100%.
> 
> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> about ep_suspend_napi_irqs/napi_resume_irqs.

Thanks for probing and double-checking everything! Feedback is important 
for us to properly document our proposal.

> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> disabling napi during busy polling is super useful, but it would still
> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> a per-napi basis imho.

Fingers crossed. I hope this patch will be accepted, because it has 
practical performance and efficiency benefits, and that this will 
further increase the motivation to re-design the entire irq 
defer(/suspend) infrastructure for per-napi settings.

Thanks,
Martin