Message-ID: <d53e8aa6-a5eb-41f4-9a4c-70d04a5ca748@uwaterloo.ca>
Date: Mon, 12 Aug 2024 20:04:13 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Stanislav Fomichev <sdf@...ichev.me>
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
 amritha.nambiar@...el.com, sridhar.samudrala@...el.com,
 Alexander Lobakin <aleksander.lobakin@...el.com>,
 Alexander Viro <viro@...iv.linux.org.uk>, Breno Leitao <leitao@...ian.org>,
 Christian Brauner <brauner@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
 Jiri Pirko <jiri@...nulli.us>, Johannes Berg <johannes.berg@...el.com>,
 Jonathan Corbet <corbet@....net>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 "open list:FILESYSTEMS (VFS and infrastructure)"
 <linux-fsdevel@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>,
 Lorenzo Bianconi <lorenzo@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

On 2024-08-12 19:03, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>> On 08/12, Joe Damato wrote:
>>>> Greetings:
>>>>
>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>>>> ways of reducing tail latency when using epoll-based busy poll and we'd
>>>> love to get feedback from the list on the code in this series. This is
>>>> the idea I mentioned at netdev conf, for those who were there. Barring
>>>> any major issues, we hope to submit this officially shortly after RFC.
>>>>
>>>> The basic idea for suspending IRQs in this manner was described in an
>>>> earlier paper presented at Sigmetrics 2024 [1].
>>>
>>> Let me explicitly call out the paper. Very nice analysis!
>>
>> Thank you!
>>
>> [snip]
>>
>>>> Here's how it is intended to work:
>>>>     - An administrator sets the existing sysfs parameters for
>>>>       defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>
>>>>     - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>       to a larger value than gro_flush_timeout to enable IRQ suspension.
>>>
>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>> Is it defer_hard_irqs_count? Or do you want a separate timeout only for the
>>> prefer_busy_poll case (why?)? Because looking at the first two patches,
>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>> and I don't see how it helps.
>>
>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
>> deferral mechanism and as such, always active when configured. Its static
>> periodic softirq processing leads to a situation where:
>>
>> - A long gro-flush-timeout causes high latencies when load is sufficiently
>> below capacity, or
>>
>> - a short gro-flush-timeout causes overhead when softirq execution
>> asynchronously competes with application processing at high load.
>>
>> The shortcomings of this are documented (to some extent) by our experiments.
>> See defer20, which works well at low load but has problems at high load,
>> while defer200 has higher latency at low load.
>>
>> irq-suspend-timeout is only active when an application uses
>> prefer-busy-polling and in that case, produces a nice alternating pattern of
>> application processing and networking processing (similar to what we
>> describe in the paper). This then works well with both low and high load.
> 
> So you only want it for the prefer-busy-polling case, makes sense. I was
> a bit confused by the difference between defer200 and suspend200,
> but now I see that defer200 does not enable busypoll.
> 
> I'm assuming that if you enable busypoll in defer200 case, the numbers
> should be similar to suspend200 (ignoring potentially affecting
> non-busypolling queues due to higher gro_flush_timeout).

defer200 + napi busy poll is essentially what we labelled "busy" and it 
does not perform as well, since it still suffers interference between 
application and softirq processing.

>>> Maybe expand more on what code paths are we trying to improve? Existing
>>> busy polling code is not super readable, so would be nice to simplify
>>> it a bit in the process (if possible) instead of adding one more tunable.
>>
>> There are essentially three possible loops for network processing:
>>
>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>
>> 2) timer -> softirq -> napi poll; this is the deferred irq processing
>> scheme with the shortcomings described above
>>
>> 3) epoll -> busy-poll -> napi poll
>>
>> If a system is configured for 1), not much can be done, as it is difficult
>> to interject anything into this loop without adding state and side effects.
>> This is what we tried for the paper, but it ended up being a hack.
>>
>> If however the system is configured for irq deferral, Loops 2) and 3)
>> "wrestle" with each other for control. Injecting the larger
>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>> of Loop 3) and creates the nice pattern described above.
> 
> And you hit (2) when the epoll goes to sleep and/or when the userspace
> isn't fast enough to keep up with the timer, presumably? I wonder
> if we need to use this opportunity and do a proper API as Joe hints in the
> cover letter. Something over netlink to say "I'm gonna busy-poll on
> this queue / napi_id and with this timeout". And then we can essentially make
> gro_flush_timeout per queue (and avoid
> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> too hacky already :-(

If someone implemented the necessary changes to make these 
parameters per-napi, this would improve things further, but note that 
the current proposal gives strong performance across a range of 
workloads, which is otherwise difficult to impossible to achieve.

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake 
of an individual queue or application to make sure that IRQ suspension 
is enabled/disabled right away when the state of the system changes from 
busy to idle and back.
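
As a rough illustration of that switch (a toy model only, not code from
the patches): assume a per-NAPI flag selects which timeout arms the timer
in Loop 2). The struct and the model_* functions below are hypothetical
names, and tying suspend/resume to whether a busy-poll pass found events
is an assumption made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of one NAPI instance; not kernel code. */
struct napi_model {
	uint64_t gro_flush_timeout_ns;   /* regular deferral timeout */
	uint64_t irq_suspend_timeout_ns; /* larger timeout while suspended */
	bool     irqs_suspended;         /* set while the app is busy */
};

/* Timeout the model would arm for the next timer in Loop 2). */
uint64_t model_timer_ns(const struct napi_model *n)
{
	if (n->irqs_suspended && n->irq_suspend_timeout_ns != 0)
		return n->irq_suspend_timeout_ns;
	return n->gro_flush_timeout_ns;
}

/*
 * Called after one pass through the epoll busy-poll loop.
 * Assumption: events found means "busy" (suspend), none means
 * "idle" (resume), so the state flips right away in both directions.
 */
void model_busy_poll_done(struct napi_model *n, int events_found)
{
	n->irqs_suspended = (events_found > 0);
}
```

With the suspend200 values, the model arms 20,000,000 ns while the
application keeps finding events and drops back to 200,000 ns as soon as
a pass comes back empty.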

>> [snip]
>>
>>>>     - suspendX:
>>>>       - set defer_hard_irqs to 100
>>>>       - set gro_flush_timeout to X,000
>>>>       - set irq_suspend_timeout to 20,000,000
>>>>       - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>>>         busy_poll_budget = 64, prefer_busy_poll = true)
>>>
>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
>>> to the busy_poll sysctl value?
>>
>> Before this patch set, ep_poll only calls napi_busy_poll if busy_poll
>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
>> busy-polling even when the application does not actually need or want it.
>> Only one iteration through the busy loop is needed to make the new scheme
>> work. Additional napi busy polling over and above is optional.
> 
> Ack, thanks, was trying to understand why not stay with
> busy_poll_usecs=64 for consistency. But I guess you were just
> trying to show that patch 4/5 works.

Right, and we would potentially be wasting CPU cycles by adding more 
busy-looping.
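
For reference, a minimal user-space sketch of the ioctl settings used in
the suspendX setups (busy_poll_usecs = 0, busy_poll_budget = 64,
prefer_busy_poll = true). It assumes the epoll ioctl interface from
Linux 6.9; the fallback definitions are only there for older headers,
and the two helper function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/*
 * Fallback for headers that predate the epoll busy-poll ioctl.
 * Layout and ioctl number are meant to mirror the kernel's
 * include/uapi/linux/eventpoll.h.
 */
#ifndef EPIOCSPARAMS
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t  prefer_busy_poll;
	uint8_t  __pad; /* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

/* Hypothetical helper: the parameter values quoted for suspendX. */
struct epoll_params suspend_params(void)
{
	struct epoll_params p;

	memset(&p, 0, sizeof(p)); /* also clears __pad */
	p.busy_poll_usecs = 0;    /* no extra busy-looping requested */
	p.busy_poll_budget = 64;
	p.prefer_busy_poll = 1;
	return p;
}

/*
 * Hypothetical helper: apply the parameters to an epoll instance.
 * Returns 0 on success, -1 with errno set (e.g. on kernels without
 * the ioctl).
 */
int enable_suspend_busy_poll(int epfd)
{
	struct epoll_params p = suspend_params();

	assert(epfd >= 0);
	return ioctl(epfd, EPIOCSPARAMS, &p);
}
```

An application would call enable_suspend_busy_poll() once on the fd
returned by epoll_create1(), before entering its epoll_wait() loop. The
sysfs settings (defer_hard_irqs, gro_flush_timeout, irq_suspend_timeout)
still have to be configured separately by the administrator.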

Thanks,
Martin
