Message-ID: <02091b4b-de85-457d-993e-0548f788f4a1@uwaterloo.ca>
Date: Fri, 16 Aug 2024 16:03:26 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Joe Damato <jdamato@...tly.com>
Cc: Samiullah Khawaja <skhawaja@...gle.com>,
Stanislav Fomichev <sdf@...ichev.me>, netdev@...r.kernel.org,
amritha.nambiar@...el.com, sridhar.samudrala@...el.com,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Alexander Viro <viro@...iv.linux.org.uk>, Breno Leitao <leitao@...ian.org>,
Christian Brauner <brauner@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
Jiri Pirko <jiri@...nulli.us>, Johannes Berg <johannes.berg@...el.com>,
Jonathan Corbet <corbet@....net>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)"
<linux-fsdevel@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>,
Lorenzo Bianconi <lorenzo@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
On 2024-08-16 13:01, Willem de Bruijn wrote:
> Joe Damato wrote:
>> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
>>> Willem de Bruijn wrote:
>>>> Martin Karsten wrote:
>>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote:
>>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>>>>>>
>>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote:
>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>>>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>>>>>>>>> Greetings:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is
>>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy
>>>>>>>>>>> to idle and back.
>>>>>>>>>>
>>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>>>>>>>>> which is more like "this particular napi_poll call is user busy polling".
>>>>>>>>>
>>>>>>>>> Then either the application needs to be polling all the time (wasting cpu
>>>>>>>>> cycles) or latencies will be determined by the timeout.
>>>>>> But if I understand correctly, this means that if the application
>>>>>> thread that is supposed to do napi busy polling gets busy doing
>>>>>> work on the new data/events in userspace, napi polling will not
>>>>>> be done until the suspend_timeout triggers? Do you dispatch work
>>>>>> to separate worker threads, in userspace, from the thread that is
>>>>>> doing epoll_wait?
>>>>>
>>>>> Yes, napi polling is suspended while the application is busy between
>>>>> epoll_wait calls. That's where the benefits are coming from.
>>>>>
>>>>> The consequences depend on the nature of the application and overall
>>>>> preferences for the system. If there's a "dominant" application for a
>>>>> number of queues and cores, the resulting latency for other background
>>>>> applications using the same queues might not be a problem at all.
>>>>>
>>>>> One other simple mitigation is limiting the number of events that each
>>>>> epoll_wait call accepts. Note that this batch size also determines the
>>>>> worst-case latency for the application in question, so there is a
>>>>> natural incentive to keep it limited.
>>>>>
>>>>> A more complex application design, like you suggest, might also be an
>>>>> option.
>>>>>
>>>>>>>>> Only when switching back and forth between polling and interrupts is it
>>>>>>>>> possible to get low latencies across a large spectrum of offered loads
>>>>>>>>> without burning cpu cycles at 100%.
>>>>>>>>
>>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
>>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs.
>>>>>>>
>>>>>>> Thanks for probing and double-checking everything! Feedback is important
>>>>>>> for us to properly document our proposal.
>>>>>>>
>>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
>>>>>>>> disabling napi during busy polling is super useful, but it would still
>>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
>>>>>>>> a per-napi basis imho.
>>>>>> I agree, this would allow each napi queue to tune itself based
>>>>>> on heuristics. But I think doing it through an epoll-independent
>>>>>> interface makes more sense, as Stan suggested earlier.
>>>>>
>>>>> The question is whether to add a useful mechanism (one sysfs parameter
>>>>> and a few lines of code) that is optional, but with demonstrable and
>>>>> significant performance/efficiency improvements for an important class
>>>>> of applications - or wait for an uncertain future?
>>>>
>>>> The issue is that this one little change can never be removed, as it
>>>> becomes ABI.
>>>>
>>>> Let's get the right API from the start.
>>>>
>>>> Not sure that a global variable, or sysfs as API, is the right one.
>>>
>>> Sorry per-device, not global.
>>>
>>> My main concern is that it adds yet another user-tunable integer,
>>> for which the right value is not obvious.
>>
>> This is a feature for advanced users just like SO_INCOMING_NAPI_ID
>> and countless other features.
>>
>> The value may not be obvious, but guidance (in the form of
>> documentation) can be provided.
>
> Okay. Could you share a stab at what that would look like?
The timeout needs to be large enough that an application can get a
meaningful number of incoming requests processed without softirq
interference. At the same time, the timeout value determines the
worst-case delivery delay that a concurrent application using the same
queue(s) might experience. Please also see my response to Samiullah
quoted above. The specific circumstances and trade-offs might vary;
that's why a simple constant likely won't do.
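
To make that concrete, the guidance could include a sketch along these
lines (the batch size, handler, and values below are hypothetical,
purely for illustration - not from our patches):

  #include <errno.h>
  #include <sys/epoll.h>

  #define BATCH 16  /* hypothetical cap; also bounds the app's
                     * worst-case per-iteration latency */

  extern void handle(int fd);  /* hypothetical request handler */

  void event_loop(int epfd)
  {
      struct epoll_event evs[BATCH];

      for (;;) {
          /* While this batch is processed in userspace, IRQs stay
           * suspended; irq_suspend_timeout only acts as a safety
           * net if the loop takes too long to come back. */
          int n = epoll_wait(epfd, evs, BATCH, -1);

          if (n < 0) {
              if (errno == EINTR)
                  continue;
              break;
          }
          for (int i = 0; i < n; i++)
              handle(evs[i].data.fd);
      }
  }

The smaller the batch, the sooner the loop returns to epoll_wait, and
the tighter the bound on how long a concurrent application using the
same queue(s) might wait.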
>>> If the only goal is to safely reenable interrupts when the application
>>> stops calling epoll_wait, does this have to be user tunable?
>>>
>>> Can it be either a single good enough constant, or derived from
>>> another tunable, like busypoll_read.
>>
>> I believe you meant busy_read here, is that right?
>>
>> At any rate:
>>
>> - I don't think a single constant is appropriate, just as it
>> wasn't appropriate for the existing mechanism
>> (napi_defer_hard_irqs/gro_flush_timeout), and
>>
>> - Deriving the value from a pre-existing parameter to preserve the
>> ABI, like busy_read, makes using this more confusing for users
>> and complicates the API significantly.
>>
>> I agree we should get the API right from the start; that's why we've
>> submitted this as an RFC ;)
>>
>> We are happy to take suggestions from the community, but, IMHO,
>> re-using an existing parameter for a different purpose only in
>> certain circumstances (if I understand your suggestions) is a much
>> worse choice than adding a new tunable that clearly states its
>> intended singular purpose.
>
> Ack. I was thinking whether an epoll flag through your new epoll
> ioctl interface to toggle the IRQ suspension (and timer start)
> would be preferable. Because more fine grained.
A value provided by an application through the epoll ioctl would not be
subject to admin oversight, so a misbehaving application could set an
arbitrary timeout value. A sysfs value needs to be set by an admin. The
ideal timeout value depends on both the particular target application
and any concurrent applications using the same queue(s) - as sketched
above.
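
To make the oversight distinction concrete: the sysfs knob can only be
written by root. A minimal sketch, assuming the knob lands next to
gro_flush_timeout and takes nanoseconds like its siblings (the path,
name, and value here are illustrative assumptions, not final):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Assumed path and units, mirroring gro_flush_timeout; the exact
   * name is whatever the final patches settle on. */
  static int set_irq_suspend_timeout(const char *dev, long long ns)
  {
      char path[128];
      int fd;

      snprintf(path, sizeof(path),
               "/sys/class/net/%s/irq_suspend_timeout", dev);
      fd = open(path, O_WRONLY);  /* fails without root */
      if (fd < 0)
          return -1;
      dprintf(fd, "%lld\n", ns);
      close(fd);
      return 0;
  }

An application-supplied value via the epoll ioctl would bypass exactly
this gate.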
> Also, the value is likely dependent more on the expected duration
> of userspace processing? If so, it would be the same for all
> devices, so does a per-netdev value make sense?
It is per-netdev in the current proposal to match the granularity of
gro_flush_timeout and napi_defer_hard_irqs, because irq suspension
operates at the same level. This allows for more control than a global
setting, and it can be migrated to per-napi settings along with
gro_flush_timeout and napi_defer_hard_irqs when the time comes.
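
For reference, my mental model of the switching logic - a rough
pseudo-C paraphrase of the mechanism, not code from the patches, with
deliver_events() a hypothetical stand-in for the event-delivery path:

  /* Inside epoll_wait, roughly: */
  if (ep_events_available(ep)) {
      /* Busy: hand events to userspace with interrupts kept off;
       * irq_suspend_timeout is armed as a safety net in case the
       * application never comes back. */
      napi_suspend_irqs(napi);
      return deliver_events(ep);  /* hypothetical helper */
  }
  /* Idle: re-enable interrupts right away instead of waiting for
   * the timeout, then block until the device signals new data. */
  napi_resume_irqs(napi);

This is what I meant above by enabling/disabling suspension right away
when the state of the system changes between busy and idle.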
Thanks,
Martin