Message-ID: <ZrqU3kYgL4-OI-qj@mini-arch>
Date: Mon, 12 Aug 2024 16:03:58 -0700
From: Stanislav Fomichev <sdf@...ichev.me>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>,
amritha.nambiar@...el.com, sridhar.samudrala@...el.com,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Breno Leitao <leitao@...ian.org>,
Christian Brauner <brauner@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
Jiri Pirko <jiri@...nulli.us>,
Johannes Berg <johannes.berg@...el.com>,
Jonathan Corbet <corbet@....net>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)" <linux-fsdevel@...r.kernel.org>,
open list <linux-kernel@...r.kernel.org>,
Lorenzo Bianconi <lorenzo@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

On 08/12, Martin Karsten wrote:
> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > On 08/12, Joe Damato wrote:
> > > Greetings:
> > >
> > > Martin Karsten (CC'd) and I have been collaborating on some ideas about
> > > ways of reducing tail latency when using epoll-based busy poll and we'd
> > > love to get feedback from the list on the code in this series. This is
> > > the idea I mentioned at netdev conf, for those who were there. Barring
> > > any major issues, we hope to submit this officially shortly after RFC.
> > >
> > > The basic idea for suspending IRQs in this manner was described in an
> > > earlier paper presented at Sigmetrics 2024 [1].
> >
> > Let me explicitly call out the paper. Very nice analysis!
>
> Thank you!
>
> [snip]
>
> > > Here's how it is intended to work:
> > > - An administrator sets the existing sysfs parameters for
> > > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> > >
> > > - An administrator sets the new sysfs parameter irq_suspend_timeout
> > > to a larger value than gro_flush_timeout to enable IRQ suspension.
> >
> > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > prefer_busy_poll case (why?)? Because looking at the first two patches,
> > you essentially replace all usages of gro_flush_timeout with a new variable
> > and I don't see how it helps.
>
> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
> deferral mechanism and as such, always active when configured. Its static
> periodic softirq processing leads to a situation where:
>
> - A long gro-flush-timeout causes high latencies when load is sufficiently
> below capacity, or
>
> - a short gro-flush-timeout causes overhead when softirq execution
> asynchronously competes with application processing at high load.
>
> The shortcomings of this are documented (to some extent) by our experiments.
> See defer20 working well at low load, but having problems at high load,
> while defer200 has higher latency at low load.
>
> irq-suspend-timeout is only active when an application uses
> prefer-busy-polling and in that case, produces a nice alternating pattern of
> application processing and networking processing (similar to what we
> describe in the paper). This then works well with both low and high load.

So you only want it for the prefer-busy-polling case, makes sense. I was
a bit confused by the difference between defer200 and suspend200,
but now I see that defer200 does not enable busypoll.

I'm assuming that if you enable busypoll in the defer200 case, the numbers
should be similar to suspend200 (ignoring the potential effect on
non-busypolling queues due to the higher gro_flush_timeout).

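For reference, a minimal sketch of how I read the suspendX setup quoted
further down, expressed as sysfs writes. The helper names are made up for
illustration. napi_defer_hard_irqs and gro_flush_timeout are the existing
per-device files; the location of irq_suspend_timeout is an assumption
based on the RFC description, as is its unit.

```python
from pathlib import Path

# 20,000,000 as quoted for suspendX; presumably nanoseconds, like
# gro_flush_timeout (assumption).
IRQ_SUSPEND_TIMEOUT_NS = 20_000_000


def suspend_settings(dev: str, x: int) -> dict[str, int]:
    """Return {sysfs path: value} for suspendX on network device `dev`."""
    base = f"/sys/class/net/{dev}"
    gro_path = f"{base}/gro_flush_timeout"
    # Assumed location of the new parameter; not confirmed by this thread.
    suspend_path = f"{base}/irq_suspend_timeout"
    settings = {
        f"{base}/napi_defer_hard_irqs": 100,
        gro_path: x * 1_000,  # "X,000" nanoseconds
        suspend_path: IRQ_SUSPEND_TIMEOUT_NS,
    }
    # The series expects irq_suspend_timeout to be larger than gro_flush_timeout.
    if settings[suspend_path] <= settings[gro_path]:
        raise ValueError("irq_suspend_timeout must exceed gro_flush_timeout")
    return settings


def apply_settings(settings: dict[str, int]) -> None:
    """Write the values; needs root and a kernel carrying this series."""
    for path, value in settings.items():
        Path(path).write_text(f"{value}\n")
```

suspend_settings("eth0", 200) gives the suspend200 values; nothing is
written until apply_settings() is called.
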
> > Maybe expand more on what code paths are we trying to improve? Existing
> > busy polling code is not super readable, so would be nice to simplify
> > it a bit in the process (if possible) instead of adding one more tunable.
>
> There are essentially three possible loops for network processing:
>
> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>
> 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
> with the shortcomings described above
>
> 3) epoll -> busy-poll -> napi poll
>
> If a system is configured for 1), not much can be done, as it is difficult
> to interject anything into this loop without adding state and side effects.
> This is what we tried for the paper, but it ended up being a hack.
>
> If however the system is configured for irq deferral, Loops 2) and 3)
> "wrestle" with each other for control. Injecting the larger
> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> of Loop 3) and creates the nice pattern described above.

And you hit (2) when the epoll goes to sleep and/or when the userspace
isn't fast enough to keep up with the timer, presumably? I wonder
if we should use this opportunity to do a proper API, as Joe hints in the
cover letter. Something over netlink to say "I'm gonna busy-poll on
this queue / napi_id and with this timeout". And then we can essentially make
gro_flush_timeout per queue (and avoid
napi_resume_irqs/napi_suspend_irqs). The existing gro_flush_timeout feels
too hacky already :-(

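To check my understanding of how loops 2) and 3) compete, here is a toy
model (not kernel code). It assumes the timer is re-armed after every poll
and fires only if the application has not re-entered epoll before it
expires. It also ignores the detail of when exactly the series switches
between the two timeouts. The function names are made up; the timeout
values are the suspend200 ones from this thread.

```python
# Toy model, not kernel code: decide which loop does the next NAPI poll.
GRO_FLUSH_TIMEOUT_NS = 200_000       # suspend200: gro_flush_timeout = 200,000
IRQ_SUSPEND_TIMEOUT_NS = 20_000_000  # irq_suspend_timeout = 20,000,000


def armed_timeout_ns(prefer_busy_poll: bool) -> int:
    """Timeout used for 'timer' in Loop 2, as described in this thread."""
    return IRQ_SUSPEND_TIMEOUT_NS if prefer_busy_poll else GRO_FLUSH_TIMEOUT_NS


def next_poller(app_gap_ns: int, prefer_busy_poll: bool) -> str:
    """Return which loop polls next, given the time until the app's next epoll call."""
    if app_gap_ns < armed_timeout_ns(prefer_busy_poll):
        return "epoll busy-poll (loop 3)"
    return "timer softirq (loop 2)"
```

With an application that spends 1 ms between epoll calls,
next_poller(1_000_000, False) lands in loop 2, while
next_poller(1_000_000, True) stays in loop 3. An application that sleeps
longer than irq_suspend_timeout falls back to loop 2 either way.
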
> [snip]
>
> > > - suspendX:
> > > - set defer_hard_irqs to 100
> > > - set gro_flush_timeout to X,000
> > > - set irq_suspend_timeout to 20,000,000
> > > - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> > > busy_poll_budget = 64, prefer_busy_poll = true)
> >
> > What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
> > to the busy_poll sysctl value?
>
> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
> busy-polling even when the application does not actually need or want it.
> Only one iteration through the busy loop is needed to make the new scheme
> work. Additional napi busy polling over and above is optional.

Ack, thanks. I was trying to understand why not stay with
busy_poll_usecs=64 for consistency. But I guess you were just
trying to show that patch 4/5 works.
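
For anyone following along, a rough sketch of the suspendX epoll settings
(busy_poll_usecs = 0, busy_poll_budget = 64, prefer_busy_poll = true) as
set from userspace. The constants mirror struct epoll_params and
EPIOCSPARAMS from <linux/eventpoll.h> (Linux 6.9+). The ioctl number is
computed with the common _IOW encoding (x86, arm64); some architectures
encode the direction bits differently, so treat that as an assumption.
The helper names are made up.

```python
import fcntl
import struct

EPOLL_IOC_TYPE = 0x8A
# u32 busy_poll_usecs, u16 busy_poll_budget, u8 prefer_busy_poll, u8 __pad
EPOLL_PARAMS_FMT = "=IHBB"
_IOC_WRITE = 1  # direction value on x86/arm64 (assumption for other archs)
EPIOCSPARAMS = (
    (_IOC_WRITE << 30)
    | (struct.calcsize(EPOLL_PARAMS_FMT) << 16)
    | (EPOLL_IOC_TYPE << 8)
    | 0x01
)


def pack_epoll_params(busy_poll_usecs, busy_poll_budget, prefer_busy_poll):
    """Pack the fields in struct epoll_params order; the pad byte stays zero."""
    return struct.pack(
        EPOLL_PARAMS_FMT, busy_poll_usecs, busy_poll_budget, int(prefer_busy_poll), 0
    )


def set_busy_poll(ep, params):
    """Apply params to an epoll object; returns False if the kernel rejects the ioctl."""
    try:
        fcntl.ioctl(ep.fileno(), EPIOCSPARAMS, params)
        return True
    except OSError:
        return False


# suspendX settings from the quoted text: usecs = 0, budget = 64, prefer = true
SUSPEND_PARAMS = pack_epoll_params(0, 64, True)
```

set_busy_poll(select.epoll(), SUSPEND_PARAMS) would then apply it; on
kernels without EPIOCSPARAMS it just returns False.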