Message-ID: <ZyuesOyJLI3U0C5e@LQ3V64L9R2>
Date: Wed, 6 Nov 2024 08:52:00 -0800
From: Joe Damato <jdamato@...tly.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, corbet@....net, hdanton@...a.com,
bagasdotme@...il.com, pabeni@...hat.com, namangulati@...gle.com,
edumazet@...gle.com, amritha.nambiar@...el.com,
sridhar.samudrala@...el.com, sdf@...ichev.me, peter@...eblog.net,
m2shafiei@...terloo.ca, bjorn@...osinc.com, hch@...radead.org,
willy@...radead.org, willemdebruijn.kernel@...il.com,
skhawaja@...gle.com, Martin Karsten <mkarsten@...terloo.ca>,
"David S. Miller" <davem@...emloft.net>,
Simon Horman <horms@...nel.org>, David Ahern <dsahern@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Lorenzo Bianconi <lorenzo@...nel.org>,
Alexander Lobakin <aleksander.lobakin@...el.com>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next v6 2/7] net: Suspend softirq when
prefer_busy_poll is set
On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2024 21:55:26 +0000 Joe Damato wrote:
> > From: Martin Karsten <mkarsten@...terloo.ca>
> >
> > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > scheduling, potentially longer than gro_flush_timeout. This can be used
> > to effectively suspend softirq processing during the time it takes for
> > an application to process data and return to the next busy-poll.
> >
> > The call to napi->poll in busy_poll_stop might lead to an invocation of
>
> The call to napi->poll when we're arming the timer is counter
> productive, right? Maybe we can take this opportunity to add
> the seemingly missing logic to skip over it?

The call to napi->poll in busy_poll_stop does look counterproductive,
and we're not opposed to making an optimization like that in the
future.

When we tried it, it triggered several bugs and system hangs, so we
left as much of the original code in place as possible.

The existing patch works. Streamlining busy_poll_stop to skip the
call to napi->poll is an optimization that can be added in a later
series that focuses solely on when/where/how napi->poll is called.

Our focus was on:
  - Not breaking any of the existing mechanisms
  - Adding a new mechanism

I think we should avoid pulling the optimization you suggest into
this particular series and save that for the future.

> > napi_complete_done, but the prefer-busy flag is still set at that time,
> > so the same logic is used to defer softirq scheduling for
> > irq_suspend_timeout.
> >
> > Signed-off-by: Martin Karsten <mkarsten@...terloo.ca>
> > Co-developed-by: Joe Damato <jdamato@...tly.com>
> > Signed-off-by: Joe Damato <jdamato@...tly.com>
> > Tested-by: Joe Damato <jdamato@...tly.com>
> > Tested-by: Martin Karsten <mkarsten@...terloo.ca>
> > Acked-by: Stanislav Fomichev <sdf@...ichev.me>
> > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
> > ---
> > v3:
> > - Removed reference to non-existent sysfs parameter from commit
> > message. No functional/code changes.
> >
> > net/core/dev.c | 17 +++++++++++++----
> > 1 file changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 4d910872963f..51d88f758e2e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> > timeout = napi_get_gro_flush_timeout(n);
> > n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> > }
> > - if (n->defer_hard_irqs_count > 0) {
> > + if (napi_prefer_busy_poll(n)) {
> > + timeout = napi_get_irq_suspend_timeout(n);
>
> Why look at the suspend timeout in napi_complete_done()?
> We are unlikely to be exiting busy poll here.

The idea is similar to commit 7fd3253a7de6 ("net: Introduce
preferred busy-polling"): continue to defer IRQs as long as forward
progress is being made. In this case, napi->poll ran and called
napi_complete_done, so the system is moving forward with processing
and we prevent IRQs from interrupting it.

epoll_wait will re-enable IRQs (by calling napi_schedule) if
there are no events ready for processing.

> Is it because we need more time than gro_flush_timeout
> for the application to take over the polling?
That's right; we want the application to retain control of packet
processing. That's why we connected this to the "prefer_busy_poll"
flag.
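
To make the ordering in that hunk concrete, here is a rough user-space
sketch of how the timeout is chosen. It is not the kernel code: struct
napi_model, struct complete_result and model_complete_done are made-up
names, the fields only stand in for the napi_get_*() helpers seen in
the diff, and the final "re-enable IRQs = false" in the fallback branch
is assumed from context because the quoted hunk ends before it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few NAPI fields the hunk touches. */
struct napi_model {
    bool prefer_busy_poll;         /* models napi_prefer_busy_poll(n) */
    uint64_t irq_suspend_timeout;  /* ns, 0 means suspend is disabled */
    uint64_t gro_flush_timeout;    /* ns */
    int defer_hard_irqs_count;     /* remaining deferral budget */
};

struct complete_result {
    bool reenable_irqs;  /* models the return value "ret" */
    uint64_t timer_ns;   /* timer that would be armed, 0 = none */
};

static struct complete_result model_complete_done(struct napi_model *n)
{
    struct complete_result r = { true, 0 };

    /* New branch: while prefer-busy-poll is set, the suspend timeout wins. */
    if (n->prefer_busy_poll) {
        r.timer_ns = n->irq_suspend_timeout;
        if (r.timer_ns)
            r.reenable_irqs = false;
    }

    /* Existing fallback, only reached if suspension did not apply. */
    if (r.reenable_irqs && n->defer_hard_irqs_count > 0) {
        n->defer_hard_irqs_count--;
        r.timer_ns = n->gro_flush_timeout;
        if (r.timer_ns)
            r.reenable_irqs = false;  /* assumed from context */
    }
    return r;
}
```

With prefer_busy_poll set and a nonzero suspend timeout, the deferral
budget is left untouched and the longer timer is used; otherwise the
model falls through to the gro_flush_timeout path as before.
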
> > + if (timeout)
> > + ret = false;
> > + }
> > + if (ret && n->defer_hard_irqs_count > 0) {
> > n->defer_hard_irqs_count--;
> > timeout = napi_get_gro_flush_timeout(n);
> > if (timeout)
> > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> > bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> >
> > if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > - napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > - timeout = napi_get_gro_flush_timeout(napi);
> > - if (napi->defer_hard_irqs_count && timeout) {
> > + timeout = napi_get_irq_suspend_timeout(napi);
>
> Even here I'm not sure if we need to trigger suspend.
> I don't know the eventpoll code well but it seems like you suspend
> and resume based on events when exiting epoll. Why also here?

There are two questions wrapped up here, plus an overall point:

1. Suspend and resume based on events when exiting epoll: that's
   right, and as you'll see in those patches it happens by:
     - arming the suspend timer (via a call to napi_suspend_irqs)
       when a positive number of events are retrieved
     - calling napi_schedule (via napi_resume_irqs) when there are
       no events or the epoll context is being freed.

2. Why arm the suspend timer here in busy_poll_stop? Note that the
   original code would set the timer to gro_flush_timeout, which
   would introduce the tradeoffs we mention in the cover letter
   (latency for large values, IRQ interruption for small values).

   We don't want gro_flush_timeout to take over yet because we
   want to avoid these tradeoffs up until the point where epoll_wait
   finds no events for processing.

   If we skipped the IRQ suspend deferral here, we'd be giving
   packet processing control back to gro_flush_timeout and
   napi_defer_hard_irqs, but the system might still have packets
   that can be processed in the next call to epoll_wait.

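
The timer choice in busy_poll_stop described in point 2 can be sketched
in user space as follows. This is a simplified model of only the
NAPI_F_PREFER_BUSY_POLL branch in the diff, not the kernel function;
struct bps_model, struct bps_result and model_busy_poll_stop are
hypothetical names used for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-NAPI settings read in the hunk. */
struct bps_model {
    uint64_t irq_suspend_timeout;  /* ns, 0 means suspend is disabled */
    uint64_t gro_flush_timeout;    /* ns */
    int defer_hard_irqs;           /* configured deferral count */
};

struct bps_result {
    uint64_t timer_ns;   /* value the hrtimer would be started with */
    bool skip_schedule;  /* true if the timer was armed */
};

static struct bps_result model_busy_poll_stop(const struct bps_model *n)
{
    struct bps_result r = { 0, false };

    /* Prefer the suspend timeout so gro_flush_timeout does not take over. */
    r.timer_ns = n->irq_suspend_timeout;
    if (!r.timer_ns) {
        /* Suspend not configured: fall back to the original behaviour. */
        if (n->defer_hard_irqs)
            r.timer_ns = n->gro_flush_timeout;
    }
    if (r.timer_ns)
        r.skip_schedule = true;
    return r;
}
```

When irq_suspend_timeout is zero the model behaves like the code being
replaced, which matches the goal of not breaking existing mechanisms.
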
The overall point is that the suspend timer is used to prevent
misbehaving userland applications from taking too long. It's
essentially a backstop and, as long as the app is making forward
progress, it allows the app to continue running its busy poll loop
undisturbed (via napi_complete_done preventing the driver from
enabling IRQs).

Does that make sense?
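
As a rough illustration of the backstop behaviour, here is a small
user-space timeline model. It assumes a nonzero irq_suspend_timeout and
collapses the kernel side into two events: epoll_wait returning, and
the timer deadline passing. struct suspend_model, epoll_wait_returned
and irqs_suspended_at are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the suspend state seen from one epoll context. */
struct suspend_model {
    uint64_t irq_suspend_timeout_ns;  /* assumed nonzero */
    uint64_t deadline_ns;             /* absolute time the backstop fires */
    bool suspended;
};

/* Models what happens when epoll_wait returns nr_events at time now_ns. */
static void epoll_wait_returned(struct suspend_model *m, uint64_t now_ns,
                                int nr_events)
{
    if (nr_events > 0) {
        /* Stands in for napi_suspend_irqs: (re)arm the backstop timer. */
        m->suspended = true;
        m->deadline_ns = now_ns + m->irq_suspend_timeout_ns;
    } else {
        /* Stands in for napi_resume_irqs: normal deferral takes over. */
        m->suspended = false;
    }
}

/* True while the app still owns packet processing at time now_ns. */
static bool irqs_suspended_at(const struct suspend_model *m, uint64_t now_ns)
{
    return m->suspended && now_ns < m->deadline_ns;
}
```

An app that keeps returning to epoll_wait with events before the
deadline keeps pushing the deadline out; one that stalls past the
deadline, or finds no events, hands control back to the kernel.
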
> > + if (!timeout) {
> > + napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > + if (napi->defer_hard_irqs_count)
> > + timeout = napi_get_gro_flush_timeout(napi);
> > + }
> > + if (timeout) {
> > hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> > skip_schedule = true;
> > }
>