lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZyuesOyJLI3U0C5e@LQ3V64L9R2>
Date: Wed, 6 Nov 2024 08:52:00 -0800
From: Joe Damato <jdamato@...tly.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, corbet@....net, hdanton@...a.com,
	bagasdotme@...il.com, pabeni@...hat.com, namangulati@...gle.com,
	edumazet@...gle.com, amritha.nambiar@...el.com,
	sridhar.samudrala@...el.com, sdf@...ichev.me, peter@...eblog.net,
	m2shafiei@...terloo.ca, bjorn@...osinc.com, hch@...radead.org,
	willy@...radead.org, willemdebruijn.kernel@...il.com,
	skhawaja@...gle.com, Martin Karsten <mkarsten@...terloo.ca>,
	"David S. Miller" <davem@...emloft.net>,
	Simon Horman <horms@...nel.org>, David Ahern <dsahern@...nel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Lorenzo Bianconi <lorenzo@...nel.org>,
	Alexander Lobakin <aleksander.lobakin@...el.com>,
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next v6 2/7] net: Suspend softirq when
 prefer_busy_poll is set

On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:
> > From: Martin Karsten <mkarsten@...terloo.ca>
> > 
> > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > scheduling, potentially longer than gro_flush_timeout. This can be used
> > to effectively suspend softirq processing during the time it takes for
> > an application to process data and return to the next busy-poll.
> > 
> > The call to napi->poll in busy_poll_stop might lead to an invocation of
> 
> The call to napi->poll when we're arming the timer is counter
> productive, right? Maybe we can take this opportunity to add
> the seemingly missing logic to skip over it?

It seems like the call to napi->poll in busy_poll_stop is counter
productive and we're not opposed to making an optimization like that
in the future.

When we tried it, it triggered several bugs/system hangs, so we left
as much of the original code in place as possible.

The existing patch works and streamlining busy_poll_stop to skip the
call to napi->poll is an optimization that can be added as a later
series that focuses solely on when/where/how napi->poll is called.

Our focus was on:
  - Not breaking any of the existing mechanisms
  - Adding a new mechanism

I think we should avoid pulling the optimization you suggest into
this particular series and save that for the future.

> > napi_complete_done, but the prefer-busy flag is still set at that time,
> > so the same logic is used to defer softirq scheduling for
> > irq_suspend_timeout.
> > 
> > Signed-off-by: Martin Karsten <mkarsten@...terloo.ca>
> > Co-developed-by: Joe Damato <jdamato@...tly.com>
> > Signed-off-by: Joe Damato <jdamato@...tly.com>
> > Tested-by: Joe Damato <jdamato@...tly.com>
> > Tested-by: Martin Karsten <mkarsten@...terloo.ca>
> > Acked-by: Stanislav Fomichev <sdf@...ichev.me>
> > Reviewed-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
> > ---
> >  v3:
> >    - Removed reference to non-existent sysfs parameter from commit
> >      message. No functional/code changes.
> > 
> >  net/core/dev.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 4d910872963f..51d88f758e2e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> >  			timeout = napi_get_gro_flush_timeout(n);
> >  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> >  	}
> > -	if (n->defer_hard_irqs_count > 0) {
> > +	if (napi_prefer_busy_poll(n)) {
> > +		timeout = napi_get_irq_suspend_timeout(n);
> 
> Why look at the suspend timeout in napi_complete_done()?
> We are unlikely to be exiting busy poll here.

The idea is similar to commit 7fd3253a7de6 ("net: Introduce
preferred busy-polling"); continue to defer IRQs as long as forward
progress is being made. In this case, napi->poll ran, called
napi_complete_done -- the system is moving forward with processing
so prevent IRQs from interrupting us.

epoll_wait will re-enable IRQs (by calling napi_schedule) if
there are no events ready for processing.

> Is it because we need more time than gro_flush_timeout
> for the application to take over the polling?

That's right; we want the application to retain control of packet
processing. That's why we connected this to the "prefer_busy_poll"
flag.

> > +		if (timeout)
> > +			ret = false;
> > +	}
> > +	if (ret && n->defer_hard_irqs_count > 0) {
> >  		n->defer_hard_irqs_count--;
> >  		timeout = napi_get_gro_flush_timeout(n);
> >  		if (timeout)
> > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> >  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> >  
> >  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > -		timeout = napi_get_gro_flush_timeout(napi);
> > -		if (napi->defer_hard_irqs_count && timeout) {
> > +		timeout = napi_get_irq_suspend_timeout(napi);
> 
> Even here I'm not sure if we need to trigger suspend.
> I don't know the eventpoll code well but it seems like you suspend 
> and resume based on events when exiting epoll. Why also here?

There's two questions wrapped up here and an overall point to make:

1. Suspend and resume based on events when exiting epoll - that's
   right and as you'll see in those patches that happens by:
     - arming the suspend timer (via a call to napi_suspend_irqs)
       when a positive number of events are retrieved
     - calling napi_schedule (via napi_resume_irqs) when there are
       no events or the epoll context is being freed.

2. Why defer the suspend timer here in busy_poll_stop? Note that the
   original code would set the timer to gro_flush_timeout, which
   would introduce the trade offs we mention in the cover letter
   (latency for large values, IRQ interruption for small values).

   We don't want the gro_flush_timeout to take over yet because we
   want to avoid these tradeoffs up until the point where epoll_wait
   finds no events for processing.

   Does that make sense? If we skipped the IRQ suspend deferral
   here, we'd be giving packet processing control back to
   gro_flush_timeout and napi_defer_hard_irqs, but the system might
   still have packets that can be processed in the next call to
   epoll_wait.

The overall point to make is that: the suspend timer is used to
prevent misbehaving userland applications from taking too long. It's
essentially a backstop and, as long as the app is making forward
progress, allows the app to continue running its busy poll loop
undisturbed (via napi_complete_done preventing the driver from
enabling IRQs).

Does that make sense?

> > +		if (!timeout) {
> > +			napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > +			if (napi->defer_hard_irqs_count)
> > +				timeout = napi_get_gro_flush_timeout(napi);
> > +		}
> > +		if (timeout) {
> >  			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> >  			skip_schedule = true;
> >  		}
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ