Message-ID: <Zywy8PQDljS5r_rX@LQ3V64L9R2>
Date: Wed, 6 Nov 2024 19:24:32 -0800
From: Joe Damato <jdamato@...tly.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, corbet@....net, hdanton@...a.com,
	bagasdotme@...il.com, pabeni@...hat.com, namangulati@...gle.com,
	edumazet@...gle.com, amritha.nambiar@...el.com,
	sridhar.samudrala@...el.com, sdf@...ichev.me, peter@...eblog.net,
	m2shafiei@...terloo.ca, bjorn@...osinc.com, hch@...radead.org,
	willy@...radead.org, willemdebruijn.kernel@...il.com,
	skhawaja@...gle.com, Martin Karsten <mkarsten@...terloo.ca>,
	"David S. Miller" <davem@...emloft.net>,
	Simon Horman <horms@...nel.org>, David Ahern <dsahern@...nel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Lorenzo Bianconi <lorenzo@...nel.org>,
	Alexander Lobakin <aleksander.lobakin@...el.com>,
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next v6 2/7] net: Suspend softirq when
 prefer_busy_poll is set

On Wed, Nov 06, 2024 at 03:31:00PM -0800, Jakub Kicinski wrote:
> On Wed, 6 Nov 2024 08:52:00 -0800 Joe Damato wrote:
> > On Tue, Nov 05, 2024 at 09:03:38PM -0800, Jakub Kicinski wrote:
> > > On Mon,  4 Nov 2024 21:55:26 +0000 Joe Damato wrote:  
> > > > From: Martin Karsten <mkarsten@...terloo.ca>
> > > > 
> > > > When NAPI_F_PREFER_BUSY_POLL is set during busy_poll_stop and the
> > > > irq_suspend_timeout is nonzero, this timeout is used to defer softirq
> > > > scheduling, potentially longer than gro_flush_timeout. This can be used
> > > > to effectively suspend softirq processing during the time it takes for
> > > > an application to process data and return to the next busy-poll.
> > > > 
> > > > The call to napi->poll in busy_poll_stop might lead to an invocation of  
> > > 
> > > The call to napi->poll when we're arming the timer is
> > > counterproductive, right? Maybe we can take this opportunity to add
> > > the seemingly missing logic to skip over it?
> > 
> > It seems like the call to napi->poll in busy_poll_stop is
> > counterproductive, and we're not opposed to making an optimization
> > like that in the future.
> > 
> > When we tried it, it triggered several bugs/system hangs, so we left
> > as much of the original code in place as possible.
> 
> You don't happen to have the patch you used? Many ways to get the
> skipping wrong.

Please see below; I think we've found a solution.
 
> > The existing patch works and streamlining busy_poll_stop to skip the
> > call to napi->poll is an optimization that can be added as a later
> > series that focuses solely on when/where/how napi->poll is called.
> 
> The reason I brought it up is that it rearms the timer if the driver
> ends up calling napi_complete_done(). So we arm the timer in
> busy_poll_stop(), then call the driver, which may rearm it again,
> making the already complex code even harder to reason about.

Agreed that the timer is unnecessarily armed twice.

Martin ran some initial tests of this series with this patch
(patch 2) dropped, and the initial results from a small number of
runs look fine.

In other words: I think we can simply drop this patch entirely,
re-run our tests to regenerate the data, update the documentation,
and send a v7.

But please continue reading below.

> > Our focus was on:
> >   - Not breaking any of the existing mechanisms
> >   - Adding a new mechanism
> > 
> > I think we should avoid pulling the optimization you suggest into
> > this particular series and save that for the future.
> 
> I'm primarily worried about maintainability of this code.

Of course, and we're open to figuring out how to help with that.

> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index 4d910872963f..51d88f758e2e 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -6239,7 +6239,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> > > >  			timeout = napi_get_gro_flush_timeout(n);
> > > >  		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
> > > >  	}
> > > > -	if (n->defer_hard_irqs_count > 0) {
> > > > +	if (napi_prefer_busy_poll(n)) {
> > > > +		timeout = napi_get_irq_suspend_timeout(n);  
> > > 
> > > Why look at the suspend timeout in napi_complete_done()?
> > > We are unlikely to be exiting busy poll here.  
> > 
> > The idea is similar to commit 7fd3253a7de6 ("net: Introduce
> > preferred busy-polling"); continue to defer IRQs as long as forward
> > progress is being made. In this case, napi->poll ran, called
> > napi_complete_done -- the system is moving forward with processing
> > so prevent IRQs from interrupting us.
> 
> We should clarify the mental models. You're describing IRQ deferral,
> but say prefer busy poll.

OK; we're open to using different language if that would be helpful.

> Prefer busy poll has only one function - if we are at 100% busy
> and always see >= budget of packets on the ring, we never call
> napi_complete_done(). Drivers don't call napi_complete_done() if they
> consumed full budget. So we need a way to break that re-polling loop,
> release the NAPI ownership and give busy poll a chance to claim the
> NAPI instance ownership (the SCHED bit). We check for prefer
> busy poll in __napi_poll(), because, again, in the target scenario
> napi_complete_done() is never called.
> 
> The IRQ deferral mechanism is necessary for prefer busy poll to work,
> but it's separate and used by some drivers without good IRQ coalescing,
> with no user space polling involved.

Sure, we agree and are on the same page about what prefer busy poll
is and how it interacts with IRQ deferral.
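
To make sure we're reading that check the same way, here is a rough
paraphrase of the re-polling break you describe in __napi_poll()
(simplified, not the verbatim kernel code):

	/* The driver consumed the full budget, so it did not call
	 * napi_complete_done() itself. With prefer busy poll set,
	 * complete here instead of re-polling, releasing NAPI
	 * ownership (the SCHED bit) so a user space busy poller can
	 * claim the instance.
	 */
	if (napi_prefer_busy_poll(n)) {
		if (napi_complete_done(n, work)) {
			/* No deferral timer was armed: make sure NAPI
			 * still gets one more run.
			 */
			napi_schedule(n);
		}
		return work;
	}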

> In your case, when the machine is not melting under 100% load, prefer
> busy poll will be set once or not at all.

I am not sure what you mean by that last sentence; the prefer busy
poll flag is set by the application, isn't it?

Similar to prefer busy poll piggybacking on IRQ deferral, we
piggyback on prefer busy polling by allowing the application to use
an even larger timeout while it is processing incoming data, i.e.,
deferring IRQs for a longer period, but only after a successful busy
poll. This makes prefer busy poll + irq suspend useful when
utilization is below 100%.
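
To make that concrete, here is a simplified sketch of what the epoll
side of the series does with the NAPI instance an epoll context is
tied to. The helper name and the exact signatures below are just for
illustration; the real plumbing sits in the eventpoll patches:

	/* Hypothetical helper, called after epoll_wait() has harvested
	 * events for a NAPI-bound epoll context.
	 */
	static void ep_napi_irq_hint(unsigned int napi_id, int nr_events)
	{
		if (nr_events > 0) {
			/* Forward progress was made: arm the suspend
			 * timer so IRQs stay deferred for up to
			 * irq_suspend_timeout while the application
			 * processes the events.
			 */
			napi_suspend_irqs(napi_id);
		} else {
			/* Nothing ready: hand control back to
			 * gro_flush_timeout / napi_defer_hard_irqs by
			 * scheduling NAPI once.
			 */
			napi_resume_irqs(napi_id);
		}
	}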

> > epoll_wait will re-enable IRQs (by calling napi_schedule) if
> > there are no events ready for processing.
> 
> To be 100% precise, calling napi_schedule will not re-enable IRQs
> if IRQ deferral is active. It only guarantees one NAPI run in
> softirq (or a thread if threaded).

Yes, I should have been more precise.

Calling napi_schedule doesn't enable IRQs directly; it runs NAPI,
which *eventually* causes napi_complete_done to return true, which
in turn triggers the re-arming of IRQs by the driver.
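
By "the driver" I mean the usual NAPI poll pattern; a sketch with
made-up foo_* names:

	/* Typical driver poll callback (foo_* are placeholders). */
	static int foo_napi_poll(struct napi_struct *napi, int budget)
	{
		struct foo_ring *ring =
			container_of(napi, struct foo_ring, napi);
		int work_done = foo_clean_rx_irq(ring, budget);

		/* Full budget consumed: stay in polling mode, leave
		 * the device IRQ masked.
		 */
		if (work_done == budget)
			return budget;

		/* Under budget: napi_complete_done() decides whether
		 * IRQs should stay deferred (napi_defer_hard_irqs /
		 * gro_flush_timeout / irq suspend). Only when it
		 * returns true does the driver unmask its IRQ.
		 */
		if (napi_complete_done(napi, work_done))
			foo_enable_irq(ring);

		return work_done;
	}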

> > > Is it because we need more time than gro_flush_timeout
> > > for the application to take over the polling?  
> > 
> > That's right; we want the application to retain control of packet
> > processing. That's why we connected this to the "prefer_busy_poll"
> > flag.
> > 
> > > > +		if (timeout)
> > > > +			ret = false;
> > > > +	}
> > > > +	if (ret && n->defer_hard_irqs_count > 0) {
> > > >  		n->defer_hard_irqs_count--;
> > > >  		timeout = napi_get_gro_flush_timeout(n);
> > > >  		if (timeout)
> > > > @@ -6375,9 +6380,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
> > > >  	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> > > >  
> > > >  	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> > > > -		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> > > > -		timeout = napi_get_gro_flush_timeout(napi);
> > > > -		if (napi->defer_hard_irqs_count && timeout) {
> > > > +		timeout = napi_get_irq_suspend_timeout(napi);  
> > > 
> > > Even here I'm not sure if we need to trigger suspend.
> > > I don't know the eventpoll code well but it seems like you suspend 
> > > and resume based on events when exiting epoll. Why also here?  
> > 
> > There are two questions wrapped up here and an overall point to make:
> > 
> > 1. Suspend and resume based on events when exiting epoll - that's
> >    right and as you'll see in those patches that happens by:
> >      - arming the suspend timer (via a call to napi_suspend_irqs)
> >        when a positive number of events are retrieved
> >      - calling napi_schedule (via napi_resume_irqs) when there are
> >        no events or the epoll context is being freed.
> > 
> > 2. Why arm the suspend timer here in busy_poll_stop? Note that the
> >    original code would set the timer to gro_flush_timeout, which
> >    would introduce the tradeoffs we mention in the cover letter
> >    (latency for large values, IRQ interruption for small values).
> > 
> >    We don't want the gro_flush_timeout to take over yet because we
> >    want to avoid these tradeoffs up until the point where epoll_wait
> >    finds no events for processing.
> > 
> >    Does that make sense? If we skipped the IRQ suspend deferral
> >    here, we'd be giving packet processing control back to
> >    gro_flush_timeout and napi_defer_hard_irqs, but the system might
> >    still have packets that can be processed in the next call to
> >    epoll_wait.
> 
> Let me tell you what I think happens and then you can correct me.
> 
> 0 epoll
> 1   # ..does its magic..
> 2   __napi_busy_loop() 
> 3     # finds a waking packet
> 4     busy_poll_stop()
> 5       # arms the timer for long suspend
> 6   # epoll sees events
> 7     ep_suspend_napi_irqs()
> 8       napi_suspend_irqs()
> 9         # arms for long timer again
> 
> The timer we arm here only has to survive from line 5 to line 9,
> because we will override the timeout on line 9.

Yes, you are right. Thanks for catching this.

> > The overall point to make is that: the suspend timer is used to
> > prevent misbehaving userland applications from taking too long. It's
> > essentially a backstop and, as long as the app is making forward
> > progress, allows the app to continue running its busy poll loop
> > undisturbed (via napi_complete_done preventing the driver from
> > enabling IRQs).
> > 
> > Does that make sense?
> 
> My mental model put in yet another way is that only epoll knows if it
> has events, and therefore whether the timeout should be short or long.
> So the suspend timer should only be applied by epoll.

Here's what we are thinking; can you let me know if you agree with
this?

  - We can drop patch 2 entirely
  - Update the documentation about IRQ suspension as needed now
    that patch 2 has been dropped
  - Leave the rest of the series as is
  - Re-run our tests to gather sufficient data for the test cases
    outlined in the cover letter to ensure that the performance
    numbers hold over several iterations

Does that seem reasonable for the v7 to you?

I am asking because generating that amount of data across all the
scenarios we are testing takes a long time, and I want to make sure
we are as aligned as possible before I kick off another run :)
