linux-kernel - Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <66bcc87d605_b1f942948@willemb.c.googlers.com.notmuch>
Date: Wed, 14 Aug 2024 11:08:45 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Joe Damato <jdamato@...tly.com>, 
 Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Martin Karsten <mkarsten@...terloo.ca>, 
 Stanislav Fomichev <sdf@...ichev.me>, 
 netdev@...r.kernel.org, 
 amritha.nambiar@...el.com, 
 sridhar.samudrala@...el.com, 
 Alexander Lobakin <aleksander.lobakin@...el.com>, 
 Alexander Viro <viro@...iv.linux.org.uk>, 
 Breno Leitao <leitao@...ian.org>, 
 Christian Brauner <brauner@...nel.org>, 
 Daniel Borkmann <daniel@...earbox.net>, 
 "David S. Miller" <davem@...emloft.net>, 
 Eric Dumazet <edumazet@...gle.com>, 
 Jakub Kicinski <kuba@...nel.org>, 
 Jan Kara <jack@...e.cz>, 
 Jiri Pirko <jiri@...nulli.us>, 
 Johannes Berg <johannes.berg@...el.com>, 
 Jonathan Corbet <corbet@....net>, 
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, 
 "open list:FILESYSTEMS (VFS and infrastructure)" <linux-fsdevel@...r.kernel.org>, 
 open list <linux-kernel@...r.kernel.org>, 
 Lorenzo Bianconi <lorenzo@...nel.org>, 
 Paolo Abeni <pabeni@...hat.com>, 
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Joe Damato wrote:
> On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> > Martin Karsten wrote:
> > > On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > > On 08/12, Martin Karsten wrote:
> > > >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > > >>> On 08/12, Martin Karsten wrote:
> > > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > >>>>> On 08/12, Martin Karsten wrote:
> > > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > >>>>>>> On 08/12, Joe Damato wrote:
> 
> [...]
> 
> > > >>
> > > >> One of the goals of this patch set is to reduce parameter tuning and make
> > > >> the parameter setting independent of workload dynamics, so it should make
> > > >> things easier. This is of course notwithstanding that per-napi settings
> > > >> would be even better.
> > 
> > I don't follow how adding another tunable reduces parameter tuning.
> 
> Thanks for taking a look and providing valuable feedback, Willem.
> 
> An early draft of the cover letter included some paragraphs which were removed
> for the sake of brevity that we can add back in, which addresses your comment
> above:
> 
>  The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout)
>  is useful, but picking the correct values for these settings is difficult and
>  the ideal values change as the type of traffic and workload changes.
> 
>  For example: picking a large timeout value is good for batching packet
>  processing, but induces latency. Picking a small timeout value will interrupt
>  user app processing and reduce efficiency. The value chosen would be different
>  depending on whether the system is under high load (large timeout) or when less
>  busy (small timeout).
> 
> As such, adding the new tunable makes it much easier to use the existing ones
> and also produces better performance as shown in the results we presented. 
> 
> Please let me know if you have any questions; I know that the change we are
> introducing is very subtle and I am happy to expand the cover letter if it'd be
> helpful for you.
> 
> My concern was that the cover letter was too long already, but a big takeaway
> for me thus far has been that we should expand the cover letter.
> 
> [...]
> 
> > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > > disabling napi during busy polling is super useful, but it would still
> > > > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > > a per-napi basis imho.
> > > 
> > > Fingers crossed. I hope this patch will be accepted, because it has 
> > > practical performance and efficiency benefits, and that this will 
> > > further increase the motivation to re-design the entire irq 
> > > defer(/suspend) infrastructure for per-napi settings.
> > 
> > Overall, the idea of keeping interrupts disabled during event
> > processing is very interesting.
> 
> Thanks; I'm happy to hear we are aligned on this.
> 
> > Hopefully the interface can be made more intuitive. Or documented more
> > easily. I had to read the kernel patches to fully (perhaps) grasp it.
> > 
> > Another +1 on the referenced paper. Pointing out a specific difference
> > in behavior that is unrelated to the protection domain, rather than a
> > straightforward kernel vs user argument. The paper also had some
> > explanation that may be clearer for a commit message than the current
> > cover letter:
> > 
> > "user-level network stacks put the application in charge of the entire
> > network stack processing (cf. Section 2). Interrupts are disabled and
> > the application coordinates execution by alternating between
> > processing existing requests and polling the RX queues for new data"
> > " [This series extends this behavior to kernel busy polling, while
> > falling back onto interrupt processing to limit CPU overhead.]
> > 
> > "Instead of re-enabling the respective interrupt(s) as soon as
> > epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
> > masked until a subsequent epoll_wait() call comes up empty, i.e., no
> > events of interest are found and the application thread is about to be
> > blocked."
> > 
> > "A fallback technical approach would use a kernel timeout set on the
> > return path from epoll_wait(). If necessary, the timeout re-enables
> > interrupts regardless of the application´s (mis)behaviour."
> > [Where misbehavior is not calling epoll_wait again]
> > 
> > "The resulting execution model mimics the execution model of typical
> > user-level network stacks and does not add any requirements compared
> > to user-level networking. In fact, it is slightly better, because it
> > can resort to blocking and interrupt delivery, instead of having to
> > continuously busyloop during idle times."
> >
> > This last part shows a preference on your part to a trade-off:
> > you want low latency, but also low cpu utilization if possible.
> > This also came up in this thread. Please state that design decision
> > explicitly.
> 
> Sure, we can include that in the list of cover letter updates we
> need to make. I could have called it out more clearly, but in the cover
> letter [1] I mentioned that latency improved for compared CPU usage (i.e.
> CPU efficiency improved):
> 
>   The overall takeaway from the results below is that the new mechanism
>   (suspend20, see below) results in reduced 99th percentile latency and
>   increased QPS in the MAX QPS case (compared to the other cases), and
>   reduced latency in the lower QPS cases for comparable CPU usage to the
>   base case (and less CPU than the busy case).
> 
> > There are plenty of workloads where burning a core is acceptable
> > (especially as core count continues increasing), not "slightly worse".
> 
> Respectfully, I don't think I'm on board with this argument. "Burning a core"
> has side effects even as core counts increase (power, cooling, etc) and it
> seems the counter argument is equally valid, as well: there are plenty of
> workloads where burning a core is undesirable.

Even more, likely. As this might be usable for standard efficiency
focused production services.

> Using less CPU to get comparable performance is strictly better, even if a
> system can theoretically support the increased CPU/power/cooling load.

If it is always a strict win yes. But falling back onto interrupts
with standard moderation will not match busy polling in all cases.

Different solutions for different workloads. No need to stack rank
them. My request is just to be explicit which design point this
chooses, and that the other design point (continuous busy polling) is
already addressed in Linux kernel busypolling.

> Either way: this is not an either/or. Adding support for the code we've
> proposed will be very beneficial for an important set of workloads without
> taking anything anyway.
> 
> - Joe
> 
> [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/