Message-ID: <aKh_yi0gASYajhev@bzorp3>
Date: Fri, 22 Aug 2025 16:33:46 +0200
From: Balazs Scheidler <bazsi77@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement
On Fri, Aug 22, 2025 at 06:56:03AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 6:33 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as the trigger for when
> > > > > > > > the counter is actually updated.
> > > > > > > >
> > > > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > > > to the same port. Due to latencies, these buffers sometimes accumulate MBs
> > > > > > > > of data before the user-space process even has a chance to consume them.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This seems like very high usage for a single UDP socket.
> > > > > > >
> > > > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > > > (and possibly more threads) ?
> > > > > >
> > > > > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > > > algorithm built into SO_REUSEPORT.
> > > > > >
> > > > >
> > > > > Great. But if you have many receive queues, are you sure this choice does not
> > > > > add false sharing ?
> > > >
> > > > I am not sure how that could trigger false sharing here. I am using a
> > > > "socket" filter, which generates a random number modulo the number of
> > > > sockets:
> > > >
> > > > ```
> > > > #include "vmlinux.h"
> > > > #include <bpf/bpf_helpers.h>
> > > >
> > > > int number_of_sockets;
> > > >
> > > > SEC("socket")
> > > > int random_choice(struct __sk_buff *skb)
> > > > {
> > > >         if (number_of_sockets == 0)
> > > >                 return -1;
> > > >
> > > >         return bpf_get_prandom_u32() % number_of_sockets;
> > > > }
> > > > ```
> > >
> > > How many receive queues does your NIC have (ethtool -l eth0) ?
> > >
> > > This filter causes huge contention on the receive queues and various
> > > socket fields, accessed by different cpus.
> > >
> > > You should instead perform a choice based on the napi_id (skb->napi_id)
> >
> > I don't have ssh access to the box, unfortunately. I'll look into napi_id;
> > my historical knowledge of the IP stack was that a single thread handles
> > incoming datagrams, but I have to admit that information has not aged well.
> > Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't have a say
> > in that...).
> >
> > This box is a VM, but I am not even sure which virtualization stack is
> > used; I am still finding out the number of receive queues.
>
> I think this is the critical part. The optimal eBPF program depends on this.
>
> In any case, the 25% threshold makes the usable capacity smaller,
> so I would advise setting bigger SO_RCVBUF values.
Thank you, that's exactly what we are doing. The box was power-cycled and we
lost the settings. I am now improving the eBPF load balancing algorithm so
that we make better use of caches on the kernel receive side.
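
Roughly the direction I am experimenting with is sketched below, along the
lines of your skb->napi_id suggestion. It is untested and the details (the
function name, the random fallback) are just my own placeholders; the point
is that packets from one receive queue keep landing on the same socket:

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

int number_of_sockets;

SEC("socket")
int napi_based_choice(struct __sk_buff *skb)
{
        if (number_of_sockets == 0)
                return -1;

        /* skb->napi_id can be 0 (e.g. no busy-poll support in the driver);
         * fall back to a random pick in that case.
         */
        if (skb->napi_id == 0)
                return bpf_get_prandom_u32() % number_of_sockets;

        /* Same NAPI id (i.e. same receive queue) -> same socket, so a given
         * softirq CPU keeps touching the same socket's fields.
         */
        return skb->napi_id % number_of_sockets;
}

char LICENSE[] SEC("license") = "GPL";
```

If I understand correctly, NAPI ids are handed out sequentially, so the plain
modulo should spread the queues roughly evenly over the 16 sockets; if not, a
small map from napi_id to socket index would do the same job.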
What do you think about the recovery-from-drop part? I mean, if sk_rmem_alloc
were updated sooner as userspace consumes packets, a single packet drop would
not cause this many packets to be lost, at the cost of loss events being more
spread out in time.
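
To make the concern concrete, this is how I read the deferral in
udp_rmem_release() in net/ipv4/udp.c on a 4.18-era kernel. The snippet below
is my simplified paraphrase (locking and the sk_forward_alloc bookkeeping are
elided), not a verbatim copy:

```
#include <linux/skbuff.h>
#include <linux/udp.h>
#include <net/sock.h>

/* On the normal recvmsg path (partial release) the freed bytes are only
 * accumulated in up->forward_deficit; sk_rmem_alloc is not decremented
 * until the deficit crosses a quarter of sk_rcvbuf.
 */
static void udp_rmem_release_sketch(struct sock *sk, int size, int partial)
{
        struct udp_sock *up = udp_sk(sk);

        if (partial) {
                up->forward_deficit += size;
                size = up->forward_deficit;

                /* Defer the expensive updates until 25% of the receive
                 * buffer has been consumed by userspace.
                 */
                if (size < (sk->sk_rcvbuf >> 2) &&
                    !skb_queue_empty(&up->reader_queue))
                        return;
        }
        up->forward_deficit = 0;

        /* ... sk_forward_alloc bookkeeping elided ... */

        atomic_sub(size, &sk->sk_rmem_alloc);
}
```

With the 32MB+ buffers we run, up to a quarter of that can be data userspace
has already consumed but which still counts against the rcvbuf check on
enqueue, so once we start dropping we tend to keep dropping for a while.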
Would something like my original posting be acceptable?
--
Bazsi