Message-ID: <aKhxpuawARQlCj29@bzorp3>
Date: Fri, 22 Aug 2025 15:33:26 +0200
From: Balazs Scheidler <bazsi77@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement
On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as the threshold for
> > > > > > when the update is done to the counter.
> > > > > >
> > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > before the user-space process even has a chance to consume them.
> > > > > >
> > > > >
> > > > >
> > > > > This seems very high usage for a single UDP socket.
> > > > >
> > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > (and possibly more threads) ?
> > > >
> > > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > algorithm built into SO_REUSEPORT.
> > > >
> > >
> > > Great. But if you have many receive queues, are you sure this choice does not
> > > add false sharing ?
> >
> > I am not sure how that could trigger false sharing here. I am using a
> > "socket" filter, which generates a random number modulo the number of
> > sockets:
> >
> > ```
> > #include "vmlinux.h"
> > #include <bpf/bpf_helpers.h>
> >
> > int number_of_sockets;
> >
> > SEC("socket")
> > int random_choice(struct __sk_buff *skb)
> > {
> > if (number_of_sockets == 0)
> > return -1;
> >
> > return bpf_get_prandom_u32() % number_of_sockets;
> > }
> > ```
>
> How many receive queues does your NIC have (ethtool -l eth0) ?
>
> This filter causes huge contention on the receive queues and various
> socket fields, accessed by different cpus.
>
> You should instead perform a choice based on the napi_id (skb->napi_id)
I don't have ssh access to the box, unfortunately. I'll look into napi_id;
my historical understanding of the IP stack was that a single thread handles
incoming datagrams, but I have to accept that this knowledge has not aged
well. Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't
have a say in that...).
This box is a VM, but I am not even sure which virtualization stack is used;
I am still trying to find out the number of receive queues.
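If the napi_id approach works out, I imagine the filter would look roughly
like the sketch below (untested; it assumes skb->napi_id is readable from a
SEC("socket") program and actually populated on this kernel, i.e.
CONFIG_NET_RX_BUSY_POLL is enabled, and the function name is just
illustrative):

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

int number_of_sockets;

SEC("socket")
int napi_choice(struct __sk_buff *skb)
{
	if (number_of_sockets == 0)
		return -1;

	/* Keep every packet from a given NAPI instance (receive queue) on
	 * the same socket, so one queue is not spread across CPUs/sockets.
	 * The modulo is a crude mapping and does not guarantee an even
	 * spread across sockets. */
	return skb->napi_id % number_of_sockets;
}
```

It would presumably be attached the same way as the current random filter
(SO_ATTACH_REUSEPORT_EBPF).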
That said, I was under the impression that the bottleneck is in userspace,
in the round trip it takes for the process to get back to receiving UDP.
The same event loop processes a number of connections/UDP sockets in
parallel, and sometimes syslog-ng just doesn't get back to the socket quickly
enough when there is too much to do with a specific datagram. My assumption
has been that it is this latency that causes datagrams to be dropped.
>
>
> >
> > Last I checked the code, all it did was put the incoming packet into the
> > receive buffer of the socket returned by the filter. What would be the
> > false sharing in this case?
> >
> > >
> > > > Sometimes the processing on the userspace side is heavy enough (think of
> > > > parsing, heuristics, data normalization) and the load on the box heavy
> > > > enough that I still see drops from time to time.
> > > >
> > > > If a client sends 100k messages in a tight loop for a while, that's going to
> > > > use a lot of buffer space. What bothers me further is that it could be OK
> > > > to lose a single packet, but any time we drop one packet, we will continue
> > > > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or until the
> > > > receive buffer is completely emptied). This problem, combined with small
> > > > packets (think of 100-150 byte payloads), can easily cause excessive drops.
> > > > 25% of the socket buffer is a huge offset.
> > >
> > > sock_writeable() uses a 50% threshold.
> >
> > I am not sure why this is relevant here; the write side of a socket can
> > easily be flow controlled (e.g. the process waits until it can send more
> > data). Also, my clients are not necessarily client boxes: PaloAlto firewalls
> > can generate 70k events per second in syslog alone. That traffic does leave
> > the firewall, and my challenge is to read all of it.
> >
> > >
> > > >
> > > > I am not sure how many packets warrant an sk_rmem_alloc update, but I'd
> > > > assume that one update every 100 packets should still be OK.
> > >
> > > Maybe, but some UDP packets have a truesize around 128 KB or even more.
> >
> > I understand that the truesize incorporates the struct sk_buff header and
> > that we may also see non-linear SKBs, which could inflate the number (saying
> > this without really understanding all the specifics there).
> >
> > >
> > > Perhaps add a new UDP socket option to let the user decide on what
> > > they feel is better for them ?
> >
> > I wanted to avoid a knob for this, but I can easily implement it that way.
> > So should I create a patch for a setsockopt() that allows setting
> > udp_sk->forward_threshold?
> >
> > >
> > > I suspect that the main issue is about having a single drop in the first place,
> > > because of false sharing on sk->sk_drops
> > >
> > > Perhaps we should move sk_drops on a dedicated cache line,
> > > and perhaps have two counters for NUMA servers.
> >
> > I am looking into sk_drops; I don't know what it does at the moment, it's
> > been a while since I last read this codebase :)
> >
>
> Can you post
>
> ss -aum src :1000 <replace 1000 with your UDP source port>
>
> We will check the dXXXX output (number of drops), per socket.
I don't have access to "ss", but I have this screenshot of similar
metrics that we collect every 30 seconds:
https://drive.google.com/file/d/1HrMHSrbrkwCILQiBgAZw-J1r39PBED0f/view?usp=sharing
These metrics are collected via SK_MEMINFO from each of the sockets.
Similar to this case, drops usually happen on all the threads at once, even
if the receive rate is really low. Right now (when this screenshot was
taken), the UDP socket buffer remained at ~400kB (the default, as the sysctl
knobs were not persisted).
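For reference, the per-socket counters in that screenshot can be read with
something along the lines of the sketch below (a minimal getsockopt(SO_MEMINFO)
reader, not the exact code we use; the fallback define covers older userspace
headers and uses the asm-generic value):

```
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/types.h>
#include <linux/sock_diag.h>	/* SK_MEMINFO_* indices */

#ifndef SO_MEMINFO
#define SO_MEMINFO 55		/* asm-generic value, kernel >= 4.12 */
#endif

/* Print a few SK_MEMINFO counters for one socket; returns -1 on error. */
static int dump_sk_meminfo(int fd)
{
	__u32 meminfo[SK_MEMINFO_VARS];
	socklen_t len = sizeof(meminfo);

	memset(meminfo, 0, sizeof(meminfo));
	if (getsockopt(fd, SOL_SOCKET, SO_MEMINFO, meminfo, &len) < 0)
		return -1;

	printf("rmem_alloc=%u rcvbuf=%u drops=%u\n",
	       meminfo[SK_MEMINFO_RMEM_ALLOC],
	       meminfo[SK_MEMINFO_RCVBUF],
	       meminfo[SK_MEMINFO_DROPS]);
	return 0;
}
```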
--
Bazsi