netdev - Re: [RFC, RESEND] UDP receive path batching improvement

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+S1hyPbo5io2khLk_UTfoQgEtnjYUUJTzreYufmbii+A@mail.gmail.com>
Date: Fri, 22 Aug 2025 06:10:28 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Balazs Scheidler <bazsi77@...il.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement

On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@...il.com> wrote:
>
> On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > > done to the counter.
> > > > >
> > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > before the user-space process even has a chance to consume them.
> > > > >
> > > >
> > > >
> > > > This seems very high usage for a single UDP socket.
> > > >
> > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > (and possibly more threads) ?
> > >
> > > Yes.  I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > load over multiple sockets evenly, instead of the normal load balancing
> > > algorithm built into SO_REUSEPORT.
> > >
> >
> > Great. But if you have many receive queues, are you sure this choice does not
> > add false sharing ?
>
> I am not sure how that could trigger false sharing here.  I am using a
> "socket" filter, which generates a random number modulo the number of
> sockets:
>
> ```
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
>
> int number_of_sockets;
>
> SEC("socket")
> int random_choice(struct __sk_buff *skb)
> {
>   if (number_of_sockets == 0)
>     return -1;
>
>   return bpf_get_prandom_u32() % number_of_sockets;
> }
> ```

How many receive queues does your NIC have (ethtool -l eth0) ?

This filter causes huge contention on the receive queues and various
socket fields, accessed by different cpus.

You should instead perform a choice based on the napi_id (skb->napi_id)


>
> Last I've checked the code, all it did was putting the incoming packet into
> the right socket buffer, as returned by the filter. What would be the false
> sharing in this case?
>
> >
> > > Sometimes the processing on the userspace side is heavy enough (think of
> > > parsing, heuristics, data normalization) and the load on the box heavy
> > > enough that I still see drops from time to time.
> > >
> > > If a client sends 100k messages in a tight loop for a while, that's going to
> > > use a lot of buffer space.  What bothers me further is that it could be ok
> > > to lose a single packet, but any time we drop one packet, we will continue
> > > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> > > receive buffer is completely emptied).  This problem, combined with small
> > > packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> > > of the socket buffer is a huge offset.
> >
> > sock_writeable() uses a 50% threshold.
>
> I am not sure why this is relevant here, the write side of sockets can
> easily be flow controlled (e.g. the process waiting until it can send more
> data). Also my clients are not necessarily client boxes. PaloAlto firewalls
> can generate 70k events-per-second in syslog alone. And that does leave the
> firewall, and my challenge is to read all of that.
>
> >
> > >
> > > I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> > > assume that 1 update every 100 packets should still be OK.
> >
> > Maybe, but some UDP packets have a truesize around 128 KB or even more.
>
> I understand that the truesize incorporates struct sk_buff header and we may
> also see non-linear SKBs, which could inflate the number (saying this without really
> understanding all the specifics there).
>
> >
> > Perhaps add a new UDP socket option to let the user decide on what
> > they feel is better for them ?
>
> I wanted to avoid a knob for this, but I can easily implement this way. So
> should I create a patch for a setsockopt() that allows setting
> udp_sk->forward_threshold?
>
> >
> > I suspect that the main issue is about having a single drop in the first place,
> > because of false sharing on sk->sk_drops
> >
> > Perhaps we should move sk_drops on a dedicated cache line,
> > and perhaps have two counters for NUMA servers.
>
> I am looking into sk_drops, I don't know what it does at the moment, it's
> been a while I've last read this codebase :)
>

Can you post

ss -aum src :1000  <replace 1000 with your UDP source port>

We will check the dXXXX output (number of drops), per socket.