Message-ID: <aKho5v5VwxdNstYy@bzorp3>
Date: Fri, 22 Aug 2025 14:56:06 +0200
From: Balazs Scheidler <bazsi77@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement
On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > done to the counter.
> > > >
> > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > before the user-space process even has a chance to consume them.
> > > >
> > >
> > >
> > > This seems very high usage for a single UDP socket.
> > >
> > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > (and possibly more threads) ?
> >
> > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > load over multiple sockets evenly, instead of the normal load balancing
> > algorithm built into SO_REUSEPORT.
> >
>
> Great. But if you have many receive queues, are you sure this choice does not
> add false sharing ?
I am not sure how that could trigger false sharing here. I am using a
"socket" filter, which generates a random number modulo the number of
sockets:
```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* set from user space before the program is attached via
 * SO_ATTACH_REUSEPORT_EBPF */
int number_of_sockets;

SEC("socket")
int random_choice(struct __sk_buff *skb)
{
	/* the return value selects the socket index within the
	 * SO_REUSEPORT group; an out-of-range index makes the kernel
	 * fall back to its default hash-based selection */
	if (number_of_sockets == 0)
		return -1;
	return bpf_get_prandom_u32() % number_of_sockets;
}
```
Last time I checked the code, all it did was put the incoming packet into
the receive queue of the socket returned by the filter. What would be the
false sharing in this case?
>
> > Sometimes the processing on the userspace side is heavy enough (think of
> > parsing, heuristics, data normalization) and the load on the box heavy
> > enough that I still see drops from time to time.
> >
> > If a client sends 100k messages in a tight loop for a while, that's going to
> > use a lot of buffer space. What bothers me further is that it could be ok
> > to lose a single packet, but any time we drop one packet, we will continue
> > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> > receive buffer is completely emptied). This problem, combined with small
> > packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> > of the socket buffer is a huge offset.
>
> sock_writeable() uses a 50% threshold.
I am not sure why this is relevant here; the write side of a socket can
easily be flow controlled (e.g. the process waits until it can send more
data). Also, my clients are not necessarily client boxes: Palo Alto
firewalls can generate 70k events per second in syslog alone. All of that
leaves the firewall, and my challenge is to read all of it.
>
> >
> > I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> > assume that 1 update every 100 packets should still be OK.
>
> Maybe, but some UDP packets have a truesize around 128 KB or even more.
I understand that truesize incorporates the struct sk_buff header, and we may
also see non-linear skbs, which could inflate the number (saying this without
really understanding all the specifics there).
>
> Perhaps add a new UDP socket option to let the user decide on what
> they feel is better for them ?
I wanted to avoid a knob for this, but I can easily implement it that way. So
should I create a patch for a setsockopt() that allows setting
udp_sk->forward_threshold?
>
> I suspect that the main issue is about having a single drop in the first place,
> because of false sharing on sk->sk_drops
>
> Perhaps we should move sk_drops on a dedicated cache line,
> and perhaps have two counters for NUMA servers.
I am looking into sk_drops; I don't know what it does at the moment, as it's
been a while since I last read this codebase :)
--
Bazsi