Message-ID: <CANn89i+GMqF91FkjxfGp3KGJ-dC6-Snu3DoBdGuxZqrq=iOOcQ@mail.gmail.com>
Date: Fri, 22 Aug 2025 02:37:28 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Balazs Scheidler <bazsi77@...il.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement

On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
>
> On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
> > > The condition above uses "sk->sk_rcvbuf >> 2" as the threshold for when
> > > the update to the counter is done.
> > >
> > > In our case (syslog receive path via UDP), socket buffers are generally
> > > tuned up (on the order of 32MB or even more, I have seen 256MB as well),
> > > as the senders can generate spikes in their traffic and a lot of senders
> > > send to the same port. Due to latencies, these buffers sometimes
> > > accumulate MBs of data before the user-space process even has a chance
> > > to consume them.
> > >
> >
> >
> > This seems like very high usage for a single UDP socket.
> >
> > Have you tried SO_REUSEPORT to spread incoming packets over more sockets
> > (and possibly more threads)?
>
> Yes.  I use SO_REUSEPORT (16 sockets), and I even use eBPF to distribute
> the load evenly over the sockets, instead of the normal load-balancing
> algorithm built into SO_REUSEPORT.
>

Great. But if you have many receive queues, are you sure this choice does
not add false sharing?
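
As an illustration, a minimal userspace sketch of such a SO_REUSEPORT group
(not taken from syslog-ng; the port, buffer size and missing error handling
are illustrative only).  The eBPF-based distribution mentioned above would
be attached to one member of the group with SO_ATTACH_REUSEPORT_EBPF:

/* Sketch only: 16 UDP sockets sharing one syslog port via SO_REUSEPORT. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSOCK 16

int main(void)
{
	int fds[NSOCK];
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(514),            /* syslog/udp */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	for (int i = 0; i < NSOCK; i++) {
		int one = 1;
		int rcvbuf = 32 << 20;  /* 32MB; capped by net.core.rmem_max
					 * unless SO_RCVBUFFORCE is used */

		fds[i] = socket(AF_INET, SOCK_DGRAM, 0);
		if (fds[i] < 0) {
			perror("socket");
			exit(1);
		}
		setsockopt(fds[i], SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		setsockopt(fds[i], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
		if (bind(fds[i], (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			perror("bind");
			exit(1);
		}
	}

	/* ... hand one fd to each worker thread and recvmsg() in a loop ... */

	for (int i = 0; i < NSOCK; i++)
		close(fds[i]);
	return 0;
}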

> Sometimes the processing on the userspace side is heavy enough (think of
> parsing, heuristics, data normalization) and the load on the box heavy
> enough that I still see drops from time to time.
>
> If a client sends 100k messages in a tight loop for a while, that's going
> to use a lot of buffer space.  What bothers me further is that it could be
> OK to lose a single packet, but any time we drop one packet, we will
> continue to drop all of them, at least until we fetch 25% of SO_RCVBUF (or
> the receive buffer is completely emptied).  This problem, combined with
> small packets (think of a 100-150 byte payload), can easily cause
> excessive drops.  25% of the socket buffer is a huge threshold.

sock_writeable() uses a 50% threshold.
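
For reference, a paraphrase of sock_writeable() from include/net/sock.h
(recent kernels; not an exact quote), shown for contrast with the
sk_rcvbuf >> 2 release threshold discussed above:

/* A sender is considered writeable while less than half of sk_sndbuf is
 * in flight, whereas the UDP receive path described above only folds its
 * pending deficit back into sk_rmem_alloc after a quarter of sk_rcvbuf. */
static inline bool sock_writeable(const struct sock *sk)
{
	return refcount_read(&sk->sk_wmem_alloc) <
	       (READ_ONCE(sk->sk_sndbuf) >> 1);
}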

>
> I am not sure how many packets warrant an sk_rmem_alloc update, but I'd
> assume that one update every 100 packets should still be OK.

Maybe, but some UDP packets have a truesize of around 128 KB or even more.

Perhaps add a new UDP socket option to let users decide what works
better for them?
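
Purely as a sketch of that suggestion, a hypothetical userspace shape for
such a knob; UDP_RELEASE_THRESHOLD is an invented name and option number,
nothing like it exists in the kernel today:

#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical option: ask the kernel to flush the pending sk_rmem_alloc
 * deficit after at most @bytes, instead of the default sk_rcvbuf >> 2. */
#define UDP_RELEASE_THRESHOLD 200	/* made-up option number */

static int set_release_threshold(int fd, int bytes)
{
	return setsockopt(fd, IPPROTO_UDP, UDP_RELEASE_THRESHOLD,
			  &bytes, sizeof(bytes));
}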

I suspect that the main issue is having a single drop in the first place,
because of false sharing on sk->sk_drops.

Perhaps we should move sk_drops to a dedicated cache line, and perhaps
keep two counters for NUMA servers.
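
One possible shape of that idea, as a sketch only (invented names, not
upstream code):

/* Give the drop counter its own cache line so softirq producers that
 * increment it do not dirty hot receive-path fields, and keep one
 * counter per side of a two-node NUMA box to avoid cross-node bouncing. */
struct sk_drop_counters {
	atomic_t drops0 ____cacheline_aligned_in_smp;
	atomic_t drops1 ____cacheline_aligned_in_smp;
};

static inline void sk_drops_inc(struct sk_drop_counters *dc)
{
	atomic_inc(numa_node_id() ? &dc->drops1 : &dc->drops0);
}

static inline int sk_drops_read(const struct sk_drop_counters *dc)
{
	return atomic_read(&dc->drops0) + atomic_read(&dc->drops1);
}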
