Message-ID: <CANn89iLy4znFBLK2bENWMfhPyjTc_gkLRswAf92uV7KY3bTdYg@mail.gmail.com>
Date: Fri, 22 Aug 2025 01:18:36 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Balazs Scheidler <bazsi77@...il.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement

On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@...il.com> wrote:
>
> Hi,
>
> There's this patch from 2016:
>
> commit 6b229cf77d683f634f0edd876c6d1015402303ad
> Author: Eric Dumazet <edumazet@...gle.com>
> Date:   Thu Dec 8 11:41:56 2016 -0800
>
>     udp: add batching to udp_rmem_release()
>
> This patch delays updates to the socket's receive buffer accounting
> (sk->sk_rmem_alloc) to avoid cache-line ping-pong between the network
> receive path and the user-space process.
>
> This change in particular causes an issue for us in our use-case:
>
> +       if (likely(partial)) {
> +               up->forward_deficit += size;
> +               size = up->forward_deficit;
> +               if (size < (sk->sk_rcvbuf >> 2) &&
> +                   !skb_queue_empty(&sk->sk_receive_queue))
> +                       return;
> +       } else {
> +               size += up->forward_deficit;
> +       }
> +       up->forward_deficit = 0;
>
> The condition above uses "sk->sk_rcvbuf >> 2" as the threshold that triggers
> the flush of the accumulated deficit into the counter.
>
> In our case (syslog reception over UDP), socket buffers are generally tuned
> up (on the order of 32MB or even more; I have seen 256MB as well), because
> the senders can generate spikes in their traffic and a lot of senders send
> to the same port. Due to latencies, these buffers sometimes accumulate MBs
> of data before the user-space process even has a chance to consume them.
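>
> For context, here is roughly how we bump the buffer from user space (a
> minimal sketch; the 32MB figure and the lack of further error handling are
> only illustrative, and the kernel clamps the request to net.core.rmem_max
> unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN):
>
> #include <stdio.h>
> #include <sys/socket.h>
>
> int setup_syslog_socket(void)
> {
>         int fd = socket(AF_INET, SOCK_DGRAM, 0);
>         int rcvbuf = 32 << 20;  /* ask for a 32MB receive buffer */
>
>         /* The kernel doubles the value for bookkeeping overhead and
>          * clamps it to net.core.rmem_max; verify with getsockopt().
>          */
>         if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
>                 perror("setsockopt(SO_RCVBUF)");
>         return fd;
> }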
>


This seems like very high usage for a single UDP socket.

Have you tried SO_REUSEPORT to spread incoming packets to more sockets
(and possibly more threads)?
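
Something along these lines, one socket per worker thread (untested
sketch, error handling omitted):

#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int bind_reuseport_socket(unsigned short port)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        /* Every socket bound to the same addr/port must set SO_REUSEPORT
         * before bind(); the kernel then hashes incoming datagrams across
         * the group, so each worker drains its own receive queue.
         */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;
}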


> If we were talking about video or voice streams sent over UDP, the current
> behaviour would make a lot of sense: upon the very first drop, also drop
> subsequent packets until things recover.
>
> However, in the case of syslog, every message is an isolated data point and
> subsequent packets are not related at all.
>
> Due to this batching, the kernel always "overestimates" how full the receive
> buffer is: with a 32MB buffer, up to 8MB worth of already-dequeued datagrams
> can still be counted against sk->sk_rmem_alloc before the deficit is flushed.
>
> Instead of using 25% of the receive buffer, couldn't we use a different
> trigger mechanism? These are my thoughts:
>   1) a simple packet counter: if the datagrams are small, a byte-based
>      estimate can correspond to a widely varying number of packets, and the
>      packet count is ultimately what drives the overhead here (see the
>      sketch below)
>   2) cap the byte-based threshold at 64k-128k or so, as we might be in the
>      MBs range with typical buffer sizes.
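>
> A rough sketch of what option 1 might look like in udp_rmem_release(),
> based on the excerpt above (forward_packets is a hypothetical new
> udp_sock field and the 64-datagram threshold is only illustrative):
>
>        if (likely(partial)) {
>                up->forward_deficit += size;
>                up->forward_packets++;  /* hypothetical per-socket counter */
>                size = up->forward_deficit;
>                /* flush once a bounded number of datagrams has accumulated,
>                 * not only once 25% of sk_rcvbuf worth of bytes has
>                 */
>                if (up->forward_packets < 64 &&
>                    size < (sk->sk_rcvbuf >> 2) &&
>                    !skb_queue_empty(&sk->sk_receive_queue))
>                        return;
>        } else {
>                size += up->forward_deficit;
>        }
>        up->forward_deficit = 0;
>        up->forward_packets = 0;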
>
> Both of these solutions should reduce UDP syslog data loss on reception and
> still amortize the cost of modifying sk->sk_rmem_alloc (i.e. the cache
> ping-pong).
>
> Here's a POC patch that implements the 2nd solution, but I think I would
> prefer the first one.
>
> Feedback welcome.
>
> diff --git a/include/net/udp.h b/include/net/udp.h
> index e2af3bda90c9..222c0267af17 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -284,13 +284,18 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
>  struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>                                   netdev_features_t features, bool is_ipv6);
>
> +static inline int udp_lib_forward_threshold(struct sock *sk)
> +{
> +       return min(sk->sk_rcvbuf >> 2, 65536);
> +}
> +
>  static inline void udp_lib_init_sock(struct sock *sk)
>  {
>         struct udp_sock *up = udp_sk(sk);
>
>         skb_queue_head_init(&up->reader_queue);
>         INIT_HLIST_NODE(&up->tunnel_list);
> -       up->forward_threshold = sk->sk_rcvbuf >> 2;
> +       up->forward_threshold = udp_lib_forward_threshold(sk);
>         set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>  }
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index cc3ce0f762ec..00647213db86 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2953,7 +2953,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>                 if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
>                         sockopt_lock_sock(sk);
>                         /* paired with READ_ONCE in udp_rmem_release() */
> -                       WRITE_ONCE(up->forward_threshold, sk->sk_rcvbuf >> 2);
> +                       WRITE_ONCE(up->forward_threshold, udp_lib_forward_threshold(sk));
>                         sockopt_release_sock(sk);
>                 }
>                 return err;
>
> I am happy to submit a proper patch if this is something feasible. Thank you.
>
> --
> Bazsi
> Happy Logging!
