Message-ID: <aKgnLcw6yzq78CIP@bzorp3>
Date: Fri, 22 Aug 2025 10:15:41 +0200
From: Balazs Scheidler <bazsi77@...il.com>
To: netdev@...r.kernel.org
Cc: Eric Dumazet <edumazet@...gle.com>, pabeni@...hat.com
Subject: [RFC, RESEND] UDP receive path batching improvement
Hi,
There's this patch from 2018:
commit 6b229cf77d683f634f0edd876c6d1015402303ad
Author: Eric Dumazet <edumazet@...gle.com>
Date: Thu Dec 8 11:41:56 2016 -0800
udp: add batching to udp_rmem_release()
This patch delays updates to the socket's receive memory counter
(sk->sk_rmem_alloc) to avoid cache-line ping-pong between the network
receive path and the user-space process.
This change in particular causes an issue in our use case:
+ if (likely(partial)) {
+ up->forward_deficit += size;
+ size = up->forward_deficit;
+ if (size < (sk->sk_rcvbuf >> 2) &&
+ !skb_queue_empty(&sk->sk_receive_queue))
+ return;
+ } else {
+ size += up->forward_deficit;
+ }
+ up->forward_deficit = 0;
The condition above uses "sk->sk_rcvbuf >> 2" as the threshold that decides
when the deferred update is applied to the counter.
In our case (syslog reception over UDP), socket buffers are generally tuned
up (on the order of 32MB or even more; I have seen 256MB as well), as the
senders can generate traffic spikes and a lot of senders send to the same
port. Due to latencies, these buffers sometimes hold MBs of data before the
user-space process even has a chance to consume them.
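For context, the receiver side typically requests such buffers roughly like
the user-space snippet below (just an illustration of the deployment, not
part of the proposal; SO_RCVBUF is capped by net.core.rmem_max, so either
the sysctl is raised or SO_RCVBUFFORCE is used):

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int rcvbuf = 32 << 20;	/* ask for a 32MB receive buffer */

	/* effective size is limited by net.core.rmem_max unless
	 * SO_RCVBUFFORCE (CAP_NET_ADMIN) is used instead
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
		perror("setsockopt(SO_RCVBUF)");
	close(fd);
	return 0;
}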
If we were talking about video or voice streams sent over UDP, the current
behaviour makes a lot of sense: once the first packet is dropped, dropping
the subsequent ones as well until things recover costs little, since they
belong to the same stream.
However, in the case of syslog, every message is an isolated data point and
subsequent packets are not related at all.
Due to this batching, the kernel always "overestimates" how full the receive
buffer is.
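To put numbers on the overestimate (back-of-the-envelope only, not measured
data):

#include <stdio.h>

/* Worst-case amount of already-consumed memory still charged to
 * sk_rmem_alloc under the current "sk->sk_rcvbuf >> 2" batching threshold.
 */
int main(void)
{
	long bufs[] = { 32L << 20, 256L << 20 };	/* typical tuned-up SO_RCVBUF values */

	for (int i = 0; i < 2; i++)
		printf("rcvbuf %4ld MB -> up to %3ld MB of consumed data still counted\n",
		       bufs[i] >> 20, bufs[i] >> 22);
	return 0;
}

So with a 32MB buffer up to 8MB, and with 256MB up to 64MB, of data that
user space has already read can still be charged against the socket, and
incoming datagrams get dropped against a buffer that is in reality far from
full.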
Instead of using 25% of the receive buffer, couldn't we use a different
trigger mechanism? These are my thoughts:
1) a simple packet counter: with small datagrams, a byte-based threshold can
translate into a widely varying number of packets, and the number of
packets is what ultimately drives the overhead here
2) cap the byte-based threshold at 64k-128k or so, as with typical buffer
sizes we might otherwise be in the MBs range
Both of these solutions should reduce UDP syslog data loss on reception
while still amortizing the update overhead (i.e. the cache ping-pong) of
sk->sk_rmem_alloc.
Here's a POC patch that implements the 2nd solution, but I think I would
prefer the first one.
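For reference, the packet-counter variant (option 1) could look roughly like
the sketch below, relative to udp_rmem_release(). It is untested,
forward_deficit_pkts would be a new field in struct udp_sock, and the
threshold of 64 datagrams is a made-up placeholder that would need
benchmarking:

/* Untested sketch of option 1: flush the deferred sk_rmem_alloc update
 * every N datagrams instead of every sk_rcvbuf/4 bytes.
 * UDP_FORWARD_PKT_THRESHOLD and forward_deficit_pkts are hypothetical.
 */
#define UDP_FORWARD_PKT_THRESHOLD	64

	if (likely(partial)) {
		up->forward_deficit += size;
		up->forward_deficit_pkts++;
		size = up->forward_deficit;
		if (up->forward_deficit_pkts < UDP_FORWARD_PKT_THRESHOLD &&
		    !skb_queue_empty(&sk->sk_receive_queue))
			return;
	} else {
		size += up->forward_deficit;
	}
	up->forward_deficit = 0;
	up->forward_deficit_pkts = 0;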
Feedback welcome.
diff --git a/include/net/udp.h b/include/net/udp.h
index e2af3bda90c9..222c0267af17 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -284,13 +284,18 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
netdev_features_t features, bool is_ipv6);
+static inline int udp_lib_forward_threshold(struct sock *sk)
+{
+ return min(sk->sk_rcvbuf >> 2, 65536);
+}
+
static inline void udp_lib_init_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
skb_queue_head_init(&up->reader_queue);
INIT_HLIST_NODE(&up->tunnel_list);
- up->forward_threshold = sk->sk_rcvbuf >> 2;
+ up->forward_threshold = udp_lib_forward_threshold(sk);
set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cc3ce0f762ec..00647213db86 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2953,7 +2953,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
sockopt_lock_sock(sk);
/* paired with READ_ONCE in udp_rmem_release() */
- WRITE_ONCE(up->forward_threshold, sk->sk_rcvbuf >> 2);
+ WRITE_ONCE(up->forward_threshold, udp_lib_forward_threshold(sk));
sockopt_release_sock(sk);
}
return err;
I am happy to submit a proper patch if this approach seems feasible. Thank you.
--
Bazsi
Happy Logging!