Message-ID: <willemdebruijn.kernel.16b82d42e11ef@gmail.com>
Date: Mon, 22 Sep 2025 08:38:42 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Eric Dumazet <edumazet@...gle.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Willem de Bruijn <willemb@...gle.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
netdev@...r.kernel.org,
eric.dumazet@...il.com
Subject: Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
Eric Dumazet wrote:
> On Sun, Sep 21, 2025 at 7:21 PM Willem de Bruijn
> <willemdebruijn.kernel@...il.com> wrote:
> >
> > Eric Dumazet wrote:
> > > busylock was protecting UDP sockets against packet floods,
> > > but unfortunately was not protecting the host itself.
> > >
> > > Under stress, many cpus could spin while acquiring the busylock,
> > > and the NIC had to drop packets. Or packets would be dropped
> > > in the cpu backlog if RPS/RFS were in place.
> > >
> > > This patch replaces the busylock with intermediate
> > > lockless queues (one queue per NUMA node).
> > >
> > > This means that fewer cpus have to acquire
> > > the UDP receive queue lock.
> > >
> > > Most of the cpus can either:
> > > - immediately drop the packet.
> > > - or queue it in their NUMA aware lockless queue.
> > >
> > > Then one of the cpus is chosen to process this lockless queue
> > > in a batch.
> > >
> > > The batch only contains packets that were cooked on the same
> > > NUMA node, thus with very limited latency impact.
> > >
> > > Tested:
> > >
> > > DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
> > > (Intel(R) Xeon(R) 6985P-C)
> > >
> > > Before:
> > >
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams 1004179 0.0
> > > Udp6InErrors 3117 0.0
> > > Udp6RcvbufErrors 3117 0.0
> > >
> > > After:
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams 1116633 0.0
> > > Udp6InErrors 14197275 0.0
> > > Udp6RcvbufErrors 14197275 0.0
> > >
> > > We can see this host can now process 14.2 M more packets per second
> > > while under attack, and the victim socket can receive 11% more
> > > packets.
> >
> > Impressive gains under DoS!
> >
> > My main concern is that this adds an extra queue/dequeue, and thus
> > some cycle cost, for all udp sockets in the common case where they
> > are not contended. These are simple linked list operations, so I
> > suppose the only cost may be the cacheline if it is not warm. The
> > busylock had the nice property of only being taken under memory
> > pressure. Could this benefit from the same?
>
> I hear you, but the extra cache line is local to the node, compared to
> the prior non-local and shared cache line, especially on modern cpus
> (AMD Turin / Venice or Intel Granite Rapids).
>
> This is a very minor cost, compared to the average delay between a
> packet being received on the wire and being presented to
> __udp_enqueue_schedule_skb().
>
> And the core of the lockless algorithm is that all packets go through
> the same logic.
> See my following answer.
>
> >
> > > I used a small bpftrace program measuring time (in us) spent in
> > > __udp_enqueue_schedule_skb().
> > >
> > > Before:
> > >
> > > @udp_enqueue_us[398]:
> > > [0] 24901 |@@@ |
> > > [1] 63512 |@@@@@@@@@ |
> > > [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [8, 16) 54022 |@@@@@@@@ |
> > > [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [64, 128) 4219 | |
> > > [128, 256) 188 | |
> > >
> > > After:
> > >
> > > @udp_enqueue_us[398]:
> > > [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [1] 1111277 |@@@@@@@@@@ |
> > > [2, 4) 501439 |@@@@ |
> > > [4, 8) 102921 | |
> > > [8, 16) 29895 | |
> > > [16, 32) 43500 | |
> > > [32, 64) 31552 | |
> > > [64, 128) 979 | |
> > > [128, 256) 13 | |
> > >
> > > Note that the remaining bottleneck for this platform is in
> > > udp_drops_inc() because we limited struct numa_drop_counters
> > > to only two nodes so far.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > ---
> > > v3: - Moved kfree(up->udp_prod_queue) to udp_destruct_common(),
> > > addressing reports from Jakub and syzbot.
> > >
> > > - Perform SKB_DROP_REASON_PROTO_MEM drops after the queue
> > > spinlock is released.
> > >
> > > v2: https://lore.kernel.org/netdev/20250920080227.3674860-1-edumazet@google.com/
> > > - Added a kfree(up->udp_prod_queue) in udpv6_destroy_sock() (Jakub feedback on v1)
> > > - Added bpftrace histograms in changelog.
> > >
> > > v1: https://lore.kernel.org/netdev/20250919164308.2455564-1-edumazet@google.com/
> > >
> > > include/linux/udp.h | 9 +++-
> > > include/net/udp.h | 11 ++++-
> > > net/ipv4/udp.c | 114 ++++++++++++++++++++++++++------------------
> > > net/ipv6/udp.c | 5 +-
> > > 4 files changed, 88 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > > index e554890c4415b411f35007d3ece9e6042db7a544..58795688a18636ea79aa1f5d06eacc676a2e7849 100644
> > > --- a/include/linux/udp.h
> > > +++ b/include/linux/udp.h
> > > @@ -44,6 +44,12 @@ enum {
> > > UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
> > > };
> > >
> > > +/* per NUMA structure for lockless producer usage. */
> > > +struct udp_prod_queue {
> > > + struct llist_head ll_root ____cacheline_aligned_in_smp;
> > > + atomic_t rmem_alloc;
> > > +};
> > > +
> > > struct udp_sock {
> > > /* inet_sock has to be the first member */
> > > struct inet_sock inet;
> > > @@ -90,6 +96,8 @@ struct udp_sock {
> > > struct sk_buff *skb,
> > > int nhoff);
> > >
> > > + struct udp_prod_queue *udp_prod_queue;
> > > +
> > > /* udp_recvmsg try to use this before splicing sk_receive_queue */
> > > struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;
> > >
> > > @@ -109,7 +117,6 @@ struct udp_sock {
> > > */
> > > struct hlist_node tunnel_list;
> > > struct numa_drop_counters drop_counters;
> > > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > > };
> > >
> > > #define udp_test_bit(nr, sk) \
> > > diff --git a/include/net/udp.h b/include/net/udp.h
> > > index 059a0cee5f559b8d75e71031a00d0aa2769e257f..cffedb3e40f24513e44fb7598c0ad917fd15b616 100644
> > > --- a/include/net/udp.h
> > > +++ b/include/net/udp.h
> > > @@ -284,16 +284,23 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
> > > struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> > > netdev_features_t features, bool is_ipv6);
> > >
> > > -static inline void udp_lib_init_sock(struct sock *sk)
> > > +static inline int udp_lib_init_sock(struct sock *sk)
> > > {
> > > struct udp_sock *up = udp_sk(sk);
> > >
> > > sk->sk_drop_counters = &up->drop_counters;
> > > - spin_lock_init(&up->busylock);
> > > skb_queue_head_init(&up->reader_queue);
> > > INIT_HLIST_NODE(&up->tunnel_list);
> > > up->forward_threshold = sk->sk_rcvbuf >> 2;
> > > set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> > > +
> > > + up->udp_prod_queue = kcalloc(nr_node_ids, sizeof(*up->udp_prod_queue),
> > > + GFP_KERNEL);
> > > + if (!up->udp_prod_queue)
> > > + return -ENOMEM;
> > > + for (int i = 0; i < nr_node_ids; i++)
> > > + init_llist_head(&up->udp_prod_queue[i].ll_root);
> > > + return 0;
> > > }
> > >
> > > static inline void udp_drops_inc(struct sock *sk)
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 85cfc32eb2ccb3e229177fb37910fefde0254ffe..fce1d0ffd6361d271ae3528fea026a8d6c07ac6e 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -1685,25 +1685,6 @@ static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
> > > udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
> > > }
> > >
> > > -/* Idea of busylocks is to let producers grab an extra spinlock
> > > - * to relieve pressure on the receive_queue spinlock shared by consumer.
> > > - * Under flood, this means that only one producer can be in line
> > > - * trying to acquire the receive_queue spinlock.
> > > - */
> > > -static spinlock_t *busylock_acquire(struct sock *sk)
> > > -{
> > > - spinlock_t *busy = &udp_sk(sk)->busylock;
> > > -
> > > - spin_lock(busy);
> > > - return busy;
> > > -}
> > > -
> > > -static void busylock_release(spinlock_t *busy)
> > > -{
> > > - if (busy)
> > > - spin_unlock(busy);
> > > -}
> > > -
> > > static int udp_rmem_schedule(struct sock *sk, int size)
> > > {
> > > int delta;
> > > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > {
> > > struct sk_buff_head *list = &sk->sk_receive_queue;
> > > + struct udp_prod_queue *udp_prod_queue;
> > > + struct sk_buff *next, *to_drop = NULL;
> > > + struct llist_node *ll_list;
> > > unsigned int rmem, rcvbuf;
> > > - spinlock_t *busy = NULL;
> > > int size, err = -ENOMEM;
> > > + int total_size = 0;
> > > + int q_size = 0;
> > > + int nb = 0;
> > >
> > > rmem = atomic_read(&sk->sk_rmem_alloc);
> > > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > > size = skb->truesize;
> > >
> > > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> >
> > There is a small chance that a cpu enqueues to this queue and no
> > further arrivals on that numa node happen, stranding skbs on this
> > intermediate queue, right? If so, those are leaked in
> > udp_destruct_common().
>
> There is absolutely 0 chance this can occur.
>
> This is because I use the llist_add() return value to decide between:
>
> A) Queue was empty: I am the first cpu to add a packet, so I am
>    elected to drain this queue.
>
>    A.1 Grab the spinlock (competing with recvmsg() or other cpus on
>        other NUMA nodes). While waiting my turn on this spinlock,
>        other cpus can add more packets to the per-NUMA queue I was
>        elected to drain.
>
>    A.2 llist_del_all() to drain the queue. I got my skb, but
>        possibly others as well.
>
>        After llist_del_all(), another cpu trying to add a new packet
>        will eventually see the queue empty again and will be elected
>        to serve it. It will go through A.1 and A.2.
>
> B) Return immediately, because there were other packets in the queue,
>    so I know another cpu is at A.1 or before A.2.
>
>
> Both A) and B) happen under the rcu lock, so udp_destruct_common() will not run.
Of course. Thanks. Nice lockless multi-producer multi-consumer struct.
I had not seen it before. A try_cmpxchg and xchg pair, but on an
uncontended NUMA local cacheline in the normal case.
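
To spell out how I read that pattern, here is a minimal user-space
sketch, with illustrative names and C11 atomics standing in for the
kernel llist helpers (so not the patch itself): push reports the
empty->non-empty transition, which elects exactly one producer to take
the consumer lock and drain, and drain grabs the whole list with a
single exchange.

/* Hypothetical sketch of the llist_add() / llist_del_all() election
 * pattern: queue_push() is the try_cmpxchg loop, queue_drain() the xchg.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct lockless_queue {
	_Atomic(struct node *) head;
};

/* Returns true iff the queue was empty: that producer is elected to
 * take the consumer lock and drain the batch.
 */
static bool queue_push(struct lockless_queue *q, struct node *n)
{
	struct node *old = atomic_load_explicit(&q->head, memory_order_relaxed);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
	return old == NULL;
}

/* Takes the whole list in one exchange, newest entry first (like
 * llist_del_all(); reverse it to restore arrival order).
 */
static struct node *queue_drain(struct lockless_queue *q)
{
	return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}
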
>
> >
> > > +
> > > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > > +
> > > /* Immediately drop when the receive queue is full.
> > > * Cast to unsigned int performs the boundary check for INT_MAX.
> > > */
> > > @@ -1747,45 +1737,75 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > if (rmem > (rcvbuf >> 1)) {
> > > skb_condense(skb);
> > > size = skb->truesize;
> > > - rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > - if (rmem > rcvbuf)
> > > - goto uncharge_drop;
> > > - busy = busylock_acquire(sk);
> > > - } else {
> > > - atomic_add(size, &sk->sk_rmem_alloc);
> > > }
> > >
> > > udp_set_dev_scratch(skb);
> > >
> > > + atomic_add(size, &udp_prod_queue->rmem_alloc);
> > > +
> > > + if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
> > > + return 0;
> > > +
> > > spin_lock(&list->lock);
> > > - err = udp_rmem_schedule(sk, size);
> > > - if (err) {
> > > - spin_unlock(&list->lock);
> > > - goto uncharge_drop;
> > > - }
> > >
> > > - sk_forward_alloc_add(sk, -size);
> > > + ll_list = llist_del_all(&udp_prod_queue->ll_root);
> > >
> > > - /* no need to setup a destructor, we will explicitly release the
> > > - * forward allocated memory on dequeue
> > > - */
> > > - sock_skb_set_dropcount(sk, skb);
> > > + ll_list = llist_reverse_order(ll_list);
> > > +
> > > + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> > > + size = udp_skb_truesize(skb);
> > > + total_size += size;
> > > + err = udp_rmem_schedule(sk, size);
> > > + if (unlikely(err)) {
> > > + /* Free the skbs outside of locked section. */
> > > + skb->next = to_drop;
> > > + to_drop = skb;
> > > + continue;
> > > + }
> > > +
> > > + q_size += size;
> > > + sk_forward_alloc_add(sk, -size);
> > > +
> > > + /* no need to setup a destructor, we will explicitly release the
> > > + * forward allocated memory on dequeue
> > > + */
> > > + sock_skb_set_dropcount(sk, skb);
> >
> > Since drop counters are approximate, read these once and report the
> > same for all packets in a batch?
>
> Good idea, thanks!
> (My test receiver was not setting SOCK_RXQ_OVFL)
>
> I can squash this in v4, or add it as a separate patch.
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index fce1d0ffd6361d271ae3528fea026a8d6c07ac6e..95241093b7f01b2dc31d9520b693f46400e545ff 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1706,6 +1706,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> int size, err = -ENOMEM;
> int total_size = 0;
> int q_size = 0;
> + int dropcount;
> int nb = 0;
>
> rmem = atomic_read(&sk->sk_rmem_alloc);
> @@ -1746,6 +1747,8 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
> return 0;
>
> + dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ? sk_drops_read(sk) : 0;
> +
> spin_lock(&list->lock);
>
> ll_list = llist_del_all(&udp_prod_queue->ll_root);
> @@ -1769,7 +1772,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> /* no need to setup a destructor, we will explicitly release the
> * forward allocated memory on dequeue
> */
> - sock_skb_set_dropcount(sk, skb);
> + SOCK_SKB_CB(skb)->dropcount = dropcount;
> nb++;
> __skb_queue_tail(list, skb);
> }