Message-ID: <CANn89iKcye6Zsij4=jQ2V9ofbCwRB45HPJUdn7YbFQU1TmQVbw@mail.gmail.com>
Date: Mon, 22 Sep 2025 02:34:41 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Simon Horman <horms@...nel.org>, Willem de Bruijn <willemb@...gle.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
On Mon, Sep 22, 2025 at 1:47 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > Hi,
> >
> > On 9/21/25 11:58 AM, Eric Dumazet wrote:
> > > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > {
> > > struct sk_buff_head *list = &sk->sk_receive_queue;
> > > + struct udp_prod_queue *udp_prod_queue;
> > > + struct sk_buff *next, *to_drop = NULL;
> > > + struct llist_node *ll_list;
> > > unsigned int rmem, rcvbuf;
> > > - spinlock_t *busy = NULL;
> > > int size, err = -ENOMEM;
> > > + int total_size = 0;
> > > + int q_size = 0;
> > > + int nb = 0;
> > >
> > > rmem = atomic_read(&sk->sk_rmem_alloc);
> > > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > > size = skb->truesize;
> > >
> > > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> > > +
> > > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > > +
> > > /* Immediately drop when the receive queue is full.
> > > * Cast to unsigned int performs the boundary check for INT_MAX.
> > > */
> >
> > Double-checking I'm reading the code correctly... AFAICS the rcvbuf size
> > check is now only per NUMA node, which means that each node can now add
> > at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
> > correct?
>
> This is a transient condition. In my tests with 6 NUMA nodes pushing
> packets very hard, I was not able to see a significant bump of
> sk_rmem_alloc (over sk_rcvbuf).
>
>
> >
> > What if the user-space process never reads the packets (or is very
> > slow)? I'm under the impression the max rcvbuf occupation will be
> > limited only by the memory accounting?!? (and not by sk_rcvbuf)
>
> Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
> further incoming packets are dropped.
>
> As you said, memory accounting is there.
>
> This could matter if we had thousands of UDP sockets under flood at
> the same time, but that would require thousands of cpus and/or NIC
> rx queues.
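
To illustrate why the overshoot stays transient: whichever cpu grabs
sk_receive_queue.lock drains the per-node llist and moves its bytes back
under sk_rmem_alloc. Roughly like this (a sketch, not the exact patch
code; the ll_root/ll_node member names are illustrative):

static void udp_drain_prod_queue(struct sock *sk,
				 struct udp_prod_queue *queue)
{
	struct sk_buff *skb, *next;
	struct llist_node *ll_list;
	int total_size = 0;

	/* Take everything producers posted from this NUMA node. */
	ll_list = llist_del_all(&queue->ll_root);
	if (!ll_list)
		return;

	/* llist_add() builds a LIFO chain; restore arrival order. */
	ll_list = llist_reverse_order(ll_list);

	/* Caller holds sk_receive_queue.lock. */
	llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
		total_size += skb->truesize;
		__skb_queue_tail(&sk->sk_receive_queue, skb);
	}

	/* Move the accounting from the per-node queue to the socket, so
	 * the sk_rcvbuf check sees these bytes in sk_rmem_alloc again and
	 * the overshoot cannot keep growing.
	 */
	atomic_sub(total_size, &queue->rmem_alloc);
	atomic_add(total_size, &sk->sk_rmem_alloc);
}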
>
>
>
> >
> > Side note: I'm wondering if we could avoid the numa queue for connected
> > sockets? With early demux, and no nft/bridge in between, the path from
> > NIC to socket should be pretty fast, and the additional queuing could
> > be visible?
>
> I tried this last week and got no difference in performance on my test machines.
>
> I can retry this and give you precise numbers before sending V4.
I did my experiment again.
There is very little difference (1 or 2 %, but I would need many runs
to confirm it).
Also, loopback traffic would be unprotected: only RSS on a physical NIC
would properly steer all of a flow's packets to a single cpu.
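
For the record, the bypass I tested was along these lines (rough sketch,
not the exact diff; the helper name is made up), with
__udp_enqueue_schedule_skb() taking sk_receive_queue.lock directly when
it returns false:

static bool udp_use_prod_queue(const struct sock *sk)
{
	/* Connected UDP sockets sit in TCP_ESTABLISHED state; with early
	 * demux and RSS their packets normally arrive on a single cpu,
	 * so the lock-free per-NUMA indirection buys them little.
	 */
	return sk->sk_state != TCP_ESTABLISHED;
}
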
Looking at the cpu performance profiles:

Always using the per-numa queue:
  16.06%  [kernel]  [k] skb_release_data
  10.59%  [kernel]  [k] dev_gro_receive
   9.50%  [kernel]  [k] idpf_rx_process_skb_fields
   5.37%  [kernel]  [k] __udp_enqueue_schedule_skb
   3.93%  [kernel]  [k] net_rx_action
   2.93%  [kernel]  [k] ip6t_do_table
   2.84%  [kernel]  [k] napi_alloc_skb
   2.41%  [kernel]  [k] queued_spin_lock_slowpath
   2.01%  [kernel]  [k] __netif_receive_skb_core
   1.95%  [kernel]  [k] idpf_vport_splitq_napi_poll
   1.90%  [kernel]  [k] __memcpy
   1.88%  [kernel]  [k] napi_gro_receive
   1.86%  [kernel]  [k] kmem_cache_alloc_bulk_noprof
   1.50%  [kernel]  [k] napi_consume_skb
   1.28%  [kernel]  [k] sock_def_readable
   1.00%  [kernel]  [k] llist_add_batch
   0.93%  [kernel]  [k] ip6_rcv_core
   0.91%  [kernel]  [k] call_function_single_prep_ipi
   0.82%  [kernel]  [k] ipv6_gro_receive
   0.81%  [kernel]  [k] ip6_protocol_deliver_rcu
   0.81%  [kernel]  [k] fib6_node_lookup
   0.79%  [kernel]  [k] ip6_route_input
   0.78%  [kernel]  [k] eth_type_trans
   0.75%  [kernel]  [k] ip6_sublist_rcv
   0.75%  [kernel]  [k] __try_to_wake_up
   0.73%  [kernel]  [k] udp6_csum_init
   0.70%  [kernel]  [k] _raw_spin_lock
   0.70%  [kernel]  [k] __wake_up_common_lock
   0.69%  [kernel]  [k] read_tsc
   0.62%  [kernel]  [k] ttwu_queue_wakelist
   0.58%  [kernel]  [k] udp6_gro_receive
   0.57%  [kernel]  [k] netif_receive_skb_list_internal
   0.55%  [kernel]  [k] llist_reverse_order
   0.53%  [kernel]  [k] available_idle_cpu
   0.52%  [kernel]  [k] sched_clock_noinstr
   0.51%  [kernel]  [k] __sk_mem_raise_allocated

Avoiding it for connected sockets:
  14.75%  [kernel]  [k] skb_release_data
  10.76%  [kernel]  [k] dev_gro_receive
   9.48%  [kernel]  [k] idpf_rx_process_skb_fields
   4.29%  [kernel]  [k] __udp_enqueue_schedule_skb
   4.02%  [kernel]  [k] net_rx_action
   3.17%  [kernel]  [k] ip6t_do_table
   2.55%  [kernel]  [k] napi_alloc_skb
   2.20%  [kernel]  [k] __memcpy
   2.04%  [kernel]  [k] queued_spin_lock_slowpath
   1.99%  [kernel]  [k] __netif_receive_skb_core
   1.98%  [kernel]  [k] kmem_cache_alloc_bulk_noprof
   1.76%  [kernel]  [k] napi_gro_receive
   1.74%  [kernel]  [k] idpf_vport_splitq_napi_poll
   1.55%  [kernel]  [k] napi_consume_skb
   1.36%  [kernel]  [k] sock_def_readable
   1.18%  [kernel]  [k] llist_add_batch
   1.04%  [kernel]  [k] udp6_csum_init
   0.92%  [kernel]  [k] fib6_node_lookup
   0.92%  [kernel]  [k] _raw_spin_lock
   0.91%  [kernel]  [k] call_function_single_prep_ipi
   0.88%  [kernel]  [k] ip6_rcv_core
   0.86%  [kernel]  [k] ip6_protocol_deliver_rcu
   0.84%  [kernel]  [k] __try_to_wake_up
   0.81%  [kernel]  [k] ip6_route_input
   0.80%  [kernel]  [k] ipv6_gro_receive
   0.75%  [kernel]  [k] __skb_flow_dissect
   0.70%  [kernel]  [k] read_tsc
   0.70%  [kernel]  [k] ttwu_queue_wakelist
   0.69%  [kernel]  [k] eth_type_trans
   0.69%  [kernel]  [k] __wake_up_common_lock
   0.64%  [kernel]  [k] sched_clock_noinstr

I guess we have bigger fish to fry ;)