Message-ID: <80b309fe-6ba0-4ca5-a0b7-b04485964f5d@linux.dev>
Date: Fri, 12 Sep 2025 10:29:39 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jordan Rife <jordan@...fe.io>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Stanislav Fomichev
<sdf@...ichev.me>, Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Aditi Ghag
<aditi.ghag@...valent.com>, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction
On 9/9/25 9:59 AM, Jordan Rife wrote:
> MOTIVATION
> ==========
> In Cilium we use SOCK_ADDR hooks (cgroup/connect4, cgroup/sendmsg4, ...)
> to do socket-level load balancing, translating service VIPs to real
> backend IPs. This is more efficient than per-packet service VIP
> translation, but there's a consequence: UDP sockets connected to a stale
> backend will keep trying to talk to it once it's gone instead of having
> traffic redirected to an active backend. To bridge this gap, we forcefully
> terminate such sockets from the control plane, forcing applications to
> recreate these sockets and start talking to an active backend. In the
> past, we've used netlink + sock_diag for this purpose, but have started
> using BPF socket iterators coupled with bpf_sock_destroy() in an effort
> to do most dataplane management in BPF and improve the efficiency of
> socket termination. bpf_sock_destroy() was introduced by Aditi for this
> very purpose in [1]. More recently, this kind of forceful socket
> destruction was extended to cover TCP sockets as well so that they more
> quickly receive a reset when the backend they're connected to goes away
> instead of relying on timeouts [2].
>
> When a backend goes away, the process to destroy all sockets connected
> to that backend looks roughly like this:
>
> for each network namespace:
> enter the network namespace
> create a socket iterator
> for each socket in the network namespace:
> run the iterator BPF program:
> if sk was connected to the backend:
> bpf_sock_destroy(sk)
>
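For context, a minimal sketch of what such an iterator program looks like
today (backend_addr/backend_port are hypothetical placeholders filled in from
userspace before the iterator is run):

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* kfunc used to forcefully close a socket; currently only callable from
 * BPF socket iterator programs. */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

/* Hypothetical placeholders for the stale backend, network byte order. */
const volatile __be32 backend_addr;
const volatile __be16 backend_port;

SEC("iter/udp")
int destroy_stale_udp(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct sock *sk = (struct sock *)udp_sk;

	if (!sk)
		return 0;

	/* Every socket in the netns is visited; only act on the ones
	 * connected to the dead backend. */
	if (sk->__sk_common.skc_daddr == backend_addr &&
	    sk->__sk_common.skc_dport == backend_port)
		bpf_sock_destroy((struct sock_common *)sk);

	return 0;
}

char _license[] SEC("license") = "GPL";
```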
> Clearly, this creates a lot of repeated work, and it became evident in
> scale tests with many sockets or frequent service backend churn that
> this approach won't scale well.
>
> For a simple illustration, I set up a scenario where there are one
> hundred different workloads each running in their own network namespace
> and observed the time it took to iterate through all namespaces and
> sockets to destroy a handful of connected sockets in those namespaces.
How many sockets were destroyed?
> I repeated this five times, each time increasing the number of sockets
> in the system's UDP hash by 10x using a script that creates lots of
> connected sockets.
>
> +---------+----------------+
> | Sockets | Iteration Time |
> +---------+----------------+
> |     100 | 6.35ms         |
> |    1000 | 4.03ms         |
> |   10000 | 20.0ms         |
> |  100000 | 103ms          |
> | 1000000 | 9.38s          |
> +---------+----------------+
> Namespaces = 100
> CPU = AMD Ryzen 9 9900X
>
> Iteration takes longer as more sockets are added. All the while, CPU
> utilization is high with `perf top` showing `bpf_iter_udp_batch` at the
> top:
>
> 70.58% [kernel] [k] bpf_iter_udp_batch
>
> Although this example uses UDP sockets, a similar trend should be
> present with TCP sockets and iterators as well. Even low numbers of
> sockets and sub-second times can be problematic in clusters with high
> churn or where a burst of backend deletions occurs.
For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to stop
the retry? RTO is not a per-packet event.
Are there a lot of connected UDP sockets left to iterate in production?
>
> This can be slightly improved by doing some extra bookkeeping that lets
> us skip certain namespaces that we know don't contain sockets connected
> to the backend, but in general we're boxed in by three limitations:
>
> 1. BPF socket iterators scan through every socket in the system's UDP or
> TCP socket hash tables to find those belonging to the current network
> namespace, since by default all namespaces share the same set of
> global tables. As the number of sockets in a system grows, more time
> will be spent filtering out unrelated sockets. You could use
> udp_child_hash_entries and tcp_child_ehash_entries to give each
I assume the sockets that need to be destroyed could be in different child
hashtables (i.e. in different netns) even if child_[e]hash is used?
> namespace its own table and avoid these noisy neighbor effects, but
> managing this automatically for each workload is tricky, uses more
> memory than necessary, and still doesn't avoid unnecessary filtering,
> because...
> 2. ...it's necessary to visit all sockets in a network namespace to find
> the one(s) you're looking for, since there's no predictable order in
> the system hash tables. Similar to the last point, this creates
> unnecessary work.
> 3. bpf_sock_destroy() only works from BPF socket iterator contexts
> currently.
>
> OVERVIEW
> ========
> It would be ideal if we could visit only the set of sockets we're
> interested in without lots of wasteful filtering. This patch series
> seeks to enable this with the following changes:
>
> * Making bpf_sock_destroy() work with BPF_MAP_TYPE_SOCKHASH map
> iterators.
> * Enabling control over bucketing behavior of BPF_MAP_TYPE_SOCKHASH to
> ensure that all sockets sharing the same key prefix are grouped in
> the same bucket.
> * Adding a key prefix filter to BPF_MAP_TYPE_SOCKHASH map iterators that
> limits iteration to only the bucket containing keys with the given
> prefix, and therefore, a single bucket.
> * A new sockops event, BPF_SOCK_OPS_UDP_CONNECTED_CB, that allows us to
> automatically insert connected UDP sockets into a
> BPF_MAP_TYPE_SOCKHASH in the same way
> BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB does for connect()ed TCP sockets.
>
> This gives us the means to maintain a socket index where we can
> efficiently retrieve and destroy the set of sockets sharing some common
> property, in our case the backend address, without any additional
> iteration or filtering.
>
> The basic idea looks like this:
>
> * `map_extra` may be used to specify the number of bytes from the key
> that a BPF_MAP_TYPE_SOCKHASH uses to determine a socket's hash bucket.
>
> ```
> struct sock_hash_key {
>         __u32 bucket_key;
>         __u64 cookie;
> } __packed;
>
> struct {
>         __uint(type, BPF_MAP_TYPE_SOCKHASH);
>         __uint(max_entries, 16);
>         __ulong(map_extra, offsetof(struct sock_hash_key, cookie));
>         __type(key, struct sock_hash_key);
>         __type(value, __u64);
> } sock_hash SEC(".maps");
> ```
>
> In this example, all keys sharing the same `bucket_key` would be
> bucketed together. In our case, `bucket_key` would be replaced with a
> backend ID or (destination address, port) tuple.
Before diving into the discussion of whether it is a good idea to add another
key to a bpf hashmap, it seems that a hashmap does not actually fit your use
case. A different data structure (or at least a different way of grouping sk)
is needed.

Have you considered using bpf_list_head/bpf_rb_root/bpf_arena? Potentially, the
sk could be stored as a __kptr, but I don't think that is supported yet, aside
from the considerations when the sk is closed, etc. However, such a structure
could store the numeric ip/port and then use the bpf_sk_lookup helper, which
can take a netns_id. Iteration could potentially be done in a sleepable
SEC("syscall") program run through test_prog_run, where lock_sock is allowed.
TCP sockops has a state change callback (i.e. for tracking TCP_CLOSE), but
connected UDP does not have one now.
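A rough sketch of the lookup side of that direction, just to illustrate the
call shape (the stored-tuple layout is made up, and bpf_sk_lookup_udp() is
shown from a tc program where it is available today; allowing an equivalent
lookup plus destroy from a sleepable SEC("syscall") program would be the new
part):

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Illustrative record of a connected UDP socket: its 4-tuple plus the id
 * of the netns it lives in, stored when the socket connects. */
struct stored_flow {
	__be32 backend_addr;	/* remote/backend side */
	__be32 local_addr;
	__be16 backend_port;
	__be16 local_port;
	__u64 netns_id;
};

SEC("tc")
int find_stale(struct __sk_buff *skb)
{
	struct stored_flow f = {};	/* would come from a map in practice */
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* The tuple is keyed like a received packet: saddr/sport is the
	 * remote (backend) side, daddr/dport the local side of the socket. */
	tuple.ipv4.saddr = f.backend_addr;
	tuple.ipv4.sport = f.backend_port;
	tuple.ipv4.daddr = f.local_addr;
	tuple.ipv4.dport = f.local_port;

	sk = bpf_sk_lookup_udp(skb, &tuple, sizeof(tuple.ipv4), f.netns_id, 0);
	if (sk) {
		/* Found the connected socket; a destroy would go here if it
		 * were allowed outside iterator context. */
		bpf_sk_release(sk);
	}

	return 0; /* TC_ACT_OK */
}

char _license[] SEC("license") = "GPL";
```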
> * `key_prefix` may be used to parametrize a BPF_MAP_TYPE_SOCKHASH map
> iterator so that it only visits the bucket matching that key prefix.
>
> ```
> union bpf_iter_link_info {
>         struct {
>                 __u32 map_fd;
>                 union {
>                         /* Parameters for socket hash iterators. */
>                         struct {
>                                 __aligned_u64 key_prefix;
>                                 __u32 key_prefix_len;
>                         } sock_hash;
>                 };
>         } map;
>         ...
> };
> ```
> * The contents of the BPF_MAP_TYPE_SOCKHASH are automatically managed
> using a sockops program that inserts connected TCP and UDP sockets
> into the map.
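For reference, a sketch of what that management program could look like
(BPF_SOCK_OPS_UDP_CONNECTED_CB is the callback proposed by this series and
does not exist today, the map_extra bucketing is likewise the proposed
behavior, and deriving bucket_key from the connect destination is left as a
placeholder):

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct sock_hash_key {
	__u32 bucket_key;
	__u64 cookie;
} __attribute__((packed));

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 16);
	/* Proposed in this series: hash only the first bytes of the key. */
	__ulong(map_extra, offsetof(struct sock_hash_key, cookie));
	__type(key, struct sock_hash_key);
	__type(value, __u64);
} sock_hash SEC(".maps");

SEC("sockops")
int index_sockets(struct bpf_sock_ops *ops)
{
	struct sock_hash_key key = {};

	switch (ops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:	/* connect()ed TCP */
	/* case BPF_SOCK_OPS_UDP_CONNECTED_CB: */	/* proposed, UDP connect() */
		/* Placeholder: in practice this would be a backend ID or a
		 * (destination address, port) tuple. */
		key.bucket_key = ops->remote_ip4;
		key.cookie = bpf_get_socket_cookie(ops);
		bpf_sock_hash_update(ops, &sock_hash, &key, BPF_NOEXIST);
		break;
	}

	return 1;
}

char _license[] SEC("license") = "GPL";
```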
>
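And on the userspace side, attaching the bucket-filtered iterator would
presumably look something like this (function and variable names are
hypothetical; the sock_hash fields of bpf_iter_link_info are the ones proposed
above, so this only builds against the patched headers):

```
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Attach a sock_hash iterator that only walks the bucket whose keys begin
 * with @bucket_key (e.g. a backend ID). */
static struct bpf_link *attach_backend_iter(struct bpf_program *prog,
					    int sock_hash_fd, __u32 bucket_key)
{
	LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};

	linfo.map.map_fd = sock_hash_fd;
	/* Fields proposed by this series: restrict iteration to the single
	 * bucket matching this key prefix. */
	linfo.map.sock_hash.key_prefix = (__u64)(unsigned long)&bucket_key;
	linfo.map.sock_hash.key_prefix_len = sizeof(bucket_key);

	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	return bpf_program__attach_iter(prog, &opts);
}
```

The iterator program attached through this link would then call
bpf_sock_destroy() on every socket it visits, turning the per-netns loop from
the top of the cover letter into a single-bucket walk.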