Message-ID: <ilrnfpmoawkbsz2qnyne7haznfjxek4oqeyl7x5cmtds5sdvxe@dy6fs3ej4rbr>
Date: Sun, 21 Sep 2025 09:03:35 -0700
From: Jordan Rife <jordan@...fe.io>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Stanislav Fomichev <sdf@...ichev.me>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>, Kuniyuki Iwashima <kuniyu@...gle.com>,
Aditi Ghag <aditi.ghag@...valent.com>, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction
Hi Martin,
Thanks for taking a look.
> How many sockets were destroyed?
Between 1 and 5 per trial during this test, IIRC. Generally, there would
be a small set of sockets to destroy for a given backend relative to the
total number of UDP/TCP sockets on a system.
> For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to
> stop the retry? RTO is not a per packet event.
To clarify, are you suggesting bpf_sock_destroy() from that context or
something else? If the former, bpf_sock_destroy() only works from socket
iterator contexts today, so that's one adjustment that would have to be
made. It seems like this could work, but I'd have to think more about
how to mark certain sockets for destruction (possibly using socket
storage or some auxiliary map).
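For concreteness, here's a rough sketch of what I'm imagining. This is
purely illustrative: the destroy_mark map is made up, and it assumes the
iterator-only restriction on bpf_sock_destroy() were relaxed for
sockops, which the verifier rejects today:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_SK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, int);     /* nonzero = marked for destruction */
} destroy_mark SEC(".maps");

extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("sockops")
int abort_on_rto(struct bpf_sock_ops *skops)
{
        int *mark;

        /* Assumes BPF_SOCK_OPS_RTO_CB_FLAG was set earlier via
         * bpf_sock_ops_cb_flags_set(), e.g. when the connection was
         * established.
         */
        if (skops->op != BPF_SOCK_OPS_RTO_CB || !skops->sk)
                return 1;

        mark = bpf_sk_storage_get(&destroy_mark, skops->sk, NULL, 0);
        if (mark && *mark)
                /* Hypothetical: the verifier only allows this call
                 * from socket iterator programs today.
                 */
                bpf_sock_destroy((struct sock_common *)skops->sk);

        return 1;
}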
> Does it have a lot of UDP connected sockets left to iterate in production?
It's hard to say for certain (my perspective is as a cloud provider,
but I've seen customers do strange things). It seems unlikely anyone is
creating, e.g., 1M UDP sockets on the same host. In practice, TCP would
be more of a concern. Still, it would be nice to have a more efficient
means of destroying a small set of sockets than doing full UDP/TCP hash
traversals.
> I assume the sockets that need to be destroyed could be in different child
> hashtables (i.e. in different netns) even child_[e]hash is used?
Correct. You would have to do a hash traversal in every namespace that
contains at least one connection to a given backend. This might hurt or
help depending on the use case and on how sparse the hashes are, but it
could cut down on visiting and filtering out sockets from other
namespaces.
> Before diving into the discussion whether it is a good idea to add another
> key to a bpf hashmap, it seems that a hashmap does not actually fit your use
> case. A different data structure (or at least a different way of grouping
> sk) is needed. Have you considered using the
If I were to design my ideal data structure for grouping sockets
(ignoring current BPF limitations), it would look quite similar to the
modified SOCK_HASH in this series. What would really be ideal is
something more like a multihash, where a single key maps to a set of
sockets, but that felt much too specific to this use case and doesn't
fit well within the BPF map paradigm. The modification to SOCK_HASH with
the key prefix kind of achieves this and felt like a good starting
point.
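For illustration, the kind of key layout I mean (field names are made
up, not the actual layout from the series):

/* The fixed-size prefix identifies the backend and groups sockets;
 * the cookie suffix keeps each key unique. A prefix-based lookup or
 * delete can then cover the whole group in one operation.
 */
struct sock_group_key {
        __be32 backend_ip;      /* prefix: shared by the whole group */
        __be16 backend_port;
        __u16  pad;
        __u64  cookie;          /* suffix: e.g. bpf_get_socket_cookie() */
};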
> bpf_list_head/bpf_rb_root/bpf_arena? Potentially, the sk could be stored as
> a __kptr but I don't think it is supported yet, aside from considerations
> when sk is closed, etc. However, it can store the numeric ip/port and then
> use the bpf_sk_lookup helper, which can take netns_id. Iteration could
> potentially be done in a sleepable SEC("syscall") program in test_prog_run,
> where lock_sock is allowed. TCP sockops has a state change callback (i.e.
You could create a data structure tailored for efficient iteration over
a group of ip/port pairs, although I'm not sure how you would acquire
the socket lock unless, e.g., bpf_sock_destroy or a sleepable variant
thereof acquires the lock itself in that context after the sk lookup?
E.g. (pseudocode):

...
for each (ip, port, ns) in my custom data structure:
        sk = bpf_sk_lookup_tcp(ip, port, ns)
        if (sk)
                bpf_sock_destroy_sleepable(sk) // acquires socket lock?
...
Or maybe just mark the socket for destruction in the test_prog_run
program (sock storage?) and later call bpf_sock_destroy in a sockops
context next time the socket is used.
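The marking side might look roughly like this (struct mark_args is a
made-up ctx struct passed in via test_prog_run's ctx_in, destroy_mark is
the sk storage map from the sketch above, and I haven't verified that
bpf_sk_lookup_tcp() is callable from SEC("syscall") today):

/* Made-up context, supplied through bpf_prog_test_run()'s ctx_in. */
struct mark_args {
        __be32 ip;
        __be16 port;
        __u64  netns_id;
};

SEC("syscall")
int mark_backend_socks(struct mark_args *args)
{
        struct bpf_sock_tuple tuple = {
                .ipv4.daddr = args->ip,
                .ipv4.dport = args->port,
        };
        struct bpf_sock *sk;
        int one = 1;

        sk = bpf_sk_lookup_tcp(args, &tuple, sizeof(tuple.ipv4),
                               args->netns_id, 0);
        if (!sk)
                return 0;

        /* Set the flag that the sockops program checks on the next
         * socket event (e.g. RTO).
         */
        bpf_sk_storage_get(&destroy_mark, sk, &one,
                           BPF_SK_STORAGE_GET_F_CREATE);
        bpf_sk_release(sk);
        return 0;
}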
Either way, I think the constraints around which contexts
bpf_sock_destroy() can be called from might need to be relaxed.
> for tracking TCP_CLOSE) but connected udp does not have it now.
Overall, the SOCK_HASH changes felt like a natural starting point, but
I'm happy to discuss some alternatives. I like the idea of being able to
combine bpf_rb_root/bpf_arena + bpf_sk_lookup + bpf_sock_destroy, and it
seems like an interesting direction to explore.
Thanks again, I really appreciate the input.
Jordan