Message-ID: <df4c8852-f6d1-4278-84d8-441aad1f9994@linux.dev>
Date: Tue, 30 Sep 2025 15:51:26 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jordan Rife <jordan@...fe.io>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Stanislav Fomichev
<sdf@...ichev.me>, Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Aditi Ghag
<aditi.ghag@...valent.com>, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction
On 9/21/25 9:03 AM, Jordan Rife wrote:
> Hi Martin,
>
> Thanks for taking a look.
>
>> How many sockets were destroyed?
>
> Between 1 and 5 per trial IIRC during this test. Generally, there would
> be a small set of sockets to destroy for a given backend relative to the
> total number of UDP/TCP sockets on a system.
>
>> For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to
>> stop the retry? RTO is not a per packet event.
>
> To clarify, are you suggesting bpf_sock_destroy() from that context or
> something else? If the former, bpf_sock_destroy() only works from socket
> iterator contexts today, so that's one adjustment that would have to be
Regarding how to abort, I was thinking of something (simpler?) like having the
BPF_SOCK_OPS_RTO_CB prog enforce the "expired" logic in tcp_write_timeout()
through the prog's return value. The caveat is that the return value of the
BPF_SOCK_OPS_RTO_CB prog is currently ignored, so honoring it now could
potentially break existing use cases. However, I think checking only for
retval == -ETIMEDOUT should be reasonable. The retval can be set by
bpf_set_retval(). I have only briefly looked at tcp_write_timeout(), so please check.
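In case it helps make the idea concrete, a rough sketch of what such a prog could look like. This assumes tcp_write_timeout() were taught to honor a -ETIMEDOUT retval from the RTO callback, and that bpf_set_retval() were wired up for sockops progs in that path; neither is true today. The destroy_mark socket-storage map and the marking scheme are illustrative, not existing code:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define ETIMEDOUT 110

/* Hypothetical socket storage where an earlier prog flags sockets
 * slated for destruction. */
struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u8);
} destroy_mark SEC(".maps");

SEC("sockops")
int abort_on_rto(struct bpf_sock_ops *skops)
{
	__u8 *marked;

	if (skops->op != BPF_SOCK_OPS_RTO_CB || !skops->sk)
		return 1;

	marked = bpf_sk_storage_get(&destroy_mark, skops->sk, NULL, 0);
	if (marked && *marked)
		/* Only meaningful if tcp_write_timeout() starts checking
		 * this retval and treats -ETIMEDOUT as "connection
		 * expired". */
		bpf_set_retval(-ETIMEDOUT);

	return 1;
}

char _license[] SEC("license") = "GPL";
```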
The bpf_sock_destroy() may also work, but a few things need to be
considered/adjusted. Its tcp_send_active_reset() seems unnecessary during RTO.
Maybe tcp_done_with_err() alone is enough, which itself looks like a candidate
for a new kfunc. It also needs bh_lock_sock(), which I am not sure holds for
all sock_ops callbacks. This could be tricky to filter out by the cb enum.
Passing "struct bpf_sock_ops *" instead of "struct sock *" to a new kfunc
seems not generic enough. It also has a tcp_set_state() call which will recurse
into the BPF_SOCK_OPS_STATE_CB prog. This can use more thought if the above
"expired" idea in tcp_write_timeout() does not work out.
[ Unrelated, but in case this needs a new BPF_SOCK_OPS_*_CB enum: I would
mostly freeze any new BPF_SOCK_OPS_*_CB additions and require moving
bpf_sock_ops to the struct_ops infrastructure first before adding new ops. ]
> made. It seems like this could work, but I'd have to think more about
> how to mark certain sockets for destruction (possibly using socket
> storage or some auxiliary map).
The BPF_SOCK_OPS_RTO_CB prog should have the sk, which has the full 4-tuple
for an established connection.
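For reference, struct bpf_sock_ops already exposes the tuple fields directly, so matching a connection in the RTO callback could look like the sketch below. The backend address constants are made up for illustration:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative backend to match; real code would look these up in a map. */
#define BACKEND_IP4  0x0a000001 /* 10.0.0.1 in host byte order */
#define BACKEND_PORT 8080

SEC("sockops")
int match_rto_tuple(struct bpf_sock_ops *skops)
{
	if (skops->op != BPF_SOCK_OPS_RTO_CB)
		return 1;

	/* Per the uapi comments, remote_ip4 and remote_port are stored in
	 * network byte order; local_port is host byte order. */
	if (skops->remote_ip4 == bpf_htonl(BACKEND_IP4) &&
	    bpf_ntohl(skops->remote_port) == BACKEND_PORT) {
		/* This established connection targets the backend;
		 * e.g. mark it for destruction here. */
	}

	return 1;
}

char _license[] SEC("license") = "GPL";
```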
>> Before diving into the discussion whether it is a good idea to add another
>> key to a bpf hashmap, it seems that a hashmap does not actually fit your use
>> case. A different data structure (or at least a different way of grouping
>> sk) is needed. Have you considered using the
>
> If I were to design my ideal data structure for grouping sockets
> (ignoring current BPF limitations), it would look quite similar to the
> modified SOCK_HASH in this series. Really what would be ideal is
> something more like a multihash where a single key maps to a set of
> sockets, but that felt much too specific to this use case and doesn't
> fit well within the BPF map paradigm. The modification to SOCK_HASH with
> the key prefix stuff kind of achieves this and felt like a good starting
> point.
imo, I don't think it is justified to cross this bridge only for the sock_hash
map, only for it to later be copied to other bpf maps like xsk/dev/cpumap,
etc. Let's stay with the existing bpf map semantics. The bpf rb/list/arena was
created for this. Let's try it and improve what is missing.
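As a sketch of that direction, the per-backend grouping could be a bpf_list stored in a map value, holding numeric ip/port entries rather than sk pointers. The __contains annotation comes from bpf_experimental.h in the selftests; all struct and map names below are made up:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

/* One (ip, port, netns) entry per connection belonging to a backend. */
struct conn_elem {
	__u32 ip;	/* network byte order */
	__u16 port;	/* network byte order */
	__u32 netns_id;
	struct bpf_list_node node;
};

/* Map value: a locked bpf_list of connections, grouped per backend. */
struct backend_conns {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(conn_elem, node);
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);	/* backend id */
	__type(value, struct backend_conns);
} backend_map SEC(".maps");
```

Entries would be allocated with bpf_obj_new() and pushed under the spin lock, and iteration over one backend's list only visits that backend's connections.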
>
>> bpf_list_head/bpf_rb_root/bpf_arena? Potentially, the sk could be stored as
>> a __kptr but I don't think it is supported yet, aside from considerations
>> when sk is closed, etc. However, it can store the numeric ip/port and then
>> use the bpf_sk_lookup helper, which can take netns_id. Iteration could
>> potentially be done in a sleepable SEC("syscall") program in test_prog_run,
>> where lock_sock is allowed. TCP sockops has a state change callback (i.e.
>
> You could create a data structure tailored for efficient iteration over
> a group of ip/port pairs, although I'm not sure how you would acquire
> the socket lock unless, e.g., bpf_sock_destroy or a sleepable variant
> thereof acquires the lock itself in that context after the sk lookup?
> E.g. (pseudocode):
>
> ...
> for each (ip,port,ns) in my custom data structure:
> sk = bpf_sk_lookup_tcp(ip, port, ns)
> if (sk)
> bpf_sock_destroy_sleepable(sk) // acquires socket lock?
> ...
>
The verifier could override the kfunc call with another function pointer, but I
am not sure we should complicate the verifier further for this case, so adding
a bpf_sock_destroy_sleepable() is fine, imo. Not sure about the naming though;
maybe bpf_sock_destroy_might_sleep()?
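Pulling the pieces together, the sleepable iteration from the pseudocode above might look roughly like this. bpf_sk_lookup_tcp() and bpf_sk_release() exist today; bpf_sock_destroy_might_sleep() is the hypothetical new kfunc being discussed, the destroy_args struct is illustrative, and availability of the lookup helper in SEC("syscall") progs is assumed:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Hypothetical args passed in from userspace via BPF_PROG_RUN. */
struct destroy_args {
	__u32 ip;	/* network byte order */
	__u16 port;	/* network byte order */
	__u64 netns_id;
};

/* Hypothetical sleepable kfunc; does not exist today. */
extern void bpf_sock_destroy_might_sleep(struct bpf_sock *sk) __ksym;

SEC("syscall")
int destroy_backend_sock(struct destroy_args *args)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	tuple.ipv4.daddr = args->ip;
	tuple.ipv4.dport = args->port;

	/* A real impl would loop over the stored (ip, port, ns) entries. */
	sk = bpf_sk_lookup_tcp(args, &tuple, sizeof(tuple.ipv4),
			       args->netns_id, 0);
	if (sk) {
		bpf_sock_destroy_might_sleep(sk);	/* hypothetical */
		bpf_sk_release(sk);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
```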