Message-ID: <df4c8852-f6d1-4278-84d8-441aad1f9994@linux.dev>
Date: Tue, 30 Sep 2025 15:51:26 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jordan Rife <jordan@...fe.io>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Stanislav Fomichev
<sdf@...ichev.me>, Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Aditi Ghag
<aditi.ghag@...valent.com>, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction
On 9/21/25 9:03 AM, Jordan Rife wrote:
> Hi Martin,
>
> Thanks for taking a look.
>
>> How many sockets were destroyed?
>
> Between 1 and 5 per trial IIRC during this test. Generally, there would
> be a small set of sockets to destroy for a given backend relative to the
> total number of UDP/TCP sockets on a system.
>
>> For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to
>> stop the retry? RTO is not a per packet event.
>
> To clarify, are you suggesting bpf_sock_destroy() from that context or
> something else? If the former, bpf_sock_destroy() only works from socket
> iterator contexts today, so that's one adjustment that would have to be
Regarding how to abort, I was thinking of something (simpler?) like having the
BPF_SOCK_OPS_RTO_CB prog enforce the "expired" logic in tcp_write_timeout()
through the prog's return value. The caveat is that the return value of the
BPF_SOCK_OPS_RTO_CB prog is currently ignored, so honoring it now could
potentially break existing use cases. However, I think checking only for
retval == -ETIMEDOUT should be reasonable. The retval can be set by
bpf_set_retval(). I have only briefly looked at tcp_write_timeout(), so please check.
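In case it helps make the idea concrete, a rough sketch of what such a prog could look like. This assumes tcp_write_timeout() were taught to honor a -ETIMEDOUT retval from the RTO callback, and that bpf_set_retval() were wired up for sockops progs in that path; neither is true today. The destroy_mark socket-storage map and the marking scheme are illustrative, not existing code:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define ETIMEDOUT 110

/* Hypothetical socket storage where an earlier prog flags sockets
 * slated for destruction. */
struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u8);
} destroy_mark SEC(".maps");

SEC("sockops")
int abort_on_rto(struct bpf_sock_ops *skops)
{
	__u8 *marked;

	if (skops->op != BPF_SOCK_OPS_RTO_CB || !skops->sk)
		return 1;

	marked = bpf_sk_storage_get(&destroy_mark, skops->sk, NULL, 0);
	if (marked && *marked)
		/* Only meaningful if tcp_write_timeout() starts checking
		 * this retval and treats -ETIMEDOUT as "connection
		 * expired". */
		bpf_set_retval(-ETIMEDOUT);

	return 1;
}

char _license[] SEC("license") = "GPL";
```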
The bpf_sock_destroy() may also work, but a few things need to be
considered/adjusted. Its tcp_send_active_reset() seems unnecessary during RTO.
Maybe tcp_done_with_err() alone is enough, which itself looks like a candidate
for a new kfunc. It also needs bh_lock_sock(), which I am not sure holds for
all sock_ops callbacks. This could be tricky to filter out by the cb enum.
Passing "struct bpf_sock_ops *" instead of "struct sock *" to a new kfunc
seems not generic enough. It also has a tcp_set_state() call which will recurse
into the BPF_SOCK_OPS_STATE_CB prog. This can use more thought if the above
"expired" idea in tcp_write_timeout() does not work out.
[ Unrelated, but in case this needs a new BPF_SOCK_OPS_*_CB enum: I would
mostly freeze any new BPF_SOCK_OPS_*_CB additions and require moving
bpf_sock_ops to the struct_ops infrastructure first before adding new ops. ]
> made. It seems like this could work, but I'd have to think more about
> how to mark certain sockets for destruction (possibly using socket
> storage or some auxiliary map).
The BPF_SOCK_OPS_RTO_CB prog should have the sk, which has the full 4-tuple
for an established connection.
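For reference, struct bpf_sock_ops already exposes the tuple fields directly, so matching a connection in the RTO callback could look like the sketch below. The backend address constants are made up for illustration:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative backend to match; real code would look these up in a map. */
#define BACKEND_IP4  0x0a000001 /* 10.0.0.1 in host byte order */
#define BACKEND_PORT 8080

SEC("sockops")
int match_rto_tuple(struct bpf_sock_ops *skops)
{
	if (skops->op != BPF_SOCK_OPS_RTO_CB)
		return 1;

	/* Per the uapi comments, remote_ip4 and remote_port are stored in
	 * network byte order; local_port is host byte order. */
	if (skops->remote_ip4 == bpf_htonl(BACKEND_IP4) &&
	    bpf_ntohl(skops->remote_port) == BACKEND_PORT) {
		/* This established connection targets the backend;
		 * e.g. mark it for destruction here. */
	}

	return 1;
}

char _license[] SEC("license") = "GPL";
```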
>> Before diving into the discussion whether it is a good idea to add another
>> key to a bpf hashmap, it seems that a hashmap does not actually fit your use
>> case. A different data structure (or at least a different way of grouping
>> sk) is needed. Have you considered using the
>
> If I were to design my ideal data structure for grouping sockets
> (ignoring current BPF limitations), it would look quite similar to the
> modified SOCK_HASH in this series. Really what would be ideal is
> something more like a multihash where a single key maps to a set of
> sockets, but that felt much too specific to this use case and doesn't
> fit well within the BPF map paradigm. The modification to SOCK_HASH with
> the key prefix stuff kind of achieves this and felt like a good starting
> point.
imo, I don't think it is justified to cross this bridge only for the sock_hash
map, only for it to later be copied to other bpf maps like xsk/dev/cpumap,
etc. Let's stay with the existing bpf map semantics. The bpf rb/list/arena was
created for this. Let's try it and improve what is missing.
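As a sketch of that direction, the per-backend grouping could be a bpf_list stored in a map value, holding numeric ip/port entries rather than sk pointers. The __contains annotation comes from bpf_experimental.h in the selftests; all struct and map names below are made up:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

/* One (ip, port, netns) entry per connection belonging to a backend. */
struct conn_elem {
	__u32 ip;	/* network byte order */
	__u16 port;	/* network byte order */
	__u32 netns_id;
	struct bpf_list_node node;
};

/* Map value: a locked bpf_list of connections, grouped per backend. */
struct backend_conns {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(conn_elem, node);
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);	/* backend id */
	__type(value, struct backend_conns);
} backend_map SEC(".maps");
```

Entries would be allocated with bpf_obj_new() and pushed under the spin lock, and iteration over one backend's list only visits that backend's connections.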
>
>> bpf_list_head/bpf_rb_root/bpf_arena? Potentially, the sk could be stored as
>> a __kptr but I don't think it is supported yet, aside from considerations
>> when sk is closed, etc. However, it can store the numeric ip/port and then
>> use the bpf_sk_lookup helper, which can take netns_id. Iteration could
>> potentially be done in a sleepable SEC("syscall") program in test_prog_run,
>> where lock_sock is allowed. TCP sockops has a state change callback (i.e.
>
> You could create a data structure tailored for efficient iteration over
> a group of ip/port pairs, although I'm not sure how you would acquire
> the socket lock unless, e.g., bpf_sock_destroy or a sleepable variant
> thereof acquires the lock itself in that context after the sk lookup?
> E.g. (pseudocode):
>
> ...
> for each (ip,port,ns) in my custom data structure:
> sk = bpf_sk_lookup_tcp(ip, port, ns)
> if (sk)
> bpf_sock_destroy_sleepable(sk) // acquires socket lock?
> ...
>
The verifier could override the kfunc call with another function pointer, but I
am not sure we should complicate the verifier further for this case, so adding
a bpf_sock_destroy_sleepable() is fine, imo. Not sure about the naming though;
maybe bpf_sock_destroy_might_sleep()?
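Pulling the pieces together, the sleepable iteration from the pseudocode above might look roughly like this. bpf_sk_lookup_tcp() and bpf_sk_release() exist today; bpf_sock_destroy_might_sleep() is the hypothetical new kfunc being discussed, the destroy_args struct is illustrative, and availability of the lookup helper in SEC("syscall") progs is assumed:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Hypothetical args passed in from userspace via BPF_PROG_RUN. */
struct destroy_args {
	__u32 ip;	/* network byte order */
	__u16 port;	/* network byte order */
	__u64 netns_id;
};

/* Hypothetical sleepable kfunc; does not exist today. */
extern void bpf_sock_destroy_might_sleep(struct bpf_sock *sk) __ksym;

SEC("syscall")
int destroy_backend_sock(struct destroy_args *args)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	tuple.ipv4.daddr = args->ip;
	tuple.ipv4.dport = args->port;

	/* A real impl would loop over the stored (ip, port, ns) entries. */
	sk = bpf_sk_lookup_tcp(args, &tuple, sizeof(tuple.ipv4),
			       args->netns_id, 0);
	if (sk) {
		bpf_sock_destroy_might_sleep(sk);	/* hypothetical */
		bpf_sk_release(sk);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
```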