netdev - Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABi4-ogK1zaupzpRppGEdM0v+4BSJHbrC4Fg=j1zBSGLbkx1rQ@mail.gmail.com>
Date: Mon, 6 Oct 2025 13:19:14 -0700
From: Jordan Rife <jordan@...fe.io>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Stanislav Fomichev <sdf@...ichev.me>, Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, Aditi Ghag <aditi.ghag@...valent.com>, bpf@...r.kernel.org, 
	netdev@...r.kernel.org
Subject: Re: [RFC PATCH bpf-next 00/14] bpf: Efficient socket destruction

On Tue, Sep 30, 2025 at 3:51 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 9/21/25 9:03 AM, Jordan Rife wrote:
> > Hi Martin,
> >
> > Thanks for taking a look.
> >
> >> How many sockets were destroyed?
> >
> > Between 1 and 5 per trial IIRC during this test. Generally, there would
> > be a small set of sockets to destroy for a given backend relative to the
> > total number UDP/TCP sockets on a system.
> >
> >> For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to
> >> stop the retry? RTO is not a per packet event.
> >
> > To clarify, are you suggesting bpf_sock_destroy() from that context or
> > something else? If the former, bpf_sock_destroy() only works from socket
> > iterator contexts today, so that's one adjustment that would have to be
>
> Regarding how to abort, I was thinking something (simpler?) like having the
> BPF_SOCK_OPS_RTO_CB prog to enforce the "expired" logic in the
> tcp_write_timeout() by using the prog's return value. The caveat is the return
> value of the BPF_SOCK_OPS_RTO_CB prog is currently ignored, so it may
> potentially break existing use cases. However, I think only checking retval ==
> ETIMEOUT should be something reasonable. The retval can be set by
> bpf_set_retval(). I have only briefly looked at tcp_write_timeout, so please check.

After thinking about this some more, I like the idea of handling
socket destruction inside a callback like this, since it avoids the
extra memory and bookkeeping needed to maintain another data structure
as with the explicit iteration approach. I'm imagining the logic would
look something like this for the use case I have in mind:

switch (op) {
case BPF_SOCK_OPS_RTO_CB:
    /* backend_id is derived from the socket's daddr + dport */
    v = bpf_map_lookup_elem(&backends_map, &backend_id);
    if (!v) /* backend no longer exists */
        /* either bpf_sock_destroy or return -ETIMEOUT */
    break;
}

I'll have to think more about the use case and see if there's anything
I'm missing, but on the face of it this seems like it would work well
(at least for TCP). I have some doubts about the ETIMEOUT thing vs
just extending bpf_sock_destroy to work for both TCP and UDP contexts
(more on this below).

> The bpf_sock_destroy() may also work but a few things need to be
> considered/adjusted. Its tcp_send_active_reset() seems unnecessary during RTO.
> Maybe only tcp_done_with_err() is enough which seems to be a new kfunc itself.
> It also needs bh_lock_sock() which I am not sure it is true for all sock_ops
> callback. This could be tricky to filter out by the cb enum. Passing "struct
> bpf_sock_ops *" instead of "struct sock *" to a new kfunc seems not generic
> enough. It also has a tcp_set_state() call which will recur to the
> BPF_SOCK_OPS_STATE_CB prog. This can use more thought if the above "expired"
> idea in tcp_write_timeout() does not work out.

Yeah, this is a bit tricky. I'll have to think a bit more about how
this would work. The ETIMEOUT thing would work for TCP, but if I'm
trying to extend this to UDP sockets I think you may need an explicit
bpf_sock_destroy() call anyway? And if you're making
bpf_sock_destroy() work in that context then maybe supporting ETIMEOUT
is redundant?

> [ Unrelated, but in case it needs a new BPF_SOCK_OPS_*_CB enum. I would mostly
> freeze any new BPF_SOCK_OPS_*_CB addition and requiring to move the bpf_sock_ops
> to the struct_ops infrastructure first before adding new ops. ]

Thanks, I'll look into this. One aspect I'm uncertain about is
applying this kind of approach to UDP sockets. The BPF_SOCK_OPS_RTO_CB
callback provides a convenient place to handle this for TCP, but UDP
doesn't exactly have any timeouts where a similar callback makes
sense. Instead, you'd need to have something like a callback for UDP
that executes on every sendmsg call where you run some logic similar
to the code above. This is less ideal, since you need to do extra work
on every sendmsg call instead of just when a timeout occurs as with
BPF_SOCK_OPS_RTO_CB, but maybe the extra cost here would be
negligible. Combined, I imagine something like this:

switch (op) {
case BPF_SOCK_OPS_RTO_CB:
case BPF_SOCK_OPS_UDP_SENDMSG_CB:
    /* backend_id is derived from the socket's daddr + dport */
    v = bpf_map_lookup_elem(&backends_map, &backend_id);
    if (!v) { /* backend no longer exists */
        if (sockop == BPF_SOCK_OPS_RTO_CB)
            /* return -ETIMEOUT or maybe just bpf_sock_destroy? */
        else
            bpf_sock_destroy()
    }
    break;
}

> > made. It seems like this could work, but I'd have to think more about
> > how to mark certain sockets for destruction (possibly using socket
> > storage or some auxiliary map).
>
> The BPF_SOCK_OPS_RTO_CB should have the sk which then should have all 4 tuples
> for an established connection.
>
>
> >> Before diving into the discussion whether it is a good idea to add another
> >> key to a bpf hashmap, it seems that a hashmap does not actually fit your use
> >> case. A different data structure (or at least a different way of grouping
> >> sk) is needed. Have you considered using the
> >
> > If I were to design my ideal data structure for grouping sockets
> > (ignoring current BPF limitations), it would look quite similar to the
> > modified SOCK_HASH in this series. Really what would be ideal is
> > something more like a multihash where a single key maps to a set of
> > sockets, but that felt much too specific to this use case and doesn't
> > fit well within the BPF map paradigm. The modification to SOCK_HASH with
> > the key prefix stuff kind of achieves and felt like a good starting
> > point.
>
> imo, I don't think it justifies to cross this bridge only for sock_hash map and
> then later being copied to other bpf map like xsk/dev/cpumap...etc. Lets stay
> with the existing bpf map semantic. The bpf rb/list/arena is created for this.
> Lets try it and improve what is missing.

Yeah, makes sense.

>   >
> >> bpf_list_head/bpf_rb_root/bpf_arena? Potentially, the sk could be stored as
> >> a __kptr but I don't think it is supported yet, aside from considerations
> >> when sk is closed, etc. However, it can store the numeric ip/port and then
> >> use the bpf_sk_lookup helper, which can take netns_id. Iteration could
> >> potentially be done in a sleepable SEC("syscall") program in test_prog_run,
> >> where lock_sock is allowed. TCP sockops has a state change callback (i.e.
> >
> > You could create a data structure tailored for efficient iteration over
> > a group of ip/port pairs, although I'm not sure how you would acquire
> > the socket lock unless, e.g., bpf_sock_destroy or a sleepable variant
> > thereof acquires the lock itself in that context after the sk lookup?
> > E.g. (pseudocode):
> >
> > ...
> > for each (ip,port,ns) in my custom data structure:
> >      sk = bpf_sk_lookup_tcp(ip, port, ns)
> >      if (sk)
> >       bpf_sock_destroy_sleepable(sk) // acquires socket lock?
> > ...
> >
>
> The verifier could override the kfunc call to another function pointer but I am
> not sure we should complicate verifier further for this case, so adding a
> bpf_sock_destroy_sleepable() is fine, imo. Not sure about the naming though, may
> be bpf_sock_destroy_might_sleep()?

I think using socket callbacks like BPF_SOCK_OPS_RTO_CB would make for
a more elegant solution and wouldn't require as much bookkeeping,
provided we can work something similar out for UDP. This would make
any rb/list/arena manipulation and a sleepable variant of
bpf_sock_destroy unnecessary. Maybe it's worth exploring that
direction first?

Thanks!

Jordan