Message-ID: <384c31a4-f0d7-449b-a7a4-2994f936d049@linux.dev>
Date: Mon, 17 Mar 2025 15:06:11 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jordan Rife <jrife@...gle.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org,
Daniel Borkmann <daniel@...earbox.net>,
Yonghong Song <yonghong.song@...ux.dev>,
Aditi Ghag <aditi.ghag@...valent.com>
Subject: Re: [RFC PATCH bpf-next 0/3] Avoid skipping sockets with socket
iterators
On 3/13/25 4:35 PM, Jordan Rife wrote:
> I was recently looking into using BPF socket iterators in conjunction
> with the bpf_sock_destroy() kfunc as a means to forcefully destroy a
> set of UDP sockets connected to a deleted backend [1]. Aditi describes
> the scenario in [2], the patch series that introduced bpf_sock_destroy()
> for this very purpose:
>
>> This patch set adds the capability to destroy sockets in BPF. We plan
>> to use the capability in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are deleted. The other use
>> case is on-the-fly policy enforcement where existing socket
>> connections prevented by policies need to be terminated.
>
> One would want and expect an iterator to visit every socket that existed
> before the iterator was created, if not exactly once, then at least
> once, otherwise we could accidentally skip a socket that we intended to
> destroy. With the iterator implementation as it exists today, this is
> the behavior you would observe in the vast majority of cases.
>
> However, in the process of reviewing [2] and some follow up fixes to
> bpf_iter_udp_batch() ([3] [4]) by Martin, it occurred to me that there
> are situations where BPF socket iterators may repeat, or worse, skip
> sockets altogether even if they existed prior to iterator creation,
> making BPF iterators possibly buggy as a mechanism for achieving the
> goal stated above.
>
> Admittedly, this is probably an edge case of an edge case, but I had
> been pondering a few ways to close this gap. This RFC highlights
> some of these scenarios, extending prog_tests/sock_iter_batch.c to
> illustrate conditions under which sockets can be skipped or repeated,
> and proposes a possible improvement to iterator progress tracking to
> achieve exactly-once semantics even in the face of concurrent changes
> to the iterator's current bucket.
>
> THE PROBLEM
> ===========
> Both UDP and TCP socket iterators use iter->offset to track progress
> through a bucket, which is a measure of the number of matching sockets
> from the current bucket that have been seen or processed by the
> iterator. However, iter->offset isn't always an accurate measure of
> "things already seen" when the underlying bucket changes between reads.
>
> Scenario 1: Skip A Socket
> +------+--------------------+--------------+---------------+
> | Time | Event              | Bucket State | Bucket Offset |
> +------+--------------------+--------------+---------------+
> | 1    | read(iter_fd) -> A | A->B->C->D   | 1             |
> | 2    | close(A)           | B->C->D      | 1             |
> | 3    | read(iter_fd) -> C | B->C->D      | 2             |
> | 4    | read(iter_fd) -> D | B->C->D      | 3             |
> | 5    | read(iter_fd) -> 0 | B->C->D      | -             |
> +------+--------------------+--------------+---------------+
>
> Iteration sees these sockets: [A, C, D]
> B is skipped.
>
> Scenario 2: Repeat A Socket
> +------+--------------------+---------------+---------------+
> | Time | Event              | Bucket State  | Bucket Offset |
> +------+--------------------+---------------+---------------+
> | 1    | read(iter_fd) -> A | A->B->C->D    | 1             |
> | 2    | connect(E)         | E->A->B->C->D | 1             |
> | 3    | read(iter_fd) -> A | E->A->B->C->D | 2             |
> | 4    | read(iter_fd) -> B | E->A->B->C->D | 3             |
> | 5    | read(iter_fd) -> C | E->A->B->C->D | 4             |
> | 6    | read(iter_fd) -> D | E->A->B->C->D | 5             |
> | 7    | read(iter_fd) -> 0 | E->A->B->C->D | -             |
> +------+--------------------+---------------+---------------+
>
> Iteration sees these sockets: [A, A, B, C, D]
> A is repeated.
>
> PROPOSAL
> ========
> If we ignore the possibility of signed 64 bit rollover*, then we can
> achieve exactly-once semantics. This series replaces the current
> offset-based scheme used for progress tracking with a scheme based on a
> monotonically increasing version number. It works as follows:
>
> * Assign index numbers to sockets in the bucket's linked list such that
>   they are monotonically increasing as you read from head to tail.
>
> * Every time a socket is added to a bucket, increment the hash
>   table's version number, ver.
>   * If the socket is being added to the head of the bucket's linked
>     list, set sk->idx to -1*ver.
>   * If the socket is being added to the tail of the bucket's linked
>     list, set sk->idx to ver.
>
>   Ex: append_head(C), append_head(B), append_tail(D), append_head(A),
>   append_tail(E) results in the following state.
>
>    A -> B -> C -> D -> E
>   -4   -2   -1    3    5
>
> * As we iterate through a bucket, keep track of the last index number
> we've seen for that bucket, iter->prev_idx.
> * On subsequent iterations, skip ahead in the bucket until we see a
> socket whose index, sk->idx, is greater than iter->prev_idx.
>
> Indexes are globally and temporally unique within a table, and
> by extension each bucket, and always increase as we iterate from head
> to tail. Since the relative order of items within the linked list
> doesn't change, and we don't insert new items into the middle, we can
> be sure that any socket whose index is greater than iter->prev_idx has
> not yet been seen. Any socket whose index is less than or equal to
> iter->prev_idx has either been seen+ before or was added to the head of
> the bucket's list since we last saw that bucket. In either case, it's
> safe to skip them (any new sockets did not exist when we created the
> iterator, so skipping them doesn't create an "inconsistent view").
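>
> A minimal sketch of the bookkeeping (the helper and table names here
> are only illustrative):
>
>   /* Insert side: one counter per table; head inserts get a negative
>    * index, tail inserts a positive one, so indexes always increase
>    * from head to tail.
>    */
>   static void sk_assign_idx(struct hash_table *table, struct sock *sk,
>                             bool at_head)
>   {
>           s64 ver = atomic64_inc_return(&table->ver);
>
>           sk->idx = at_head ? -ver : ver;
>   }
>
>   /* Iteration side: anything at or below the last index we saw was
>    * either seen already or added to the head after the iterator was
>    * created, so it is safe to skip.
>    */
>   sk_for_each(sk, &hslot->head) {
>           if (sk->idx <= iter->prev_idx)
>                   continue;
>           /* ... process sk, then set iter->prev_idx = sk->idx ... */
>   }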
imo, I am not sure we should add this on top of what the bucket already has
(spin_lock and rcu protection). It feels a bit overkill for this edge case.
>
> * Practically speaking, I'm not sure if rollover is a very real concern,
> but we could potentially extend the version/index field to 128 bits
> or have some rollover detection built in as well (although this
> introduces the possibility of repeated sockets again) if there are any
> doubts.
> + If you really wanted to, I guess you could even iterate over a sort of
> "consistent snapshot" of the collection by remembering the initial
> ver in the iterator state, iter->ver, and filtering out any items
> where abs(sk->idx) > iter->ver, but this is probably of little
> practical use and more of an interesting side effect.
>
> SOME ALTERNATIVES
> =================
> 1. One alternative I considered was simply counting the number of
> removals that have occurred per bucket, remembering this between
> calls to bpf_iter_(tcp|udp)_batch() as part of the iterator state,
> then using it to detect if it has changed. If any removals have
> occurred, we would need to walk back iter->offset by at least that
> much to avoid skips. This approach is simpler but may repeat sockets.
> 2. Don't allow partial batches; always make sure we capture all sockets
> in a bucket in one batch. bpf_iter_(tcp|udp)_batch() already has some
> logic to try one time to resize the batch, but still needs to contend
> with the fact that bpf_iter_(tcp|udp)_realloc_batch() may not be able
> to grab more memory, and bpf_iter_(tcp|udp)_put_batch() is called
> between reads anyway, making it necessary to seek to the previous
> offset next time around.
The batch should have a snapshot of the bucket. Practically, it should not have
the "repeating" or "missing" sk issue as long as seq->stop() is not called in
the middle of iterating that batch.

I think this guarantee is enough for the bpf_sock_destroy() and
bpf_setsockopt() use cases if the bpf prog ends up not seq_write()-ing anything.

For printing/dumping, I am not sure whether other iterating interfaces (/proc
or sock_diag) provide a better guarantee than bpf_iter_{tcp,udp}, but yeah, we
can explore whether something better can be done.
> 3. Error out if a scenario is detected where skips may be possible and
> force the application layer to restart iteration. This doesn't seem
> great.
I don't know what a good interface to restart the iteration would be. Restarting
from the very beginning of the hashtable sounds bad.
It has been a while and I may not recall all the details of bpf_seq_read(), but
some food for thought...

One thought is to avoid the seq->stop() that forces the batching to be redone
next time, in particular when the user has provided a buf to read() that is
large enough for one bucket. Maybe we could return whatever seq->buf has to
userspace whenever a bucket/batch is done. This has a perf tradeoff though, and
I am not sure how userspace would opt in.
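For example, the userspace side might look something like this (rough sketch
only; link_fd is assumed to come from attaching the iter prog beforehand):

        char buf[1 << 20];      /* large enough for one bucket's output */
        int iter_fd = bpf_iter_create(link_fd);
        ssize_t n;

        while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
                ;       /* consume the prog's seq_write() output */
        close(iter_fd);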
Another random thought is about seq->stop (bpf_iter_{tcp,udp}_seq_stop). It has
to release the sk refcnts because we don't know when userspace will come back to
read(). Would it be useful if it stored the cookies of the sk(s) that have not
yet been through seq->show?
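Something along these lines, maybe (very rough; pending_cookies is a made-up
field and the other iter state names are from memory):

        static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
        {
                struct bpf_udp_iter_state *iter = seq->private;
                unsigned int i;

                /* remember the sockets that were batched but not yet shown */
                for (i = iter->cur_sk; i < iter->end_sk; i++)
                        iter->pending_cookies[i - iter->cur_sk] =
                                sock_gen_cookie(iter->batch[i]);

                /* then release the sk refcnts and finish as today */
        }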