Message-ID: <639b1f3f-cb53-4058-8426-14bd50f2b78f@linux.dev>
Date: Thu, 21 Dec 2023 06:58:04 -0800
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: 'Alexei Starovoitov ' <ast@...nel.org>,
'Andrii Nakryiko ' <andrii@...nel.org>, netdev@...r.kernel.org,
kernel-team@...a.com, Aditi Ghag <aditi.ghag@...valent.com>,
bpf@...r.kernel.org
Subject: Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress
in bpf_iter_udp
On 12/21/23 5:21 AM, Daniel Borkmann wrote:
> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>> Good catch. It will unnecessarily skip sockets in the following batch/bucket
>>> if there are changes in the current batch/bucket.
>>>
>>> From looking at the loop again, I think it is better not to change the
>>> iter->offset during the for loop. Only update iter->offset after the for loop
>>> has concluded.
>>>
>>> The non-zero iter->offset is only useful for the first bucket, so do a test
>>> on the first bucket (state->bucket == bucket) before skipping sockets.
>>> Something like this:
>>>
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 89e5a806b82e..a993f364d6ae 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> struct net *net = seq_file_net(seq);
>>> struct udp_table *udptable;
>>> unsigned int batch_sks = 0;
>>> + int bucket, bucket_offset;
>>> bool resized = false;
>>> struct sock *sk;
>>>
>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> iter->end_sk = 0;
>>> iter->st_bucket_done = false;
>>> batch_sks = 0;
>>> + bucket = state->bucket;
>>> + bucket_offset = 0;
>>>
>>> for (; state->bucket <= udptable->mask; state->bucket++) {
>>> struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>
>>> - if (hlist_empty(&hslot2->head)) {
>>> - iter->offset = 0;
>>> + if (hlist_empty(&hslot2->head))
>>> continue;
>>> - }
>>>
>>> spin_lock_bh(&hslot2->lock);
>>> udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> /* Resume from the last iterated socket at the
>>> * offset in the bucket before iterator was stopped.
>>> */
>>> - if (iter->offset) {
>>> - --iter->offset;
>>> + if (state->bucket == bucket &&
>>> + bucket_offset < iter->offset) {
>>> + ++bucket_offset;
>>> continue;
>>> }
>>> if (iter->end_sk < iter->max_sk) {
>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>
>>> if (iter->end_sk)
>>> break;
>>> + }
>>>
>>> - /* Reset the current bucket's offset before moving to the next bucket. */
>>> + if (state->bucket != bucket)
>>> iter->offset = 0;
>>> - }
>>>
>>> /* All done: no batch made. */
>>> if (!iter->end_sk)
>>
>> I think I found another bug in the current bpf_iter_udp_batch(). The
>> "state->bucket--;" at the end of the batch() function is also wrong. It does
>> not need to go back to the previous bucket. After realloc with a larger batch
>> array, it should retry on the "state->bucket" as-is. I forced bind() to use
>> bucket 0 and bound a larger so_reuseport set (24 sockets), and
>> WARN_ON(state->bucket < 0) triggered.
>>
>> Going back to this bug (backward progress on --iter->offset), I think it is a
>> bit cleaner to always reset iter->offset to 0 and advance iter->offset to the
>> resume_offset only when needed. Something like this:
>
> Hm, my assumption was.. why not do something like the below, and fully start over?
>
> I'm mostly puzzled about the side-effects here, in particular whether the
> sockets in the bucket could already have changed for the rerun. Maybe I'm
> still missing something - what exactly do we need to deal with in the worst
> case when we have to go back and retry everything, and what guarantees do
> we have?
>
> (only compile tested)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 89e5a806b82e..ca62a4bb7bec 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> struct udp_iter_state *state = &iter->state;
> struct net *net = seq_file_net(seq);
> struct udp_table *udptable;
> - unsigned int batch_sks = 0;
> + int orig_bucket, orig_offset;
> + unsigned int i, batch_sks = 0;
> bool resized = false;
> struct sock *sk;
>
> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> }
>
> udptable = udp_get_table_seq(seq, net);
> -
> + orig_bucket = state->bucket;
> + orig_offset = iter->offset;
> again:
> /* New batch for the next bucket.
> * Iterate over the hash table to find a bucket with sockets matching
> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> resized = true;
> /* After allocating a larger batch, retry one more time to grab
> - * the whole bucket.
> + * the whole bucket. Drop the current refs since for the next
> + * attempt the composition could have changed, thus start over.
> */
> - state->bucket--;
> + for (i = 0; i < iter->end_sk; i++) {
> + sock_put(iter->batch[i]);
> + iter->batch[i] = NULL;
> + }
> + state->bucket = orig_bucket;
> + iter->offset = orig_offset;
It does not need to start over from the orig_bucket. Once the iterator has
advanced to the next bucket (state->bucket++), the orig_bucket is done.
Otherwise, it may have to make backward progress here on state->bucket. The
"batch size too small" condition happens on the current state->bucket, so it
should retry with the same state->bucket after realloc_batch(). If state->bucket
happens to be the orig_bucket (meaning it has not advanced), it will skip the
same orig_offset.
If the orig_bucket had changed (e.g. gained more sockets than the last time it
was batched) after state->bucket++, that is arguably fine because those sockets
were added after the orig_bucket was completely captured in a batch. The same
goes for (orig_bucket-1), which could have changed during the whole udp_table
iteration.
> goto again;
> }
> done:
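
To make the intended semantics easier to follow, here is a minimal userspace
toy of the batching logic discussed above (purely illustrative, not the kernel
patch: struct iter, batch(), the bucket sizes, and the starting offset below
are made up). It only demonstrates the two points argued in this thread:
iter->offset is honored solely while the iterator is still on the bucket it
stopped in, and a too-small batch is retried on the same bucket after growing
the capacity, without ever decrementing the bucket index:

#include <stdio.h>

#define NBUCKETS 4
static const int bucket_len[NBUCKETS] = { 3, 0, 24, 2 }; /* sockets per bucket */

struct iter {
	int bucket;	/* like state->bucket */
	int offset;	/* like iter->offset: sockets already consumed in bucket */
	int max_sk;	/* batch capacity, grown on "realloc" */
};

/* Returns how many sockets were batched from the current bucket. */
static int batch(struct iter *it)
{
	int resume_bucket = it->bucket;
	int resume_offset = it->offset;
	int resized = 0;
	int n, total;

again:
	n = total = 0;
	for (; it->bucket < NBUCKETS; it->bucket++) {
		int start = 0, i;

		if (!bucket_len[it->bucket])
			continue;
		/* The resume offset only applies to the bucket we stopped in. */
		if (it->bucket == resume_bucket)
			start = resume_offset;
		for (i = start; i < bucket_len[it->bucket]; i++) {
			if (n < it->max_sk)
				n++;		/* "sock_hold() into the batch" */
			total++;
		}
		if (n)
			break;
	}
	if (it->bucket != resume_bucket)
		it->offset = 0;
	if (!n)
		return 0;			/* all buckets done */
	if (n == total) {			/* whole bucket captured */
		it->bucket++;
		it->offset = 0;
	} else if (!resized) {
		it->max_sk = total;		/* "realloc" to fit the bucket */
		resized = 1;
		goto again;			/* retry the SAME bucket, no bucket-- */
	} else {
		it->offset += n;		/* partial batch: remember progress */
	}
	return n;
}

int main(void)
{
	struct iter it = { .bucket = 0, .offset = 1, .max_sk = 4 };
	int n;

	while ((n = batch(&it)) > 0)
		printf("batched %d sockets, next bucket %d, offset %d\n",
		       n, it.bucket, it.offset);
	return 0;
}

With a 24-socket bucket and an initial capacity of 4, the run shows exactly
one retry on the same bucket and no sockets skipped in the later buckets.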