netdev - Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb4068fd-6f69-845f-813f-d8691a966010@iogearbox.net>
Date: Thu, 21 Dec 2023 21:27:40 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: 'Alexei Starovoitov ' <ast@...nel.org>,
 'Andrii Nakryiko ' <andrii@...nel.org>, netdev@...r.kernel.org,
 kernel-team@...a.com, Aditi Ghag <aditi.ghag@...valent.com>,
 bpf@...r.kernel.org
Subject: Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress
 in bpf_iter_udp

On 12/21/23 3:58 PM, Martin KaFai Lau wrote:
> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>>
>>>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>>
>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>>
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       struct net *net = seq_file_net(seq);
>>>>       struct udp_table *udptable;
>>>>       unsigned int batch_sks = 0;
>>>> +    int bucket, bucket_offset;
>>>>       bool resized = false;
>>>>       struct sock *sk;
>>>>
>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       iter->end_sk = 0;
>>>>       iter->st_bucket_done = false;
>>>>       batch_sks = 0;
>>>> +    bucket = state->bucket;
>>>> +    bucket_offset = 0;
>>>>
>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>
>>>> -        if (hlist_empty(&hslot2->head)) {
>>>> -            iter->offset = 0;
>>>> +        if (hlist_empty(&hslot2->head))
>>>>               continue;
>>>> -        }
>>>>
>>>>           spin_lock_bh(&hslot2->lock);
>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>                   /* Resume from the last iterated socket at the
>>>>                    * offset in the bucket before iterator was stopped.
>>>>                    */
>>>> -                if (iter->offset) {
>>>> -                    --iter->offset;
>>>> +                if (state->bucket == bucket &&
>>>> +                    bucket_offset < iter->offset) {
>>>> +                    ++bucket_offset;
>>>>                       continue;
>>>>                   }
>>>>                   if (iter->end_sk < iter->max_sk) {
>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>
>>>>           if (iter->end_sk)
>>>>               break;
>>>> +    }
>>>>
>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>> +    if (state->bucket != bucket)
>>>>           iter->offset = 0;
>>>> -    }
>>>>
>>>>       /* All done: no batch made. */
>>>>       if (!iter->end_sk)
>>>
>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
>>>
>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>>
>> Hm, my assumption was.. why not do something like the below, and fully start over?
>>
>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>> in the bucket could already have changed.. maybe I'm still missing something - what do
>> we need to deal with exactly worst case when we need to go and retry everything, and what
>> guarantees do we have?
>>
>> (only compile tested)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..ca62a4bb7bec 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       struct udp_iter_state *state = &iter->state;
>>       struct net *net = seq_file_net(seq);
>>       struct udp_table *udptable;
>> -    unsigned int batch_sks = 0;
>> +    int orig_bucket, orig_offset;
>> +    unsigned int i, batch_sks = 0;
>>       bool resized = false;
>>       struct sock *sk;
>>
>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       }
>>
>>       udptable = udp_get_table_seq(seq, net);
>> -
>> +    orig_bucket = state->bucket;
>> +    orig_offset = iter->offset;
>>   again:
>>       /* New batch for the next bucket.
>>        * Iterate over the hash table to find a bucket with sockets matching
>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>           resized = true;
>>           /* After allocating a larger batch, retry one more time to grab
>> -         * the whole bucket.
>> +         * the whole bucket. Drop the current refs since for the next
>> +         * attempt the composition could have changed, thus start over.
>>            */
>> -        state->bucket--;
>> +        for (i = 0; i < iter->end_sk; i++) {
>> +            sock_put(iter->batch[i]);
>> +            iter->batch[i] = NULL;
>> +        }
>> +        state->bucket = orig_bucket;
>> +        iter->offset = orig_offset;
> 
> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.
> 
> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.

Fair, I was thinking in relation to the above to avoid such inconsistency, hence the
drop of the refs. I think it's probably fine to argue either way on how semantics
should be as long as the code doesn't get overly complex for covering this corner
case.

Thanks,
Daniel