lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 21 Dec 2023 21:27:40 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: 'Alexei Starovoitov ' <ast@...nel.org>,
 'Andrii Nakryiko ' <andrii@...nel.org>, netdev@...r.kernel.org,
 kernel-team@...a.com, Aditi Ghag <aditi.ghag@...valent.com>,
 bpf@...r.kernel.org
Subject: Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress
 in bpf_iter_udp

On 12/21/23 3:58 PM, Martin KaFai Lau wrote:
> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>>
>>>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>>
>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>>
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       struct net *net = seq_file_net(seq);
>>>>       struct udp_table *udptable;
>>>>       unsigned int batch_sks = 0;
>>>> +    int bucket, bucket_offset;
>>>>       bool resized = false;
>>>>       struct sock *sk;
>>>>
>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       iter->end_sk = 0;
>>>>       iter->st_bucket_done = false;
>>>>       batch_sks = 0;
>>>> +    bucket = state->bucket;
>>>> +    bucket_offset = 0;
>>>>
>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>
>>>> -        if (hlist_empty(&hslot2->head)) {
>>>> -            iter->offset = 0;
>>>> +        if (hlist_empty(&hslot2->head))
>>>>               continue;
>>>> -        }
>>>>
>>>>           spin_lock_bh(&hslot2->lock);
>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>                   /* Resume from the last iterated socket at the
>>>>                    * offset in the bucket before iterator was stopped.
>>>>                    */
>>>> -                if (iter->offset) {
>>>> -                    --iter->offset;
>>>> +                if (state->bucket == bucket &&
>>>> +                    bucket_offset < iter->offset) {
>>>> +                    ++bucket_offset;
>>>>                       continue;
>>>>                   }
>>>>                   if (iter->end_sk < iter->max_sk) {
>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>
>>>>           if (iter->end_sk)
>>>>               break;
>>>> +    }
>>>>
>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>> +    if (state->bucket != bucket)
>>>>           iter->offset = 0;
>>>> -    }
>>>>
>>>>       /* All done: no batch made. */
>>>>       if (!iter->end_sk)
>>>
>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
>>>
>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>>
>> Hm, my assumption was.. why not do something like the below, and fully start over?
>>
>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>> in the bucket could already have changed.. maybe I'm still missing something - what do
>> we need to deal with exactly worst case when we need to go and retry everything, and what
>> guarantees do we have?
>>
>> (only compile tested)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..ca62a4bb7bec 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       struct udp_iter_state *state = &iter->state;
>>       struct net *net = seq_file_net(seq);
>>       struct udp_table *udptable;
>> -    unsigned int batch_sks = 0;
>> +    int orig_bucket, orig_offset;
>> +    unsigned int i, batch_sks = 0;
>>       bool resized = false;
>>       struct sock *sk;
>>
>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       }
>>
>>       udptable = udp_get_table_seq(seq, net);
>> -
>> +    orig_bucket = state->bucket;
>> +    orig_offset = iter->offset;
>>   again:
>>       /* New batch for the next bucket.
>>        * Iterate over the hash table to find a bucket with sockets matching
>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>           resized = true;
>>           /* After allocating a larger batch, retry one more time to grab
>> -         * the whole bucket.
>> +         * the whole bucket. Drop the current refs since for the next
>> +         * attempt the composition could have changed, thus start over.
>>            */
>> -        state->bucket--;
>> +        for (i = 0; i < iter->end_sk; i++) {
>> +            sock_put(iter->batch[i]);
>> +            iter->batch[i] = NULL;
>> +        }
>> +        state->bucket = orig_bucket;
>> +        iter->offset = orig_offset;
> 
> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.
> 
> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.

Fair, I was thinking in relation to the above to avoid such inconsistency, hence the
drop of the refs. I think it's probably fine to argue either way on how semantics
should be as long as the code doesn't get overly complex for covering this corner
case.

Thanks,
Daniel

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ