Message-ID: <CABi4-ojQVb=8SKGNubpy=bG4pg1o=tNaz9UspYDTbGTPZTu8gQ@mail.gmail.com>
Date: Fri, 11 Apr 2025 16:31:18 -0700
From: Jordan Rife <jordan@...fe.io>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Aditi Ghag <aditi.ghag@...valent.com>, Daniel Borkmann <daniel@...earbox.net>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
> Since the cookie change is in the next patch, it will be useful to mention it is
> a prep work for the next patch.
Sure, will do.
> The resized == true case will have a similar issue. Meaning the next
> bpf_iter_udp_batch() will end up skipping the remaining sk in that bucket, e.g.
> the partial-bucket batch has been consumed, so cur_sk == end_sk but
> st_bucket_done == false and bpf_iter_udp_resume() returns NULL. It is sort of a
> regression from the current "offset" implementation for this case. Any thought
> on how to make it better?
Are you referring to the case where the bucket grows so much between
releasing and reacquiring the bucket's lock that we still can't fit all
of its sockets into our batch, even after a
bpf_iter_udp_realloc_batch()? If so, I think we touched on this a bit
in [1]:
> > Barring the possibility that bpf_iter_udp_realloc_batch() fails to
> > grab more memory (should this basically never happen?), this should
> > ensure that we always grab the full contents of the bucket on the
> > second go around. However, the spin lock on hslot2->lock is released
> > before doing this. Would it not be more accurate to hold onto the lock
> > until after the second attempt, so we know the size isn't changing
> > between the time where we release the lock and the time when we
> > reacquire it post-batch-resize. The bucket size would have to grow by
> > more than 1.5x for the new size to be insufficient, so I may be
> > splitting hairs here, but just something I noticed.
>
> It is a very corner case.
>
> I guess it can with GFP_ATOMIC. I just didn't think it was needed considering
> the key of the hash is addresses+ports. If we have many socks collided on the
> same addresses+ports bucket, that would be a better hashtable problem to solve
> first.
>
> The default batch size is 16 now. On the listening hashtable + SO_REUSEPORT,
> userspace may have one listen sk binding on the same address+port for each
> thread. It is not uncommon to have hundreds of CPU cores now, so it may actually
> need to hit the realloc_batch() path once and then likely will stay at that size
> for the whole hashtable iteration.
Would it be worth using GFP_ATOMIC here so that we can hold onto the
spinlock and guarantee the bucket size doesn't change?
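Hand-wavy sketch of what I'm imagining (untested; it assumes
bpf_iter_udp_realloc_batch() grows a gfp_t argument so it can be called
with the bucket lock held, and reuses the existing locals from
bpf_iter_udp_batch()):
	spin_lock_bh(&hslot2->lock);
	/* First pass: count how many sockets we'd need to batch. */
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk))
			batch_sks++;
	}
	if (batch_sks > iter->max_sk &&
	    bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2, GFP_ATOMIC)) {
		spin_unlock_bh(&hslot2->lock);
		return ERR_PTR(-ENOMEM);
	}
	/* Second pass: fill iter->batch[] as today, then drop the lock. */
	spin_unlock_bh(&hslot2->lock);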
Some alternatives I can imagine are:
1) Loop until iter->end_sk == batch_sks, possibly calling realloc a
couple of times. The unbounded loop is a bit worrying;
bpf_iter_udp_batch() could keep "racing" with the bucket if its size
keeps growing.
2) Loop some bounded number of times and return some ERR_PTR(x) if the
loop still can't keep up after a few tries, so we don't break the
invariant that the batch is always a full snapshot of a bucket.
3) Set some flag in the iterator state, e.g. iter->is_partial,
indicating to the next call to bpf_iter_udp_batch() that the last
batch was only a partial snapshot of the bucket, so that if it can't
find any of the cookies from last time it starts over from the
beginning of the bucket instead of advancing to the next one (rough
sketch below). In the worst case this may repeat sockets we've already
seen, but that's still better than skipping them.
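For what it's worth, (3) might look roughly like this (is_partial and
first_sk are made-up names here, and the cookie-based
bpf_iter_udp_resume() is from the next patch, so I'm eliding its exact
signature):
	sk = bpf_iter_udp_resume(...);	/* find the first remembered cookie */
	if (!sk && iter->is_partial) {
		/* None of the remembered sockets are still in the bucket
		 * and the previous batch didn't cover the whole bucket, so
		 * rewind to the head of the bucket instead of advancing,
		 * accepting that some already-seen sockets may be show()n
		 * again.
		 */
		sk = first_sk;
	}
	/* ...batch from sk as usual, then remember whether we got it all: */
	iter->is_partial = iter->end_sk != batch_sks;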
I kind of like the GFP_ATOMIC approach, if it's possible.
[1]: https://lore.kernel.org/bpf/c53cee32-02c0-4c5a-a57d-910b12e73afd@linux.dev/
-Jordan