Message-ID: <CAO3-PbrGR-U0+TxgeUUZHSwvNw2Dx=o6_EYf60ND7BLiLA-xWQ@mail.gmail.com>
Date: Thu, 6 Feb 2025 02:20:02 -0600
From: Yan Zhai <yan@...udflare.com>
To: Hou Tao <houtao@...weicloud.com>
Cc: bpf@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Mykola Lysenko <mykolal@...com>, Shuah Khan <shuah@...nel.org>, Brian Vazquez <brianvv@...gle.com>,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
kernel-team@...udflare.com
Subject: Re: [PATCH bpf] bpf: skip non existing key in generic_map_lookup_batch

On Thu, Feb 6, 2025 at 12:40 AM Hou Tao <houtao@...weicloud.com> wrote:
>
> Hi,
>
> On 2/6/2025 12:57 AM, Yan Zhai wrote:
> > The generic_map_lookup_batch currently returns EINTR if it fails with
> > ENOENT and retries several times on bpf_map_copy_value. The next batch
> > would start from the same location, presuming it's a transient issue.
> > This is incorrect if a map can actually have "holes", i.e.
> > "get_next_key" can return a key that does not point to a valid value. At
> > least the array of maps type may contain such holes legitimately. Once
> > such holes show up, generic batch lookup cannot proceed any more; it
> > will always fail with EINTR errors.
> >
> > Rather, do not retry in generic_map_lookup_batch. If it finds a
> > non-existent element, skip to the next key.
> >
> > Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
> > Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
> > Signed-off-by: Yan Zhai <yan@...udflare.com>
> > ---
> > kernel/bpf/syscall.c | 16 ++----
> > .../bpf/map_tests/map_in_map_batch_ops.c | 54 ++++++++++++++-----
> > 2 files changed, 45 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..5691fc0d370d 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1979,7 +1979,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> > void __user *values = u64_to_user_ptr(attr->batch.values);
> > void __user *keys = u64_to_user_ptr(attr->batch.keys);
> > void *buf, *buf_prevkey, *prev_key, *key, *value;
> > - int err, retry = MAP_LOOKUP_RETRIES;
> > + int err;
> > u32 value_size, cp, max_count;
> >
> > if (attr->batch.elem_flags & ~BPF_F_LOCK)
> > @@ -2026,14 +2026,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> > err = bpf_map_copy_value(map, key, value,
> > attr->batch.elem_flags);
> >
> > - if (err == -ENOENT) {
> > - if (retry) {
> > - retry--;
> > - continue;
> > - }
> > - err = -EINTR;
> > - break;
> > - }
> > + if (err == -ENOENT)
> > + goto next_key;
> >
> > if (err)
> > goto free_buf;
> > @@ -2048,12 +2042,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
> > goto free_buf;
> > }
> >
> > + cp++;
> > +next_key:
> > if (!prev_key)
> > prev_key = buf_prevkey;
> >
> > swap(prev_key, key);
> > - retry = MAP_LOOKUP_RETRIES;
> > - cp++;
> > cond_resched();
> > }
> >
>
> Let's move to this new thread for further discussion.
>
> > We are probably not on the same page. Let me clarify:
>
> > By "retry logic" I mean this code snippet:
> > if (err == -ENOENT) {
> > if (retry) {
> > retry--;
> > continue;
> > }
> > err = -EINTR;
> > break;
> > }
>
>
> Yes, the retry logic doesn't change the previous key. Thanks for
> clarifying.
> > And by "skipping to the next key", it's simply
> >
> > if (err == -ENOENT)
> > goto next_key;
> >
> > Note the "next_key" label was not in the current codebase. It is only
> > in my posted patch. I don't think this would break lpm_trie unless I
> > missed something.
>
> I was trying to say that the proposed patch may break the lookup_batch
> operation for lpm_trie. Let me explain step by step:
>
> For LPM trie map:
> (1) ->map_get_next_key(map, prev_key, key) returns a valid key
>
> (2) bpf_map_copy_value() returns -ENOENT
> It means the key must be deleted concurrently.
>
I see what you mean now, thanks for the detailed explanation!
So for lpm_trie, if an element is deleted between get_next_key and
copy_value, then retry would still proceed to its next tree node. But
with or without retry, I think we cannot prevent the key from cycling
back to the beginning: an element can be deleted after copy_value, so
the key becomes invalid. After swapping it with prev_key and calling
bpf_get_next_key, iteration cycles back to the leftmost node. Something
similar can happen if the prev_key itself gets invalidated during a
retry. IIUC the RCU locking in lookup really cannot prevent the tree
structure from changing.
On the other hand, the bpf_map_get_next_key manual already specifies
"If a key is not found, the operation returns zero and sets the
next_key pointer to the key of the first element.". IMHO it makes more
sense that, for both normal and batch lookup, it is ultimately the
user's responsibility to properly synchronize, or to deduplicate what
is returned from the kernel. I intend to keep the "skip to next"
behavior to avoid unnecessary complexity here, unless there are strong
objections.
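To illustrate what I mean by deduplicating on the caller side, here is
a minimal userspace sketch. It is not part of the patch; it assumes
libbpf's bpf_map_lookup_batch(), a map whose __u32 keys stay below
MAX_ENTRIES, and no special error handling. map_fd, MAX_ENTRIES and
lookup_all() are made-up names:

#include <errno.h>
#include <stdbool.h>
#include <bpf/bpf.h>

#define MAX_ENTRIES 1024 /* hypothetical upper bound on __u32 key values */

/* Walk the whole map in batches and return the number of distinct keys
 * seen, instead of trusting the raw per-batch counts.
 */
static int lookup_all(int map_fd)
{
        __u32 keys[MAX_ENTRIES], vals[MAX_ENTRIES];
        bool seen[MAX_ENTRIES] = {};
        __u32 batch = 0, count, total = 0, distinct = 0, i;
        int err;

        do {
                count = MAX_ENTRIES;
                err = bpf_map_lookup_batch(map_fd, total ? &batch : NULL,
                                           &batch, keys, vals, &count, NULL);
                if (err && errno != ENOENT)
                        return -errno;
                /* Under concurrent deletes the kernel may hand back a key
                 * more than once across batches; deduplicate here instead
                 * of expecting the kernel to guarantee uniqueness.
                 */
                for (i = 0; i < count; i++) {
                        if (keys[i] < MAX_ENTRIES && !seen[keys[i]]) {
                                seen[keys[i]] = true;
                                distinct++;
                        }
                }
                total += count;
        } while (!err); /* ENOENT from the last call means iteration is done */

        return distinct;
}

Whether or not a key repeats, the caller ends up with each live key at
most once, which is about all the batch API can promise under
concurrent modification.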
> (3) goto next_key
> It swaps the prev_key and key
>
> (4) ->map_get_next_key(map, prev_key, key) again
> prev_key points to a non-existing key; for LPM trie this is treated
> just like the prev_key=NULL case, so the returned key will be
> duplicated.
>
>
> > diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> > index 66191ae9863c..b38be71f06be 100644
>
> It is better to split the update of map_tests into a separate patch;
> that will be friendlier for stable backports.
Sure, I can do that.
> > --- a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> > +++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> > @@ -120,11 +120,12 @@ static void validate_fetch_results(int outer_map_fd,
> >
> > static void fetch_and_validate(int outer_map_fd,
> > struct bpf_map_batch_opts *opts,
> > - __u32 batch_size, bool delete_entries)
> > + __u32 batch_size, bool delete_entries,
> > + bool has_holes)
> > {
> > + int err, max_entries = OUTER_MAP_ENTRIES - !!has_holes;
> > __u32 *fetched_keys, *fetched_values, total_fetched = 0;
> > __u32 batch_key = 0, fetch_count, step_size;
> > - int err, max_entries = OUTER_MAP_ENTRIES;
> > __u32 value_size = sizeof(__u32);
> >
> > /* Total entries needs to be fetched */
> > @@ -135,9 +136,9 @@ static void fetch_and_validate(int outer_map_fd,
> > "error=%s\n", strerror(errno));
> >
> > for (step_size = batch_size;
> > - step_size <= max_entries;
> > + step_size < max_entries + batch_size; /* allow read partial */
> > step_size += batch_size) {
> > - fetch_count = step_size;
> > + fetch_count = batch_size;
>
> The change "fetch_count = batch_size" may fail the lookup batch
> operation on a hash table, because the number of elements in one
> bucket may be greater than batch_size, and in that case it will return
> -ENOSPC constantly. And it seems the original implementation of
> fetch_and_validate() is buggy, because for a hash map the returned
> fetched_count may be less than the passed count when there are too
> many elements in the same bucket. I don't know why the bug doesn't
> show up.
ACK on the ENOSPC part. I should indeed increase the fetch size. The
original code was buggy: my patch failed BPF CI for the unrelated hash
of maps case. The fetch loop should not have assumed a full batch
being returned. But it's purely luck whether the tests fail in
practice; my local vmtests all passed >.< I will fix this in V2.
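Roughly, what I have in mind for the fetch loop is something like the
sketch below (illustration only, V2 may end up different): accumulate
whatever count the kernel actually returns, and grow the request on
-ENOSPC instead of failing when a hash bucket is larger than the
current step. It assumes the keys/values buffers are sized for
OUTER_MAP_ENTRIES and that nothing modifies the map concurrently;
fetch_all() itself is a made-up helper:

static int fetch_all(int map_fd, struct bpf_map_batch_opts *opts,
                     __u32 batch_size, __u32 max_entries,
                     __u32 *keys, __u32 *values)
{
        __u32 batch_key = 0, total = 0, count, step = batch_size;
        int err;

        while (total < max_entries) {
                count = step;
                err = bpf_map_lookup_batch(map_fd,
                                           total ? &batch_key : NULL,
                                           &batch_key,
                                           keys + total, values + total,
                                           &count, opts);
                if (err && errno == ENOSPC) {
                        /* A bucket holds more elements than requested:
                         * grow the request and retry from the same batch.
                         */
                        step += batch_size;
                        continue;
                }
                if (err && errno != ENOENT)
                        return -errno;
                /* A partial batch is fine, simply accumulate what we got. */
                total += count;
                if (err) /* ENOENT: no more entries to fetch */
                        break;
        }

        return total == max_entries ? 0 : -EINVAL;
}

The exact batching semantics differ per map type, but none of them
guarantees that a requested count is fully filled, so the test should
only check the grand total at the end.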
> > err = delete_entries
> > ? bpf_map_lookup_and_delete_batch(outer_map_fd,
> > total_fetched ? &batch_key : NULL,
> > @@ -184,18 +185,19 @@ static void fetch_and_validate(int outer_map_fd,
> > }
> >
> > static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> > - enum bpf_map_type inner_map_type)
> > + enum bpf_map_type inner_map_type,
> > + bool has_holes)
> > {
> > + __u32 max_entries = OUTER_MAP_ENTRIES - !!has_holes;
> > __u32 *outer_map_keys, *inner_map_fds;
> > - __u32 max_entries = OUTER_MAP_ENTRIES;
> > LIBBPF_OPTS(bpf_map_batch_opts, opts);
> > __u32 value_size = sizeof(__u32);
> > int batch_size[2] = {5, 10};
> > __u32 map_index, op_index;
> > int outer_map_fd, ret;
> >
> > - outer_map_keys = calloc(max_entries, value_size);
> > - inner_map_fds = calloc(max_entries, value_size);
> > + outer_map_keys = calloc(OUTER_MAP_ENTRIES, value_size);
> > + inner_map_fds = calloc(OUTER_MAP_ENTRIES, value_size);
> > CHECK((!outer_map_keys || !inner_map_fds),
> > "Memory allocation failed for outer_map_keys or inner_map_fds",
> > "error=%s\n", strerror(errno));
> > @@ -209,6 +211,24 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> > ((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> > ? 9 : 1000) - map_index;
> >
> > + /* This condition is only meaningful for array of maps.
> > + *
> > + * max_entries == OUTER_MAP_ENTRIES - 1 if it is true. Say
> > + * max_entries is short for n, then outer_map_keys looks like:
> > + *
> > + * [n, n-1, ... 2, 1]
> > + *
> > + * We change it to
> > + *
> > + * [n, n-1, ... 2, 0]
> > + *
> > + * So it will leave key 1 as a hole. It will serve to test the
> > + * correctness when batch on an array: a "non-exist" key might be
> > + * actually allocated and returned from key iteration.
> > + */
> > + if (has_holes)
> > + outer_map_keys[max_entries - 1]--;
> > +
> > /* batch operation - map_update */
> > ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
> > inner_map_fds, &max_entries, &opts);
> > @@ -219,12 +239,14 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> > /* batch operation - map_lookup */
> > for (op_index = 0; op_index < 2; ++op_index)
> > fetch_and_validate(outer_map_fd, &opts,
> > - batch_size[op_index], false);
> > + batch_size[op_index], false,
> > + has_holes);
> >
> > /* batch operation - map_lookup_delete */
> > if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
> > fetch_and_validate(outer_map_fd, &opts,
> > - max_entries, true /*delete*/);
> > + max_entries, true /*delete*/,
> > + has_holes);
> >
> > /* close all map fds */
> > for (map_index = 0; map_index < max_entries; map_index++)
>
> OUTER_MAP_ENTRIES instead of max_entries ?
Good catch!
thanks
Yan
> > @@ -237,16 +259,20 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> >
> > void test_map_in_map_batch_ops_array(void)
> > {
> > - _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
> > printf("%s:PASS with inner ARRAY map\n", __func__);
> > - _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, false);
> > printf("%s:PASS with inner HASH map\n", __func__);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, true);
> > + printf("%s:PASS with inner ARRAY map with holes\n", __func__);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, true);
> > + printf("%s:PASS with inner HASH map with holes\n", __func__);
> > }
> >
> > void test_map_in_map_batch_ops_hash(void)
> > {
> > - _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
> > printf("%s:PASS with inner ARRAY map\n", __func__);
> > - _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
> > + _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH, false);
> > printf("%s:PASS with inner HASH map\n", __func__);
> > }
>