Message-ID: <5f6610a1-3cd0-764e-0f49-91af1004ea50@huaweicloud.com>
Date: Thu, 6 Feb 2025 14:39:52 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Yan Zhai <yan@...udflare.com>, bpf@...r.kernel.org
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
<martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Mykola Lysenko <mykolal@...com>, Shuah Khan <shuah@...nel.org>,
Brian Vazquez <brianvv@...gle.com>, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: [PATCH bpf] bpf: skip non existing key in
generic_map_lookup_batch
Hi,
On 2/6/2025 12:57 AM, Yan Zhai wrote:
> The generic_map_lookup_batch currently returns EINTR if it fails with
> ENOENT and retries several times on bpf_map_copy_value. The next batch
> would start from the same location, presuming it's a transient issue.
> This is incorrect if a map can actually have "holes", i.e.
> "get_next_key" can return a key that does not point to a valid value.
> At least the array-of-maps type may legitimately contain such holes.
> Once such a hole shows up, generic batch lookup cannot proceed any
> more; it will always fail with EINTR.
>
> Rather, do not retry in generic_map_lookup_batch. If it finds a
> non-existent element, skip to the next key.
>
> Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
> Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
> Signed-off-by: Yan Zhai <yan@...udflare.com>
> ---
> kernel/bpf/syscall.c | 16 ++----
> .../bpf/map_tests/map_in_map_batch_ops.c | 54 ++++++++++++++-----
> 2 files changed, 45 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..5691fc0d370d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1979,7 +1979,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> void __user *values = u64_to_user_ptr(attr->batch.values);
> void __user *keys = u64_to_user_ptr(attr->batch.keys);
> void *buf, *buf_prevkey, *prev_key, *key, *value;
> - int err, retry = MAP_LOOKUP_RETRIES;
> + int err;
> u32 value_size, cp, max_count;
>
> if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -2026,14 +2026,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> err = bpf_map_copy_value(map, key, value,
> attr->batch.elem_flags);
>
> - if (err == -ENOENT) {
> - if (retry) {
> - retry--;
> - continue;
> - }
> - err = -EINTR;
> - break;
> - }
> + if (err == -ENOENT)
> + goto next_key;
>
> if (err)
> goto free_buf;
> @@ -2048,12 +2042,12 @@ int generic_map_lookup_batch(struct bpf_map *map,
> goto free_buf;
> }
>
> + cp++;
> +next_key:
> if (!prev_key)
> prev_key = buf_prevkey;
>
> swap(prev_key, key);
> - retry = MAP_LOOKUP_RETRIES;
> - cp++;
> cond_resched();
> }
>
Let's move to this new thread for further discussion.
> We are probably not on the same page. Let me clarify:
>By "retry logic" I mean this code snippet:
> if (err == -ENOENT) {
> if (retry) {
> retry--;
> continue;
> }
> err = -EINTR;
> break;
> }
Yes, the retry logic doesn't change the previous key. Thanks for the
clarification.
> And by "skipping to the next key", it's simply
>
> if (err == -ENOENT)
> goto next_key;
>
> Note the "next_key" label was not in the current codebase. It is only
> in my posted patch. I don't think this would break lpm_trie unless I
> missed something.
I was trying to say that the proposed patch may break the lookup_batch
operation for lpm_trie. Let me explain step by step for an LPM trie map:

(1) ->map_get_next_key(map, prev_key, key) returns a valid key
(2) bpf_map_copy_value() returns -ENOENT
    This means the key must have been deleted concurrently.
(3) goto next_key
    It swaps prev_key and key.
(4) ->map_get_next_key(map, prev_key, key) again
    prev_key now points to a non-existent key. The LPM trie treats that
    just like the prev_key == NULL case, so the returned key will be a
    duplicate.
> diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> index 66191ae9863c..b38be71f06be 100644
It is better to split the update of map_tests into a separate patch;
that will be more friendly for stable backports.
> --- a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> +++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
> @@ -120,11 +120,12 @@ static void validate_fetch_results(int outer_map_fd,
>
> static void fetch_and_validate(int outer_map_fd,
> struct bpf_map_batch_opts *opts,
> - __u32 batch_size, bool delete_entries)
> + __u32 batch_size, bool delete_entries,
> + bool has_holes)
> {
> + int err, max_entries = OUTER_MAP_ENTRIES - !!has_holes;
> __u32 *fetched_keys, *fetched_values, total_fetched = 0;
> __u32 batch_key = 0, fetch_count, step_size;
> - int err, max_entries = OUTER_MAP_ENTRIES;
> __u32 value_size = sizeof(__u32);
>
> /* Total entries needs to be fetched */
> @@ -135,9 +136,9 @@ static void fetch_and_validate(int outer_map_fd,
> "error=%s\n", strerror(errno));
>
> for (step_size = batch_size;
> - step_size <= max_entries;
> + step_size < max_entries + batch_size; /* allow read partial */
> step_size += batch_size) {
> - fetch_count = step_size;
> + fetch_count = batch_size;
The change "fetch_count = batch_size" may fail the lookup batch
operation of hash table, because the element in one bucket may be
greater than batch_size and it will return -ENOSPC constantly. And it
seems the original implementation of fetch_and_validate() is buggy,
because for hash map, the returned fetched_count may be less than the
passed count when there are too many elements in the same bucket. I
don't know the reason why the bug doesn't show up.
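For context, __htab_map_lookup_and_delete_batch() in
kernel/bpf/hashtab.c bails out when a whole bucket doesn't fit into the
remaining user buffer, roughly like this (a simplified excerpt, not the
verbatim kernel code):

	if (bucket_cnt > (max_count - total)) {
		/* fail only if nothing has been copied yet; otherwise
		 * return a short count, i.e. fewer elements than the
		 * caller asked for
		 */
		if (total == 0)
			ret = -ENOSPC;
		goto after_loop;
	}

That is why a bucket with more than batch_size elements makes the first
lookup fail with -ENOSPC, and why later batches may legitimately return
fewer elements than requested.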
> err = delete_entries
> ? bpf_map_lookup_and_delete_batch(outer_map_fd,
> total_fetched ? &batch_key : NULL,
> @@ -184,18 +185,19 @@ static void fetch_and_validate(int outer_map_fd,
> }
>
> static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> - enum bpf_map_type inner_map_type)
> + enum bpf_map_type inner_map_type,
> + bool has_holes)
> {
> + __u32 max_entries = OUTER_MAP_ENTRIES - !!has_holes;
> __u32 *outer_map_keys, *inner_map_fds;
> - __u32 max_entries = OUTER_MAP_ENTRIES;
> LIBBPF_OPTS(bpf_map_batch_opts, opts);
> __u32 value_size = sizeof(__u32);
> int batch_size[2] = {5, 10};
> __u32 map_index, op_index;
> int outer_map_fd, ret;
>
> - outer_map_keys = calloc(max_entries, value_size);
> - inner_map_fds = calloc(max_entries, value_size);
> + outer_map_keys = calloc(OUTER_MAP_ENTRIES, value_size);
> + inner_map_fds = calloc(OUTER_MAP_ENTRIES, value_size);
> CHECK((!outer_map_keys || !inner_map_fds),
> "Memory allocation failed for outer_map_keys or inner_map_fds",
> "error=%s\n", strerror(errno));
> @@ -209,6 +211,24 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> ((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> ? 9 : 1000) - map_index;
>
> + /* This condition is only meaningful for array of maps.
> + *
> + * max_entries == OUTER_MAP_ENTRIES - 1 if it is true. Say
> + * max_entries is short for n, then outer_map_keys looks like:
> + *
> + * [n, n-1, ... 2, 1]
> + *
> + * We change it to
> + *
> + * [n, n-1, ... 2, 0]
> + *
> + * So it will leave key 1 as a hole. It will serve to test the
> + * correctness when batch on an array: a "non-exist" key might be
> + * actually allocated and returned from key iteration.
> + */
> + if (has_holes)
> + outer_map_keys[max_entries - 1]--;
> +
> /* batch operation - map_update */
> ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
> inner_map_fds, &max_entries, &opts);
> @@ -219,12 +239,14 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
> /* batch operation - map_lookup */
> for (op_index = 0; op_index < 2; ++op_index)
> fetch_and_validate(outer_map_fd, &opts,
> - batch_size[op_index], false);
> + batch_size[op_index], false,
> + has_holes);
>
> /* batch operation - map_lookup_delete */
> if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
> fetch_and_validate(outer_map_fd, &opts,
> - max_entries, true /*delete*/);
> + max_entries, true /*delete*/,
> + has_holes);
>
> /* close all map fds */
> for (map_index = 0; map_index < max_entries; map_index++)
Should this loop use OUTER_MAP_ENTRIES instead of max_entries?
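All OUTER_MAP_ENTRIES inner map fds were created above, so something
like the following (an untested sketch) would close all of them, not
just the max_entries ones that were batch-updated:

	for (map_index = 0; map_index < OUTER_MAP_ENTRIES; map_index++)
		close(inner_map_fds[map_index]);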
> @@ -237,16 +259,20 @@ static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
>
> void test_map_in_map_batch_ops_array(void)
> {
> - _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
> printf("%s:PASS with inner ARRAY map\n", __func__);
> - _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, false);
> printf("%s:PASS with inner HASH map\n", __func__);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY, true);
> + printf("%s:PASS with inner ARRAY map with holes\n", __func__);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH, true);
> + printf("%s:PASS with inner HASH map with holes\n", __func__);
> }
>
> void test_map_in_map_batch_ops_hash(void)
> {
> - _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY, false);
> printf("%s:PASS with inner ARRAY map\n", __func__);
> - _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
> + _map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH, false);
> printf("%s:PASS with inner HASH map\n", __func__);
> }