[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAADnVQ+sgpyau=1psyFL9Z-kMmqg8nFju1PxOnTBhz5gzOdgNA@mail.gmail.com>
Date: Wed, 17 Aug 2022 15:17:50 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Sagarika Sharma <sharmasagarika@...gle.com>
Cc: Brian Vazquez <brianvv@...gle.com>,
Sagarika Sharma <sagarikashar@...il.com>,
Andrii Nakryiko <andrii@...nel.org>,
Mykola Lysenko <mykolal@...com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Yonghong Song <yhs@...com>,
Stanislav Fomichev <sdf@...gle.com>,
Luigi Rizzo <lrizzo@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>,
bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 2/2] use prefetch function in bpf_map_lookup_batch()
On Tue, Aug 16, 2022 at 8:35 AM Sagarika Sharma
<sharmasagarika@...gle.com> wrote:
>
> This patch introduces the use of a module parameter n_prefetch
> which enables prefetching within the bpf_map_lookup_batch function
> for a faster lookup. Benefits depend on the platform, relative
> density of the map, and the setting of the module parameter as
> described below.
>
> For multiprocessor machines, for a particular key in a bpf map,
> each cpu has a value associated with that key. This patch’s
> change is as follows: when copying each of these values to
> userspace in bpf_map_lookup_batch, the value for a cpu
> n_prefetch ahead is prefetched.
>
> MEASUREMENTS:
> The benchmark test added in this patch series was used to
> measure the effect of prefetching as well as determine the
> optimal setting of n_prefetch given the different parameters:
> the test was run on many different platforms (with varying
> number of cpus), with a range of settings of n_prefetch, and with
> saturated, dense, and sparse maps (num_entries/capacity_of_map).
> The benchmark test measures the average time for a single entry
> lookup (t = num_entries_looked_up/total_time) given the varied
> factors as mentioned above. The overhead of the
> bpf_map_lookup_batch syscall introduces some error.
>
> Here are the experimental results:
>
> amd machine with 256 cores (rome zen 2)
> Density of map n_prefetch single entry lookup time (ns/op)
> --------------------------------------------------------------------
> 40k / 40k 0 16176.471
> 1 13095.238
> 5 7432.432
> 12 5188.679
> 20 9482.759
>
> 10k / 40k 0 13253.012
> 5 7482.993
> 12 5164.319
> 20 9649.123
>
> 2.5k / 40k 0 7394.958
> 5 7201.309
> 13 4721.030
> 20 8118.081
>
> For denser maps, the experiments suggest that as n_prefetch
> increases, there is a significant time benefit (~66% decrease)
> until a certain point after which the time benefit begins to
> decrease. For sparser maps, there is a less pronounced speedup
> from prefetching. Additionally, this experiment seems to suggest
> the optimal n_prefetch range on this particular machine is 12-13,
> but a setting of n_prefetch = 5 can still improve the single
> entry lookup time.
>
> intel-skylake (with 112 cores)
> Density of map n_prefetch single entry lookup time (ns/op)
> ------------------------------------------------------------------
> 40k / 40k 0 5729.167
> 1 5092.593
> 5 3395.062
> 20 6875.000
>
> 10k / 40k 0 2029.520
> 5 2989.130
> 20 5820.106
>
> 2.5k / 40k 0 1598.256
> 5 2935.290
> 20 4867.257
>
> For this particular machine, the experimental results suggest that
> there is only a significant benefit in prefetching with denser maps.
> Prefetching within bpf_map_lookup_batch can provide significant
> benefit depending on the use case. Across the many different
> platforms experiments were performed on, a setting of n_prefetch = 5,
> although not the optimal setting, significantly decreased the single
> entry lookup time for denser maps.
>
> Signed-off-by: Sagarika Sharma <sharmasagarika@...gle.com>
> ---
> kernel/bpf/hashtab.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 8392f7f8a8ac..eb70c4bbe246 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -15,6 +15,9 @@
> #include "bpf_lru_list.h"
> #include "map_in_map.h"
>
> +static uint n_prefetch;
> +module_param(n_prefetch, uint, 0644);
module_param is no go. sorry.
> +
> #define HTAB_CREATE_FLAG_MASK \
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> @@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> if (is_percpu) {
> int off = 0, cpu;
> void __percpu *pptr;
> + int num_cpus = num_possible_cpus();
>
> pptr = htab_elem_get_ptr(l, map->key_size);
> for_each_possible_cpu(cpu) {
> + if (n_prefetch > 0 && (cpu + n_prefetch) <= num_cpus)
> + prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
> +
prefetch is a decent technique, but doesn't look like it helps
in all cases. Your numbers suggest that it may hurt too.
Whatever you're doing with map lookups you need a different
strategy than micro-optimizing this loop with prefetch.
Have you considering using for_each bpf side helper or user
space side map iterator to aggregate and copy values?
Or iterate over all map elements on one cpu first and
then on other cpus ? Essentially swapping
hlist_nulls_for_each_entry_safe and for_each_possible_cpu ?
Powered by blists - more mailing lists