netdev - Re: [PATCH bpf-next 2/2] use prefetch function in bpf_map_lookup

Message-ID: <CAADnVQ+sgpyau=1psyFL9Z-kMmqg8nFju1PxOnTBhz5gzOdgNA@mail.gmail.com>
Date:   Wed, 17 Aug 2022 15:17:50 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Sagarika Sharma <sharmasagarika@...gle.com>
Cc:     Brian Vazquez <brianvv@...gle.com>,
        Sagarika Sharma <sagarikashar@...il.com>,
        Andrii Nakryiko <andrii@...nel.org>,
        Mykola Lysenko <mykolal@...com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Yonghong Song <yhs@...com>,
        Stanislav Fomichev <sdf@...gle.com>,
        Luigi Rizzo <lrizzo@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 2/2] use prefetch function in bpf_map_lookup_batch()

On Tue, Aug 16, 2022 at 8:35 AM Sagarika Sharma
<sharmasagarika@...gle.com> wrote:
>
> This patch introduces the use of a module parameter n_prefetch
> which enables prefetching within the bpf_map_lookup_batch function
> for a faster lookup. Benefits depend on the platform, relative
> density of the map, and the setting of the module parameter as
> described below.
>
> For multiprocessor machines, for a particular key in a bpf map,
> each cpu has a value associated with that key. This patch’s
> change is as follows: when copying each of these values to
> userspace in bpf_map_lookup_batch, the value for a cpu
> n_prefetch ahead is prefetched.
>
> MEASUREMENTS:
> The benchmark test added in this patch series was used to
> measure the effect of prefetching as well as determine the
> optimal setting of n_prefetch given the different parameters:
> the test was run on many different platforms (with varying
> number of cpus), with a range of settings of n_prefetch, and with
> saturated, dense, and sparse maps (num_entries/capacity_of_map).
> The benchmark test measures the average time for a single entry
> lookup (t = total_time/num_entries_looked_up) given the varied
> factors as mentioned above. The overhead of the
> bpf_map_lookup_batch syscall introduces some error.
>
> Here are the experimental results:
>
> amd machine with 256 cores (rome zen 2)
> Density of map  n_prefetch      single entry lookup time (ns/op)
> --------------------------------------------------------------------
> 40k / 40k       0               16176.471
>                 1               13095.238
>                 5               7432.432
>                 12              5188.679
>                 20              9482.759
>
> 10k / 40k       0               13253.012
>                 5               7482.993
>                 12              5164.319
>                 20              9649.123
>
> 2.5k / 40k      0               7394.958
>                 5               7201.309
>                 13              4721.030
>                 20              8118.081
>
> For denser maps, the experiments suggest that as n_prefetch
> increases, there is a significant time benefit (~66% decrease)
> until a certain point after which the time benefit begins to
> decrease. For sparser maps, there is a less pronounced speedup
> from prefetching. Additionally, this experiment seems to suggest
> the optimal n_prefetch range on this particular machine is 12-13,
> but a setting of n_prefetch = 5 can still improve the single
> entry lookup time.
>
> intel-skylake (with 112 cores)
> Density of map  n_prefetch      single entry lookup time (ns/op)
> ------------------------------------------------------------------
> 40k / 40k       0               5729.167
>                 1               5092.593
>                 5               3395.062
>                 20              6875.000
>
> 10k / 40k       0               2029.520
>                 5               2989.130
>                 20              5820.106
>
> 2.5k / 40k      0               1598.256
>                 5               2935.290
>                 20              4867.257
>
> For this particular machine, the experimental results suggest that
> there is only a significant benefit in prefetching with denser maps.
> Prefetching within bpf_map_lookup_batch can provide significant
> benefit depending on the use case. Across the many different
> platforms experiments were performed on, a setting of n_prefetch = 5,
> although not the optimal setting, significantly decreased the single
> entry lookup time for denser maps.
>
> Signed-off-by: Sagarika Sharma <sharmasagarika@...gle.com>
> ---
>  kernel/bpf/hashtab.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 8392f7f8a8ac..eb70c4bbe246 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -15,6 +15,9 @@
>  #include "bpf_lru_list.h"
>  #include "map_in_map.h"
>
> +static uint n_prefetch;
> +module_param(n_prefetch, uint, 0644);

module_param is a no-go, sorry.

> +
>  #define HTAB_CREATE_FLAG_MASK                                          \
>         (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |    \
>          BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
> @@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>                 if (is_percpu) {
>                         int off = 0, cpu;
>                         void __percpu *pptr;
> +                       int num_cpus = num_possible_cpus();
>
>                         pptr = htab_elem_get_ptr(l, map->key_size);
>                         for_each_possible_cpu(cpu) {
> +                               if (n_prefetch > 0 && (cpu + n_prefetch) <= num_cpus)
> +                                       prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
> +

prefetch is a decent technique, but it doesn't look like it helps
in all cases. Your numbers suggest that it may hurt too: on the
Skylake machine, n_prefetch = 5 is slower than n_prefetch = 0 for
the 10k / 40k and 2.5k / 40k maps.
Whatever you're doing with map lookups, you need a different
strategy than micro-optimizing this loop with prefetch.
Have you considered using a for_each bpf side helper or a user
space side map iterator to aggregate and copy values?
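
To make the aggregation suggestion concrete, here is a minimal userspace model in plain C. It is not BPF code and is not taken from the thread: aggregate_percpu() and bytes_per_entry() are hypothetical names, and the sketch assumes the per-cpu values are counters that can simply be summed. The rounding of the value size to 8 bytes mirrors how per-cpu map values are laid out when they are copied to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold the per-cpu copies of one counter into a
 * single value. Assumes summing is the aggregation the caller wants.
 */
static uint64_t aggregate_percpu(const uint64_t *percpu_vals, size_t ncpus)
{
	uint64_t sum = 0;

	assert(percpu_vals != NULL);
	for (size_t cpu = 0; cpu < ncpus; cpu++)
		sum += percpu_vals[cpu];
	return sum;
}

/*
 * Hypothetical helper: bytes copied out per map entry. Without
 * aggregation every possible cpu contributes one slot, each rounded up
 * to 8 bytes; with aggregation a single rounded slot is copied.
 */
static size_t bytes_per_entry(size_t value_size, size_t ncpus, int aggregated)
{
	size_t rounded = (value_size + 7) & ~(size_t)7;

	return aggregated ? rounded : rounded * ncpus;
}
```

With 256 possible cpus and an 8-byte counter, bytes_per_entry(8, 256, 0) is 2048 while bytes_per_entry(8, 256, 1) is 8, so in this model aggregating before the copy removes most of the per-entry traffic that the prefetch was trying to hide.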

Alternatively, iterate over all map elements on one cpu first and
then on the other cpus? Essentially swapping
hlist_nulls_for_each_entry_safe and for_each_possible_cpu?
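
The loop swap can be sketched with a small userspace model. This is an illustration under stated assumptions, not kernel code: the flat percpu array stands in for per_cpu_ptr(pptr, cpu) of each element, values are modelled as single uint64_t slots, and copy_elem_major() and copy_cpu_major() are hypothetical names. The output layout (all cpus of element 0, then all cpus of element 1, and so on) is kept identical in both variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model layout (assumption): percpu[cpu * nelems + e] is the value of
 * element e on the given cpu, so each cpu's values are contiguous.
 * Output layout: out[e * ncpus + cpu].
 */

/* Element-major order, as in the loop under review: the inner loop
 * jumps to a different cpu's memory on every iteration. */
static void copy_elem_major(const uint64_t *percpu, size_t ncpus,
			    size_t nelems, uint64_t *out)
{
	assert(percpu != NULL && out != NULL);
	for (size_t e = 0; e < nelems; e++)
		for (size_t cpu = 0; cpu < ncpus; cpu++)
			out[e * ncpus + cpu] = percpu[cpu * nelems + e];
}

/* Cpu-major order, the swapped loops: all elements are read from one
 * cpu's memory before moving on to the next cpu. */
static void copy_cpu_major(const uint64_t *percpu, size_t ncpus,
			   size_t nelems, uint64_t *out)
{
	assert(percpu != NULL && out != NULL);
	for (size_t cpu = 0; cpu < ncpus; cpu++)
		for (size_t e = 0; e < nelems; e++)
			out[e * ncpus + cpu] = percpu[cpu * nelems + e];
}
```

Both functions fill the same output buffer; only the order in which source memory is touched differs. The cpu-major variant reads sequentially but writes with a stride of ncpus, so whether it is actually faster for a real map would have to be measured.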