[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <e0caa616-ea4e-82b9-6fae-6f7318f88ab3@fb.com>
Date: Sat, 1 Aug 2020 22:23:57 -0700
From: Yonghong Song <yhs@...com>
To: Brian Vazquez <brianvv@...gle.com>,
Brian Vazquez <brianvv.kernel@...il.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
"David S . Miller" <davem@...emloft.net>
CC: <linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>,
<bpf@...r.kernel.org>, Luigi Rizzo <lrizzo@...gle.com>
Subject: Re: [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete_batch
faster when map is almost empty
On 8/1/20 11:09 AM, Brian Vazquez wrote:
> While running some experiments it was observed that map_lookup_batch was 2x
> slower than get_next_key + lookup when the syscall overhead is minimal.
> This was because the map_lookup_batch implementation was more expensive
> traversing empty buckets, this can be really costly when the pre-allocated
> map is too big.
>
> This patch optimizes the case when the bucket is empty so we can move
> quickly to next bucket.
>
> The Benchmark was generated using the google/benchmark library[1]. When
> the benckmark is executed the number of iterations is governed by the
> amount of time the benckmarks takes, the number of iterations is at
> least 1 and not more than 1e9, until CPU time(of the entire binary, not
> just the part to measure), is greater than 0.5s. Time and CPU reported
> are the average of a single iteration over the iteration runs.
>
> The experiments to exercise the empty buckets are as follows:
>
> -The map was populated with a single entry to make sure that the syscall
> overhead is not helping the map_batch_lookup.
> -The size of the preallocated map was increased to show the effect of
> traversing empty buckets.
>
> To interpret the results, Benchmark is the name of the experiment where
> the first number correspond to the number of elements in the map, and
> the next one correspond to the size of the pre-allocated map. Time and
> CPU are average and correspond to the time elapsed per iteration and the
> system time consumtion per iteration.
thanks for explanation!
>
> Results:
>
> Using get_next_key + lookup:
>
> Benchmark Time(ns) CPU(ns) Iteration
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 3593 3586 192680
> BM_DumpHashMap/1/4k 6004 5972 100000
> BM_DumpHashMap/1/16k 15755 15710 44341
> BM_DumpHashMap/1/64k 59525 59376 10000
>
> Using htab_lookup_batch before this patch:
> Benchmark Time(ns) CPU(ns) Iterations
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 3933 3927 177978
> BM_DumpHashMap/1/4k 9192 9177 73951
> BM_DumpHashMap/1/16k 42011 41970 16789
> BM_DumpHashMap/1/64k 117895 117661 6135
>
> Using htab_lookup_batch with this patch:
> Benchmark Time(ns) CPU(ns) Iterations
> ---------------------------------------------------------------
> BM_DumpHashMap/1/1k 2809 2803 249212
> BM_DumpHashMap/1/4k 5318 5316 100000
> BM_DumpHashMap/1/16k 14925 14895 47448
> BM_DumpHashMap/1/64k 58870 58674 10000
>
> [1] https://github.com/google/benchmark.git
>
> Changelog:
>
> v1 -> v2:
> - Add more information about how to interpret the results
>
> Suggested-by: Luigi Rizzo <lrizzo@...gle.com>
> Cc: Yonghong Song <yhs@...com>
> Signed-off-by: Brian Vazquez <brianvv@...gle.com>
> ---
> kernel/bpf/hashtab.c | 23 ++++++++---------------
> 1 file changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 024276787055..b6d28bd6345b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1349,7 +1349,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> struct hlist_nulls_head *head;
> struct hlist_nulls_node *n;
> unsigned long flags = 0;
> - bool locked = false;
> struct htab_elem *l;
> struct bucket *b;
> int ret = 0;
> @@ -1408,19 +1407,19 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> dst_val = values;
> b = &htab->buckets[batch];
> head = &b->head;
> - /* do not grab the lock unless need it (bucket_cnt > 0). */
> - if (locked)
> - flags = htab_lock_bucket(htab, b);
>
> + l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
> + struct htab_elem, hash_node);
> + if (!l && (batch + 1 < htab->n_buckets)) {
> + batch++;
> + goto again_nocopy;
> + }
In this case, if batch + 1 == htab->n_buckets, we still go through
htab_lock_bucket/htab_unlock_bucket which is really not needed.
So since we are trying to optimize for performance, let us handle
the above case as well. We can do
if (!l) {
if (batch + 1 < htab->n_buckets) {
batch++;
goto again_nocopy;
}
bucket_cnt = 0;
goto done_bucket;
}
...
done_bucket:
rcu_read_unlock();
bpf_enable_instrumentation();
...
what do you think?
> +
> + flags = htab_lock_bucket(htab, b);
> bucket_cnt = 0;
> hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
> bucket_cnt++;
>
> - if (bucket_cnt && !locked) {
> - locked = true;
> - goto again_nocopy;
> - }
> -
> if (bucket_cnt > (max_count - total)) {
> if (total == 0)
> ret = -ENOSPC;
> @@ -1446,10 +1445,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> goto alloc;
> }
>
> - /* Next block is only safe to run if you have grabbed the lock */
> - if (!locked)
> - goto next_batch;
> -
> hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> memcpy(dst_key, l->key, key_size);
>
> @@ -1492,7 +1487,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> }
>
> htab_unlock_bucket(htab, b, flags);
> - locked = false;
>
> while (node_to_free) {
> l = node_to_free;
> @@ -1500,7 +1494,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
> bpf_lru_push_free(&htab->lru, &l->lru_node);
> }
>
> -next_batch:
> /* If we are not copying data, we can go to next bucket and avoid
> * unlocking the rcu.
> */
>
Powered by blists - more mailing lists