Date:   Tue, 15 Sep 2020 23:09:29 -0700
From:   Yonghong Song <yhs@...com>
To:     Martin KaFai Lau <kafai@...com>
CC:     <bpf@...r.kernel.org>, <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>, <kernel-team@...com>,
        Song Liu <songliubraving@...com>
Subject: Re: [PATCH bpf-next v2] bpf: using rcu_read_lock for
 bpf_sk_storage_map iterator



On 9/15/20 10:37 PM, Martin KaFai Lau wrote:
> On Tue, Sep 15, 2020 at 03:35:07PM -0700, Yonghong Song wrote:
>> If a bucket contains a lot of sockets, then while bpf_iter is
>> traversing that bucket, concurrent userspace bpf_map_update_elem()
>> calls and bpf program bpf_sk_storage_{get,delete}() calls may
>> experience undesirable delays, since they compete with bpf_iter
>> for the bucket lock.
>>
>> Note that the number of buckets for bpf_sk_storage_map
>> is roughly the same as the number of cpus. So if there
>> are lots of sockets in the system, each bucket could
>> contain lots of sockets.
>>
>> Different real use cases may see different delays. Here, using the
>> selftest bpf_iter subtest bpf_sk_storage_map, I instrumented the
>> kernel with ktime_get_mono_fast_ns() to record how long a bucket
>> stayed locked while the bpf_iter prog traversed it. This way, the
>> maximum incurred delay was measured w.r.t. the number of elements
>> in a bucket.
>>      # elems in each bucket          delay(ns)
>>        64                            17000
>>        256                           72512
>>        2048                          875246
>>
>> The potential delays will grow further if a bucket holds even more
>> elements. Using rcu_read_lock() is a reasonable compromise here. It
>> may lose some precision, e.g., access stale sockets, but it will not
>> hurt the performance of bpf programs or user space applications that
>> also try to get/delete/update map elements.
>>
>> Cc: Martin KaFai Lau <kafai@...com>
>> Acked-by: Song Liu <songliubraving@...com>
>> Signed-off-by: Yonghong Song <yhs@...com>
>> ---
>>   net/core/bpf_sk_storage.c | 21 ++++++++-------------
>>   1 file changed, 8 insertions(+), 13 deletions(-)
>>
>> Changelog:
>>   v1 -> v2:
>>     - added some performance numbers. (Song)
>>     - tried to silence some sparse complaints, but some remain, like
>>         context imbalance in "..." - different lock contexts for basic block
>>       where the code is too hard for sparse to analyze. (Jakub)
>>
>> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
>> index 4a86ea34f29e..4fc6b03d3639 100644
>> --- a/net/core/bpf_sk_storage.c
>> +++ b/net/core/bpf_sk_storage.c
>> @@ -678,6 +678,7 @@ struct bpf_iter_seq_sk_storage_map_info {
>>   static struct bpf_local_storage_elem *
>>   bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info,
>>   				 struct bpf_local_storage_elem *prev_selem)
>> +	__acquires(RCU) __releases(RCU)
>>   {
>>   	struct bpf_local_storage *sk_storage;
>>   	struct bpf_local_storage_elem *selem;
>> @@ -701,7 +702,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info,
>>   		if (!selem) {
>>   			/* not found, unlock and go to the next bucket */
>>   			b = &smap->buckets[bucket_id++];
>> -			raw_spin_unlock_bh(&b->lock);
>> +			rcu_read_unlock();
>>   			skip_elems = 0;
>>   			break;
>>   		}
>> @@ -715,7 +716,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info,
>>   
>>   	for (i = bucket_id; i < (1U << smap->bucket_log); i++) {
>>   		b = &smap->buckets[i];
>> -		raw_spin_lock_bh(&b->lock);
>> +		rcu_read_lock();
>>   		count = 0;
>>   		hlist_for_each_entry(selem, &b->list, map_node) {
> hlist_for_each_entry_rcu()

Good catch!
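The distinction, in a kernel-style sketch (illustrative only, not the
actual v3 hunk):

```c
/* Safe only while the bucket lock pins the list in place: */
raw_spin_lock_bh(&b->lock);
hlist_for_each_entry(selem, &b->list, map_node) {
	/* ... */
}
raw_spin_unlock_bh(&b->lock);

/* Under RCU the list may be modified concurrently, so each ->next
 * pointer must be read with rcu_dereference() semantics;
 * hlist_for_each_entry_rcu() does exactly that, pairing with
 * writers that use hlist_add_head_rcu()/hlist_del_rcu():
 */
rcu_read_lock();
hlist_for_each_entry_rcu(selem, &b->list, map_node) {
	/* ... */
}
rcu_read_unlock();
```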

> 
>>   			sk_storage = rcu_dereference_raw(selem->local_storage);
> Does lockdep complain without "_raw"?

It didn't. But I think using rcu_dereference() is better as it provides 
extra checking.
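(Roughly, the extra checking: rcu_dereference() asserts under
lockdep/CONFIG_PROVE_RCU that an RCU read-side critical section is
held, while rcu_dereference_raw() performs the same ordered load but
skips the assertion. A simplified sketch of the definitions; exact
details vary by kernel version:)

```c
/* Simplified; see include/linux/rcupdate.h for the real macros. */
#define rcu_dereference_check(p, c) \
	__rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)

/* rcu_dereference() passes a "0" condition, so lockdep complains
 * unless rcu_read_lock() is actually held:
 */
#define rcu_dereference(p) rcu_dereference_check(p, 0)

/* rcu_dereference_raw() does the same READ_ONCE()-based load but
 * asserts nothing, so a missing rcu_read_lock() goes unnoticed.
 */
```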

Will fix and send v3.
