netdev - Re: [External] Re: [PATCH] bpf: avoid grabbing spin

Message-ID: <0f904395-350d-5ee7-152e-93d104742e98@fb.com>
Date:   Thu, 19 May 2022 09:45:21 -0700
From:   Yonghong Song <yhs@...com>
To:     Feng Zhou <zhoufeng.zf@...edance.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
        Xiongchun Duan <duanxiongchun@...edance.com>,
        Muchun Song <songmuchun@...edance.com>,
        Dongdong Wang <wangdongdong.6@...edance.com>,
        Cong Wang <cong.wang@...edance.com>,
        Chengming Zhou <zhouchengming@...edance.com>
Subject: Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus
 when no free elems



On 5/18/22 8:12 PM, Feng Zhou wrote:
> On 2022/5/19 at 4:39 AM, Yonghong Song wrote:
>>
>>
>> On 5/17/22 11:57 PM, Feng Zhou wrote:
>>> On 2022/5/18 at 2:32 PM, Alexei Starovoitov wrote:
>>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou 
>>>> <zhoufeng.zf@...edance.com> wrote:
>>>>> From: Feng Zhou <zhoufeng.zf@...edance.com>
>>>>>
>>>>> We encountered a bad case on a large system with 96 CPUs where
>>>>> alloc_htab_elem() would take 1ms. The reason is that once the
>>>>> preallocated hashtab has no free elems, an update still tries to
>>>>> grab the spin_locks of all CPUs. With multiple concurrent updaters,
>>>>> the contention is severe.
>>>>>
>>>>> So this patch adds is_empty to pcpu_freelist_head to check whether a
>>>>> freelist has free elems. If it does, grab its spin_lock; otherwise
>>>>> check the next CPU's freelist.
>>>>>
>>>>> Before patch: hash_map performance
>>>>> ./map_perf_test 1
>>
>> could you explain what parameter '1' means here?
> 
> The code is here:
> samples/bpf/map_perf_test_user.c
> samples/bpf/map_perf_test_kern.c
> Parameter '1' is the testcase flag that tests hash_map's performance.
> Parameter '2048' tests hash_map's performance when free=0.
> I added testcase flag '2048' myself to reproduce the problem.
> 
>>
>>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>>
>>>>> hash_map worst case: no free elems
>>>>> ./map_perf_test 2048
>>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>>
>>>>> ftrace trace
>>>>>
>>>>> 0)               |  htab_map_update_elem() {
>>>>> 0)   0.198 us    |    migrate_disable();
>>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>>> 0)   0.157 us    |      preempt_count_add();
>>>>> 0)   0.538 us    |    }
>>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>>> 0)               |    alloc_htab_elem() {
>>>>> 0)               |      __pcpu_freelist_pop() {
>>>>> 0)               |        _raw_spin_lock() {
>>>>> 0)   0.152 us    |          preempt_count_add();
>>>>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>>>>> 0)   1.065 us    |        }
>>>>>                   |        ...
>>>>> 0)               |        _raw_spin_unlock() {
>>>>> 0)   0.254 us    |          preempt_count_sub();
>>>>> 0)   0.555 us    |        }
>>>>> 0) + 25.188 us   |      }
>>>>> 0) + 25.486 us   |    }
>>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>>> 0)   0.155 us    |      preempt_count_sub();
>>>>> 0)   0.454 us    |    }
>>>>> 0)   0.148 us    |    migrate_enable();
>>>>> 0) + 28.439 us   |  }
>>>>>
>>>>> The test machine has 16 CPUs and tries to take a spin_lock 17 times:
>>>>> once for each of the 16 per-CPU freelists, plus once for the extralist.
>>>> Is this with small max_entries and a large number of cpus?
>>>>
>>>> If so, a better fix would probably be to artificially
>>>> bump max_entries to 4x num_cpus.
>>>> A racy is_empty check still wastes the loop.
>>>
>>> This hash_map worst-case test uses 16 CPUs, and the map's max_entries
>>> is 1000.
>>>
>>> This is a test case I constructed: it fills the map on purpose and
>>> then keeps updating, just to reproduce the problem.
>>>
>>> The bad case we encountered was with 96 CPUs, and the map's
>>> max_entries is 10240.
>>
>> For such cases, most likely the map is *almost* full. What is the
>> performance if we increase the map size, e.g., from 10240 to 16K (16384)?
> 
> Yes, increasing max_entries can temporarily solve this problem, but once
> the 16K entries are used up, the problem will come back. This patch
> tries to fix this corner case.

Okay, if I understand correctly, in your use case, you have lots of
different keys and your intention is NOT to capture all the keys in
the hash table. So for any given hash table, it is possible that the
table will become full even if you increase its size.

Maybe you will occasionally delete some keys which will free some
space but the space will be quickly occupied by the new updates.

For such cases, yes, checking whether the free list is empty
before taking the lock should be helpful. But I am wondering
what the rationale behind your use case is.
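
To make the idea concrete, here is a minimal userspace sketch of the
"peek before locking" pattern discussed above. It is not the kernel's
pcpu_freelist code: the struct layout, the atomic_flag lock, the
check_empty switch, and the locks counter are all assumptions made for
illustration. It only shows why a full map costs 17 lock acquisitions
on a 16-CPU machine without the check, and none with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 16 /* assumption: matches the 16-CPU test machine */

struct node {
	struct node *next;
};

/* Hypothetical stand-in for a per-CPU freelist head. */
struct freelist_head {
	_Atomic(struct node *) first;
	atomic_flag lock; /* stand-in for a raw spinlock */
};

struct freelist {
	struct freelist_head cpu[NR_CPUS];
	struct freelist_head extralist;
};

static void head_init(struct freelist_head *h)
{
	atomic_init(&h->first, (struct node *)NULL);
	atomic_flag_clear(&h->lock);
}

static void freelist_init(struct freelist *fl)
{
	for (int i = 0; i < NR_CPUS; i++)
		head_init(&fl->cpu[i]);
	head_init(&fl->extralist);
}

/* Pop from one head; *locks counts how many locks were actually taken. */
static struct node *head_pop(struct freelist_head *h, bool check_empty,
			     unsigned long *locks)
{
	struct node *n;

	/* Racy hint only: skip the lock if the list looks empty. */
	if (check_empty &&
	    !atomic_load_explicit(&h->first, memory_order_relaxed))
		return NULL;

	while (atomic_flag_test_and_set_explicit(&h->lock,
						 memory_order_acquire))
		; /* spin */
	(*locks)++;

	/* Re-check under the lock; the hint above may be stale. */
	n = atomic_load_explicit(&h->first, memory_order_relaxed);
	if (n)
		atomic_store_explicit(&h->first, n->next,
				      memory_order_relaxed);

	atomic_flag_clear_explicit(&h->lock, memory_order_release);
	return n;
}

/* Walk all per-CPU lists starting at this_cpu, then the extralist. */
static struct node *freelist_pop(struct freelist *fl, int this_cpu,
				 bool check_empty, unsigned long *locks)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	for (int i = 0; i < NR_CPUS; i++) {
		struct freelist_head *h = &fl->cpu[(this_cpu + i) % NR_CPUS];
		struct node *n = head_pop(h, check_empty, locks);

		if (n)
			return n;
	}
	return head_pop(&fl->extralist, check_empty, locks);
}
```

With every list empty, calling freelist_pop() with check_empty=false
takes all 16 per-CPU locks plus the extralist lock, while
check_empty=true takes none. The check is only a hint, so the list is
still re-read under the lock, and the loop over all CPUs still runs,
which is the cost Alexei pointed out.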