
Message-ID: <380fa11e-f15d-da1a-51f7-70e14ed58ffc@bytedance.com>
Date:   Thu, 19 May 2022 11:12:48 +0800
From:   Feng Zhou <zhoufeng.zf@...edance.com>
To:     Yonghong Song <yhs@...com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
        Xiongchun Duan <duanxiongchun@...edance.com>,
        Muchun Song <songmuchun@...edance.com>,
        Dongdong Wang <wangdongdong.6@...edance.com>,
        Cong Wang <cong.wang@...edance.com>,
        Chengming Zhou <zhouchengming@...edance.com>
Subject: Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus
 when no free elems

On 2022/5/19 4:39 AM, Yonghong Song wrote:
>
>
> On 5/17/22 11:57 PM, Feng Zhou wrote:
>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>> On Tue, May 17, 2022 at 11:27 PM Feng Zhou <zhoufeng.zf@...edance.com> wrote:
>>>> From: Feng Zhou <zhoufeng.zf@...edance.com>
>>>>
>>>> We hit a bad case on a large system with 96 CPUs where
>>>> alloc_htab_elem() could take 1 ms. The reason is that once the
>>>> preallocated hashtab has no free elems, every update still grabs
>>>> the spin_locks of all CPUs. With multiple concurrent updaters, the
>>>> contention is severe.
>>>>
>>>> So this patch adds is_empty to pcpu_freelist_head to record whether
>>>> a freelist has free elems. If it does, grab the spin_lock;
>>>> otherwise move on to the next CPU's freelist.
>>>>
>>>> Before patch: hash_map performance
>>>> ./map_perf_test 1
>
> could you explain what parameter '1' means here?

The code is here:
samples/bpf/map_perf_test_user.c
samples/bpf/map_perf_test_kern.c
Parameter '1' is a testcase flag that selects the hash_map performance test.
Parameter '2048' selects a hash_map performance test with no free elems (free=0).
I added testcase flag '2048' myself to reproduce the problem.

>
>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>
>>>> hash_map the worst: no free
>>>> ./map_perf_test 2048
>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>
>>>> ftrace trace
>>>>
>>>> 0)               |  htab_map_update_elem() {
>>>> 0)   0.198 us    |    migrate_disable();
>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>> 0)   0.157 us    |      preempt_count_add();
>>>> 0)   0.538 us    |    }
>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>> 0)               |    alloc_htab_elem() {
>>>> 0)               |      __pcpu_freelist_pop() {
>>>> 0)               |        _raw_spin_lock() {
>>>> 0)   0.152 us    |          preempt_count_add();
>>>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>>>> 0)   1.065 us    |        }
>>>>                   |        ...
>>>> 0)               |        _raw_spin_unlock() {
>>>> 0)   0.254 us    |          preempt_count_sub();
>>>> 0)   0.555 us    |        }
>>>> 0) + 25.188 us   |      }
>>>> 0) + 25.486 us   |    }
>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>> 0)   0.155 us    |      preempt_count_sub();
>>>> 0)   0.454 us    |    }
>>>> 0)   0.148 us    |    migrate_enable();
>>>> 0) + 28.439 us   |  }
>>>>
>>>> The test machine has 16 CPUs, so the pop path tries to take a
>>>> spin_lock 17 times: once per CPU, plus once for the extralist.
>>> Is this with small max_entries and a large number of cpus?
>>>
>>> If so, a better fix would probably be to artificially
>>> bump max_entries to be 4x of num_cpus.
>>> Racy is_empty check still wastes the loop.
>>
>> This is the hash_map worst-case testcase with 16 CPUs; the map's
>> max_entries is 1000.
>>
>> I constructed this test case to fill the map on purpose and then
>> keep updating it, just to reproduce the problem.
>>
>> The bad case we encountered was with 96 CPUs and a max_entries of 10240.
>
> For such cases, most likely the map is *almost* full. What is the
> performance if we increase map size, e.g., from 10240 to 16K (16384)?

Yes, increasing max_entries can work around this problem temporarily, but
once the 16K entries are used up, the same problem comes back. This patch
tries to fix this corner case.
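
To make the idea concrete, here is a minimal userspace sketch of the is_empty check. It is not the kernel code: pthread mutexes stand in for the raw spinlocks, C11 atomics stand in for READ_ONCE/WRITE_ONCE, the extralist is left out, and every name in it (NR_LISTS, list_head_sketch, lists_init, list_push, list_pop, lock_count) is made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_LISTS 4	/* hypothetical: stands in for the number of CPUs */

struct node {
	struct node *next;
};

struct list_head_sketch {
	struct node *first;
	pthread_mutex_t lock;
	atomic_bool is_empty;	/* hint read without the lock, may be stale */
};

static struct list_head_sketch lists[NR_LISTS];
static atomic_ulong lock_count;	/* lock acquisitions, for demonstration only */

static void lists_init(void)
{
	for (int i = 0; i < NR_LISTS; i++) {
		lists[i].first = NULL;
		pthread_mutex_init(&lists[i].lock, NULL);
		atomic_init(&lists[i].is_empty, true);
	}
	atomic_store(&lock_count, 0);
}

static void list_push(int idx, struct node *n)
{
	struct list_head_sketch *h = &lists[idx];

	assert(n != NULL);
	pthread_mutex_lock(&h->lock);
	atomic_fetch_add(&lock_count, 1);
	n->next = h->first;
	h->first = n;
	/* Updated under the lock, so the hint follows the list state. */
	atomic_store_explicit(&h->is_empty, false, memory_order_relaxed);
	pthread_mutex_unlock(&h->lock);
}

/* Try each list once, starting at 'start'; return NULL if all look empty. */
static struct node *list_pop(int start)
{
	for (int i = 0; i < NR_LISTS; i++) {
		struct list_head_sketch *h = &lists[(start + i) % NR_LISTS];
		struct node *n;

		/* Racy check: skip the lock when the list looks empty. */
		if (atomic_load_explicit(&h->is_empty, memory_order_relaxed))
			continue;

		pthread_mutex_lock(&h->lock);
		atomic_fetch_add(&lock_count, 1);
		n = h->first;	/* re-check under the lock, the hint may be stale */
		if (n) {
			h->first = n->next;
			if (!h->first)
				atomic_store_explicit(&h->is_empty, true,
						      memory_order_relaxed);
		}
		pthread_mutex_unlock(&h->lock);
		if (n)
			return n;
	}
	return NULL;
}
```

In this sketch, a pop against fully drained lists takes no locks at all; it only walks the loop reading the flags. Because the flag is read outside the lock, a pop can still miss an element that is pushed concurrently, which is the racy behavior raised in the review.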