Message-ID: <9c0c3e0b-33bc-51a7-7916-7278f14f308e@fb.com>
Date: Wed, 18 May 2022 13:39:50 -0700
From: Yonghong Song <yhs@...com>
To: Feng Zhou <zhoufeng.zf@...edance.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Network Development <netdev@...r.kernel.org>,
bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
Xiongchun Duan <duanxiongchun@...edance.com>,
Muchun Song <songmuchun@...edance.com>,
Dongdong Wang <wangdongdong.6@...edance.com>,
Cong Wang <cong.wang@...edance.com>,
Chengming Zhou <zhouchengming@...edance.com>
Subject: Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus
when no free elems
On 5/17/22 11:57 PM, Feng Zhou wrote:
> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>> On Tue, May 17, 2022 at 11:27 PM Feng zhou <zhoufeng.zf@...edance.com>
>> wrote:
>>> From: Feng Zhou <zhoufeng.zf@...edance.com>
>>>
>>> We encountered a bad case on a big system with 96 CPUs where
>>> alloc_htab_elem() could take as long as 1ms. The reason is that once
>>> the prealloc hashtab has no free elems, an update will still grab
>>> the spin_locks of all cpus. With multiple concurrent updaters, the
>>> contention is severe.
>>>
>>> So this patch adds an is_empty flag to pcpu_freelist_head so we can
>>> check whether a freelist has free elems without taking its lock. If
>>> it does, grab the spin_lock; otherwise move on and check the next
>>> cpu's freelist.
>>>
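A minimal sketch of the idea (illustrative only, not the patch
verbatim; the struct follows kernel/bpf/percpu_freelist.h, the
is_empty field is the proposed addition, and pop_head() is a
hypothetical helper):

	struct pcpu_freelist_head {
		struct pcpu_freelist_node *first;
		raw_spinlock_t lock;
		bool is_empty;	/* proposed: true once 'first' is NULL */
	};

	static struct pcpu_freelist_node *
	pop_head(struct pcpu_freelist_head *head)
	{
		struct pcpu_freelist_node *node;

		/* Racy but safe: a stale 'true' only makes us retry on
		 * the next cpu; a stale 'false' costs one lock round trip.
		 */
		if (READ_ONCE(head->is_empty))
			return NULL;

		raw_spin_lock(&head->lock);
		node = head->first;
		if (node) {
			head->first = node->next;
			if (!head->first)
				WRITE_ONCE(head->is_empty, true);
		}
		raw_spin_unlock(&head->lock);
		return node;
	}

The push side would correspondingly clear is_empty when it links a
node back in, and __pcpu_freelist_pop() would walk each cpu's head and
finally the extralist, skipping any head whose is_empty reads true
instead of taking its lock.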
>>> Before patch: hash_map performance
>>> ./map_perf_test 1
Could you explain what the parameter '1' means here?
>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>
>>> hash_map worst case: no free elems
>>> ./map_perf_test 2048
>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>
>>> ftrace trace
>>>
>>> 0) | htab_map_update_elem() {
>>> 0) 0.198 us | migrate_disable();
>>> 0) | _raw_spin_lock_irqsave() {
>>> 0) 0.157 us | preempt_count_add();
>>> 0) 0.538 us | }
>>> 0) 0.260 us | lookup_elem_raw();
>>> 0) | alloc_htab_elem() {
>>> 0) | __pcpu_freelist_pop() {
>>> 0) | _raw_spin_lock() {
>>> 0) 0.152 us | preempt_count_add();
>>> 0) 0.352 us | native_queued_spin_lock_slowpath();
>>> 0) 1.065 us | }
>>> | ...
>>> 0) | _raw_spin_unlock() {
>>> 0) 0.254 us | preempt_count_sub();
>>> 0) 0.555 us | }
>>> 0) + 25.188 us | }
>>> 0) + 25.486 us | }
>>> 0) | _raw_spin_unlock_irqrestore() {
>>> 0) 0.155 us | preempt_count_sub();
>>> 0) 0.454 us | }
>>> 0) 0.148 us | migrate_enable();
>>> 0) + 28.439 us | }
>>>
>>> The test machine has 16 CPUs, so a single pop attempt can grab a
>>> spin_lock 17 times: once for each of the 16 per-cpu freelists, plus
>>> once for the extralist.
>> Is this with small max_entries and a large number of cpus?
>>
>> If so, a better fix would probably be to artificially bump
>> max_entries to 4x the number of cpus.
>> A racy is_empty check still wastes the loop.
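One possible reading of that suggestion, as a hedged sketch
(hypothetical helper, not from any posted patch):

	/* Hypothetical: over-provision the prealloc pool at map
	 * creation so a burst of per-cpu updates cannot drain the
	 * freelists down to nothing.
	 */
	static u32 bump_max_entries(u32 max_entries)
	{
		return max_t(u32, max_entries, 4 * num_possible_cpus());
	}

This trades a little memory for fewer empty-freelist scans, but it
does not help once the map is genuinely full, which is the scenario
discussed below.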
>
> This hash_map worst-case test ran with 16 CPUs and the map's
> max_entries set to 1000.
>
> It is a test case I constructed deliberately: it fills the map and
> then keeps updating, just to reproduce the problem.
>
> The bad case we encountered with 96 CPUs had the map's max_entries
> set to 10240.
For such cases, most likely the map is *almost* full. What is the
performance if we increase the map size, e.g., from 10240 to 16K (16384)?