lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <85391f68-f5d9-974f-3ce5-26cc486e0dd9@bytedance.com>
Date:   Tue, 19 Oct 2021 13:11:04 +0800
From:   Chengming Zhou <zhouchengming@...edance.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

在 2021/10/19 下午12:43, Alexei Starovoitov 写道:
> On Mon, Oct 18, 2021 at 9:31 PM Chengming Zhou
> <zhouchengming@...edance.com> wrote:
>>
>> 在 2021/10/19 上午11:45, Alexei Starovoitov 写道:
>>> On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
>>> <zhouchengming@...edance.com> wrote:
>>>>
>>>> 在 2021/10/19 上午9:57, Alexei Starovoitov 写道:
>>>>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
>>>>> <zhouchengming@...edance.com> wrote:
>>>>>>
>>>>>> 在 2021/10/16 上午3:58, Alexei Starovoitov 写道:
>>>>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
>>>>>>> <zhouchengming@...edance.com> wrote:
>>>>>>>>
>>>>>>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
>>>>>>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>>>>>>>>
>>>>>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
>>>>>>>> spin_lock for all CPUs to find there is no more elem at last.
>>>>>>>>
>>>>>>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
>>>>>>>> would last for 1ms. This patch use count for prealloc hashtab too,
>>>>>>>> avoid traverse and spin_lock for all CPUs in this case.
>>>>>>>>
>>>>>>>> Signed-off-by: Chengming Zhou <zhouchengming@...edance.com>
>>>>>>>
>>>>>>> It's not clear from the commit log what you're solving.
>>>>>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
>>>>>>> That's why it's not used.
>>>>>>>
>>>>>> Thanks for the explanation, what I'm solving is when hash table hasn't free
>>>>>> elements, we don't need to call __pcpu_freelist_pop() to traverse and
>>>>>> spin_lock all CPUs. The ftrace output of this bad case is below:
>>>>>>
>>>>>>  50)               |  htab_map_update_elem() {
>>>>>>  50)   0.329 us    |    _raw_spin_lock_irqsave();
>>>>>>  50)   0.063 us    |    lookup_elem_raw();
>>>>>>  50)               |    alloc_htab_elem() {
>>>>>>  50)               |      pcpu_freelist_pop() {
>>>>>>  50)   0.209 us    |        _raw_spin_lock();
>>>>>>  50)   0.264 us    |        _raw_spin_lock();
>>>>>
>>>>> This is LRU map. Not hash map.
>>>>> It will grab spin_locks of other cpus
>>>>> only if all previous cpus don't have free elements.
>>>>> Most likely your map is actually full and doesn't have any free elems.
>>>>> Since it's an lru it will force free an elem eventually.
>>>>>
>>>>
>>>> Maybe I missed something, the map_update_elem function of LRU map is
>>>> htab_lru_map_update_elem() and the htab_map_update_elem() above is the
>>>> map_update_elem function of hash map.
>>>> Because of the implementation of percpu freelist used in hash map, it
>>>> will spin_lock all other CPUs when there is no free elements.
>>>
>>> Ahh. Right. Then what's the point of optimizing the error case
>>> at the expense of the fast path?
>>>
>>
>> Yes, this patch only optimized the very bad case that no free elements left,
>> and add atomic operation in the fast path. Maybe the better workaround is not
>> allowing full hash map in our case or using LRU map?
> 
> No idea, since you haven't explained your use case.
> 
>> But the problem of spinlock contention also exist even when the map is not full,
>> like some CPUs run out of its freelist but other CPUs seldom used, then have to
>> grab those CPUs' spinlock to get free element.
> 
> In theory that would be correct. Do you see it in practice?
> Please describe the use case.
> 
>> Should we change the current implementation of percpu freelist to percpu lockless freelist?
> 
> Like llist.h ? That was tried already and for typical hash map usage
> it's slower than percpu free list.
> Many progs still do a lot of hash map update/delete on all cpus at once.
> That is the use case hashmap optimized for.
> Please see commit 6c9059817432 ("bpf: pre-allocate hash map elements")
> that also lists different alternative algorithms that were benchmarked.
> 

Ok, I will figure out our use case, try these alternatives and collect some data first.
Thanks for your explanation.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ