linux-kernel - Re: [net-next] bpf: avoid hashtab deadlock with try

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb7e9567-6452-7ccc-d2d5-697eb06ac251@redhat.com>
Date:   Tue, 29 Nov 2022 23:07:13 -0500
From:   Waiman Long <longman@...hat.com>
To:     Tonghao Zhang <xiangxia.m.yue@...il.com>,
        Hou Tao <houtao1@...wei.com>
Cc:     Hou Tao <houtao@...weicloud.com>, Hao Luo <haoluo@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Jiri Olsa <jolsa@...nel.org>, bpf <bpf@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

On 11/29/22 22:32, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <longman@...hat.com> wrote:
>> On 11/29/22 21:47, Tonghao Zhang wrote:
>>> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <houtao@...weicloud.com> wrote:
>>>> Hi Hao,
>>>>
>>>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <boqun.feng@...il.com> wrote:
>>>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>>>> raw_spin_trylock_irqsave().
>>>>>>
>>>>>> Regards,
>>>>>> Boqun
>>>>>>
>>>>> The potential deadlock happens when the lock is contended from the
>>>>> same cpu. When the lock is contended from a remote cpu, we would like
>>>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>>>> this gives better throughput. So replacing the current
>>>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>>>
>>>>> I suspect the source of the problem is the 'hash' that we used in
>>>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>>>> whether we should use a hash derived from 'bucket' rather than from
>>>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>>>> different keys may fall into the same bucket, but yield different
>>>>> hashes. If the same bucket can never have two different 'hashes' here,
>>>>> the map_locked check should behave as intended. Also because
>>>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>>>> both pass.
>>>> The warning from lockdep is due to the reason the bucket lock A is used in a
>>>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
>>> Yes, I tested lockdep too, we can't use the lock in NMI(but only
>>> try_lock work fine) context if we use them no-NMI context. otherwise
>>> the lockdep prints the warning.
>>> * for the dead-lock case: we can use the
>>> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
>>> 2. or hash bucket address.
>>>
>>> * for lockdep warning, we should use in_nmi check with map_locked.
>>>
>>> BTW, the patch doesn't work, so we can remove the lock_key
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>>>
>>> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>                                      struct bucket *b, u32 hash,
>>>                                      unsigned long *pflags)
>>> {
>>>           unsigned long flags;
>>>
>>>           hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>>>
>>>           preempt_disable();
>>>           if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>                   __this_cpu_dec(*(htab->map_locked[hash]));
>>>                   preempt_enable();
>>>                   return -EBUSY;
>>>           }
>>>
>>>           if (in_nmi()) {
>>>                   if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>>>                           return -EBUSY;
>> That is not right. You have to do the same step as above by decrementing
>> the percpu count and enable preemption. So you may want to put all these
>> busy_out steps after the return 0 and use "goto busy_out;" to jump there.
> Yes, thanks Waiman, I should add the busy_out label.
>>>           } else {
>>>                   raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>           }
>>>
>>>           *pflags = flags;
>>>           return 0;
>>> }
>> BTW, with that change, I believe you can actually remove all the percpu
>> map_locked count code.
> there are some case, for example, we run the bpf_prog A B in task
> context on the same cpu.
> bpf_prog A
> update map X
>      htab_lock_bucket
>          raw_spin_lock_irqsave()
>      lookup_elem_raw()
>          // bpf prog B is attached on lookup_elem_raw()
>          bpf prog B
>              update map X again and update the element
>                  htab_lock_bucket()
>                      // dead-lock
>                      raw_spinlock_irqsave()

I see, so nested locking is possible in this case. Beside using the 
percpu map_lock, another way is to have cpumask associated with each 
bucket lock and use each bit in the cpumask for to control access using 
test_and_set_bit() for each cpu. That will allow more concurrency and 
you can actually find out how contended is the lock. Anyway, it is just 
a thought.

Cheers,
Longman