linux-kernel - Re: [net-next] bpf: avoid hashtab deadlock with try

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8d424223-1da6-60bf-dd2c-cd2fe6d263fe@huaweicloud.com>
Date:   Wed, 30 Nov 2022 12:13:02 +0800
From:   Hou Tao <houtao@...weicloud.com>
To:     Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc:     Hao Luo <haoluo@...gle.com>, Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Jiri Olsa <jolsa@...nel.org>, bpf <bpf@...r.kernel.org>,
        "houtao1@...wei.com" <houtao1@...wei.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock

Hi,

On 11/30/2022 10:47 AM, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <houtao@...weicloud.com> wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <boqun.feng@...il.com> wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a try
>>>> lock pattern. Also after a second thought, the below suggestion doesn't
>>>> work. I think the proper way is to make htab_lock_bucket() as a
>>>> raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu. When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait, instead of giving up immediately. As
>>> this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with trylock sacrifices this performance gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we used in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
>>> whether we should use a hash derived from 'bucket' rather than from
>>> 'key'. For example, from the memory address of the 'bucket'. Because,
>>> different keys may fall into the same bucket, but yield different
>>> hashes. If the same bucket can never have two different 'hashes' here,
>>> the map_locked check should behave as intended. Also because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass.
>> The warning from lockdep is due to the reason the bucket lock A is used in a
>> no-NMI context firstly, then the same bucke lock is used a NMI context, so
> Yes, I tested lockdep too, we can't use the lock in NMI(but only
> try_lock work fine) context if we use them no-NMI context. otherwise
> the lockdep prints the warning.
> * for the dead-lock case: we can use the
> 1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1)
> 2. or hash bucket address.
Use the computed hash will be better than hash bucket address, because the hash
buckets are allocated sequentially.
>
> * for lockdep warning, we should use in_nmi check with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lock_key
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    struct bucket *b, u32 hash,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
>
>         preempt_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
>                 preempt_enable();
>                 return -EBUSY;
>         }
>
>         if (in_nmi()) {
>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                         return -EBUSY;
The only purpose of trylock here is to make lockdep happy and it may lead to
unnecessary -EBUSY error for htab operations in NMI context. I still prefer add
a virtual lock-class for map_locked to fix the lockdep warning. So could you use
separated patches to fix the potential dead-lock and the lockdep warning ? It
will be better you can also add a bpf selftests for deadlock problem as said before.

Thanks,
Tao
>         } else {
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         }
>
>         *pflags = flags;
>         return 0;
> }
>
>
>> lockdep deduces that may be a dead-lock. I have already tried to use the same
>> map_locked for keys with the same bucket, the dead-lock is gone, but still got
>> lockdep warning.
>>> Hao
>>> .
>