Date:   Tue, 5 Feb 2019 19:21:08 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Waiman Long <longman@...hat.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Alexei Starovoitov <ast@...nel.org>, davem@...emloft.net,
        daniel@...earbox.net, edumazet@...gle.com, jannh@...gle.com,
        netdev@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap



On 01/30/2019 06:48 PM, Waiman Long wrote:
> On 01/30/2019 09:01 PM, Alexei Starovoitov wrote:
>> On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
>>> On 01/30/2019 04:11 PM, Waiman Long wrote:
>>>> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>>>>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>>>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>>>>> Lockdep warns about false positive:
>>>>>>>> This is not a false positive, and you probably also need to use
>>>>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>>>>
>>>>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>>>>> relies on 'owner' tracking.
>>>>>>> Can you point out the spin bit in the code?
>>>>>>> As far as I can see, sem->owner is a debug-only feature.
>>>>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>>>>> No, sem->owner is mainly for performing optimistic spinning, a
>>>>>> performance feature that makes the rwsem writer lock perform
>>>>>> similarly to a mutex. The debugging part is just an add-on; it is
>>>>>> not the reason for the presence of sem->owner.
>>>>> I see. Got it.
>>>>>
>>>>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>>>>> with this patch today.
>>>>>>>
>>>>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>>>>> should be easy to do.
>>>>> Yes, but looking through the code it's not clear to me that it's safe
>>>>> to mix non_owner() versions with regular.
>>>>> bpf/stackmap.c does down_read_trylock + up_read.
>>>>> If we add a new down_read_trylock_non_owner() that sets the owner to
>>>>> NULL | RWSEM_* bits, is this safe with concurrent read/write
>>>>> that use the regular versions?
>>>>> rwsem_can_spin_on_owner() does:
>>>>>         if (owner) {
>>>>>                 ret = is_rwsem_owner_spinnable(owner) &&
>>>>>                       owner_on_cpu(owner);
>>>>> that looks correct.
>>>>> For a second I thought there could be a fault here due to non_owner.
>>>>> But could there be other places where it is assumed that owner
>>>>> is never NULL?
>>>> The content of owner is not the cause of the lockdep warning. The
>>>> lockdep code assumes that the task that acquires the lock will release
>>>> it some time later. That is not the case when the lock is acquired by
>>>> one task and released by another. In that case, you have to use the
>>>> non_owner versions of down/up_read, which disable the lockdep
>>>> acquire/release tracking. That will be the only difference between the
>>>> two sets of APIs.
>>>>
>>>> Cheers,
>>>> Longman
>>> BTW, you may want to do something like this to make sure that the lock
>>> ownership is properly tracked.
>>>
>>> -Longman
>>>
>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>> index d43b145..79eef9d 100644
>>> --- a/kernel/bpf/stackmap.c
>>> +++ b/kernel/bpf/stackmap.c
>>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_
>>>         } else {
>>>                 work->sem = &current->mm->mmap_sem;
>>>                 irq_work_queue(&work->irq_work);
>>> +
>>> +               /*
>>> +                * The irq_work will release the mmap_sem with
>>> +                * up_read_non_owner(). The rwsem_release() is called
>>> +                * here to release the lock from lockdep's perspective.
>>> +                */
>>> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
>> are you saying the above is enough coupled with up_read_non_owner?
>> Looking at how amdgpu is using this api... I think they just use the
>> non_owner version when doing things from a different task.
>> So I don't think pairing non_owner with non_owner is strictly necessary.
>> It seems fine to use down_read_trylock() with the above rwsem_release()
>> hack plus up_read_non_owner().
>> Am I missing something?
>>
> The statement above clears the state for lockdep so that it knows the
> task no longer owns the lock. Without doing that, some other kind of
> incorrect lockdep warning may be produced, because lockdep will still
> think the task holds a read lock on the mmap_sem. It is also possible
> that no warning will be reported.
> 
> The bpf code is special in that it acquires the mmap_sem and later
> either releases it itself (non-NMI) or requests irq_work to release it
> (NMI). So you either use the _non_owner versions for both acquire and
> release, or fake the release with a code segment like the one above.
> 
> Peter, please chime in if you have other suggestion.
> 

Hi guys

What are the plans to address the issue, then?

The latest net tree exhibits this trace:

Thanks

[  546.116982] =====================================
[  546.121688] WARNING: bad unlock balance detected!
[  546.126393] 5.0.0-dbg-DEV #559 Not tainted
[  546.130491] -------------------------------------
[  546.135196] taskset/15409 is trying to release lock (&mm->mmap_sem) at:
[  546.141844] [<ffffffffb0233246>] do_up_read+0x16/0x30
[  546.146919] but there are no more locks to release!
[  546.151790] 
               other info that might help us debug this:
[  546.158325] 1 lock held by taskset/15409:
[  546.162325]  #0: 00000000683db857 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.38+0x13e/0xbc0
[  546.171978] 
               stack backtrace:
[  546.176327] CPU: 0 PID: 15409 Comm: taskset Not tainted 5.0.0-dbg-DEV #559
[  546.183214] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.54.0 06/07/2018
[  546.190182] Call Trace:
[  546.192633]  <IRQ>
[  546.194672]  dump_stack+0x67/0x95
[  546.198006]  ? do_up_read+0x16/0x30
[  546.201491]  print_unlock_imbalance_bug.part.33+0xd0/0xd7
[  546.206914]  ? do_up_read+0x16/0x30
[  546.210406]  lock_release+0x213/0x2d0
[  546.214088]  up_read+0x20/0xa0
[  546.217138]  do_up_read+0x16/0x30
[  546.220457]  irq_work_run_list+0x4a/0x70
[  546.224408]  irq_work_run+0x18/0x40
[  546.227911]  smp_irq_work_interrupt+0x54/0x1d0
[  546.232362]  irq_work_interrupt+0xf/0x20
[  546.236280]  </IRQ>
[  546.238396] RIP: 0010:release_pages+0x69/0x650
[  546.242839] Code: 01 4c 8b 2f 4c 8d 67 08 48 8d 44 f7 08 45 31 f6 48 89 04 24 eb 04 49 83 c4 08 48 8b 05 20 fe 31 01 49 39 c5 74 26 4d 8b 7d 08 <4d> 8d 47 ff 41 83 e7 01 4d 0f 44 c5 41 8b 40 34 4d 89 c7 85 c0 0f
[  546.261615] RSP: 0018:ffffac604732bb00 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff09
[  546.269190] RAX: ffffe4c9c0ca8000 RBX: 0000000000000020 RCX: 0000000000000006
[  546.276352] RDX: 0000000000000006 RSI: ffff8cf7facb6aa8 RDI: ffff8cf7facb6240
[  546.283482] RBP: ffffac604732bb70 R08: ffffe4c9c0ba3d80 R09: 0000000000000000
[  546.290613] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8cf7db3292d8
[  546.297752] R13: ffffe4c9c0ba3dc0 R14: 0000000000000000 R15: ffffe4c9c0ba3d88
[  546.304907]  free_pages_and_swap_cache+0xe6/0x100
[  546.309618]  tlb_flush_mmu_free+0x36/0x60
[  546.313625]  arch_tlb_finish_mmu+0x28/0x1e0
[  546.317801]  tlb_finish_mmu+0x23/0x30
[  546.321459]  exit_mmap+0xc9/0x1f0
[  546.324787]  ? __mutex_unlock_slowpath+0x11/0x2e0
[  546.329502]  mmput+0x62/0x130
[  546.332481]  flush_old_exec+0x60e/0x810
[  546.336314]  load_elf_binary+0x830/0x18c0
[  546.340323]  ? search_binary_handler+0x8a/0x200
[  546.344848]  search_binary_handler+0x99/0x200
[  546.349221]  __do_execve_file.isra.38+0x76d/0xbc0
[  546.353918]  __x64_sys_execve+0x39/0x50
[  546.357757]  do_syscall_64+0x5a/0x460
[  546.361440]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  546.366489] RIP: 0033:0x7fd599e3c067
[  546.370069] Code: Bad RIP value.
[  546.373306] RSP: 002b:00007ffcf7092e58 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[  546.380880] RAX: ffffffffffffffda RBX: 00007ffcf7093058 RCX: 00007fd599e3c067
[  546.388010] RDX: 00007ffcf7093070 RSI: 00007ffcf7093058 RDI: 00007ffcf70937c0
[  546.395132] RBP: 00007ffcf7092ec0 R08: 00000000ffffffff R09: 0000000001b1ae40
[  546.402265] R10: 00007ffcf7092c80 R11: 0000000000000202 R12: 00007ffcf70937c0
[  546.409385] R13: 00007ffcf7093070 R14: 0000000000000000 R15: 00007ffcf7093048
