Message-ID: <510d3bb7-5ec0-e281-aabc-a3d379475d71@gmail.com>
Date: Tue, 5 Feb 2019 19:21:08 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Waiman Long <longman@...hat.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Alexei Starovoitov <ast@...nel.org>, davem@...emloft.net,
daniel@...earbox.net, edumazet@...gle.com, jannh@...gle.com,
netdev@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap
On 01/30/2019 06:48 PM, Waiman Long wrote:
> On 01/30/2019 09:01 PM, Alexei Starovoitov wrote:
>> On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
>>> On 01/30/2019 04:11 PM, Waiman Long wrote:
>>>> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>>>>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>>>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>>>>> Lockdep warns about false positive:
>>>>>>>> This is not a false positive, and you probably also need to use
>>>>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>>>>
>>>>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>>>>> relies on 'owner' tracking.
>>>>>>> Can you point out the spin bit in the code?
>>>>>>> As far as I can see, sem->owner is a debug-only feature.
>>>>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>>>>> No, sem->owner is mainly for performing optimistic spinning, which is a
>>>>>> performance feature that makes the rwsem writer lock perform similarly to
>>>>>> a mutex. The debugging part is just an add-on; it is not the reason for
>>>>>> the presence of sem->owner.
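>>>>>>
>>>>>> As a rough illustration (a simplified sketch only, reusing the helper
>>>>>> names from the rwsem code quoted further down, not verbatim kernel code;
>>>>>> rwsem_should_spin() is a made-up name), a contending writer peeks at
>>>>>> sem->owner and keeps spinning while that task is still running on a CPU
>>>>>> instead of going to sleep right away:
>>>>>>
>>>>>> static bool rwsem_should_spin(struct rw_semaphore *sem)
>>>>>> {
>>>>>> 	/* spin only while the current owner is spinnable and on-CPU */
>>>>>> 	struct task_struct *owner = READ_ONCE(sem->owner);
>>>>>>
>>>>>> 	return owner && is_rwsem_owner_spinnable(owner) &&
>>>>>> 	       owner_on_cpu(owner);
>>>>>> }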
>>>>> I see. Got it.
>>>>>
>>>>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>>>>> with this patch today.
>>>>>>>
>>>>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>>>>> should be easy to do.
>>>>> Yes, but looking through the code it's not clear to me that it's safe
>>>>> to mix the non_owner() versions with the regular ones.
>>>>> bpf/stackmap.c does down_read_trylock + up_read.
>>>>> If we add a new down_read_trylock_non_owner() that sets the owner to
>>>>> NULL | RWSEM_* bits, is this safe with concurrent read/write
>>>>> that use the regular versions?
>>>>> rwsem_can_spin_on_owner() does:
>>>>> 	if (owner) {
>>>>> 		ret = is_rwsem_owner_spinnable(owner) &&
>>>>> 		      owner_on_cpu(owner);
>>>>> that looks correct.
>>>>> For a second I thought there could be a fault here due to non_owner.
>>>>> But could there be other places where it's assumed that owner
>>>>> is never NULL?
>>>> The content of owner is not the cause of the lockdep warning. The
>>>> lockdep code assumes that the task that acquires the lock will release
>>>> it some time later. That is not the case when the lock is acquired by
>>>> one task and released by another. In that case, you have to use the
>>>> non_owner versions of down/up_read, which disable the lockdep
>>>> acquire/release tracking. That will be the only difference between the
>>>> two sets of APIs.
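>>>>
>>>> Roughly (a simplified sketch of the 5.0-era kernel/locking/rwsem.c
>>>> wrappers, leaving out the CONFIG_DEBUG_RWSEMS checks), the only
>>>> difference is the lockdep annotation:
>>>>
>>>> void up_read(struct rw_semaphore *sem)
>>>> {
>>>> 	rwsem_release(&sem->dep_map, 1, _RET_IP_);	/* tell lockdep */
>>>> 	__up_read(sem);
>>>> }
>>>>
>>>> void up_read_non_owner(struct rw_semaphore *sem)
>>>> {
>>>> 	__up_read(sem);		/* no lockdep release annotation */
>>>> }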
>>>>
>>>> Cheers,
>>>> Longman
>>> BTW, you may want to do something like this to make sure that the lock
>>> ownership is properly tracked.
>>>
>>> -Longman
>>>
>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>> index d43b145..79eef9d 100644
>>> --- a/kernel/bpf/stackmap.c
>>> +++ b/kernel/bpf/stackmap.c
>>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_
>>>  	} else {
>>>  		work->sem = &current->mm->mmap_sem;
>>>  		irq_work_queue(&work->irq_work);
>>> +
>>> +		/*
>>> +		 * The irq_work will release the mmap_sem with
>>> +		 * up_read_non_owner(). The rwsem_release() is called
>>> +		 * here to release the lock from lockdep's perspective.
>>> +		 */
>>> +		rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
>> Are you saying the above is enough, coupled with up_read_non_owner()?
>> Looking at how amdgpu is using this API... I think they just use the
>> non_owner version when doing things from a different task.
>> So I don't think pairing non_owner with non_owner is strictly necessary.
>> It seems fine to use down_read_trylock() with the above rwsem_release()
>> hack plus up_read_non_owner().
>> Am I missing something?
>>
> The statement above is to clear the state for lockdep so that it
> knows that the task no longer owns the lock. Without doing that, there
> is a possibility that some other kind of incorrect lockdep warning may
> be produced, because lockdep will still think the task holds a read
> lock on the mmap_sem. It is also possible that no warning will be
> reported.
>
> The bpf code is kind of special in that it acquires the mmap_sem and
> later either releases it itself (non-NMI) or requests irq_work to
> release it (NMI). So either you use the _non_owner versions for both
> acquire and release, or you fake the release like the code segment
> above does; a sketch of the first option follows.
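>
> For the first option, the stackmap code would end up looking roughly
> like this (a sketch only: down_read_trylock_non_owner() does not exist
> yet, as Alexei noted, and bpf_stackmap_in_nmi_path is a made-up name
> standing in for the existing NMI/irq_work check):
>
> 	if (!down_read_trylock_non_owner(&current->mm->mmap_sem))
> 		return;	/* no lockdep acquire annotation recorded */
>
> 	/* ... walk the stack and look up build ids ... */
>
> 	if (bpf_stackmap_in_nmi_path) {
> 		/* defer; the irq_work handler calls up_read_non_owner() */
> 		work->sem = &current->mm->mmap_sem;
> 		irq_work_queue(&work->irq_work);
> 	} else {
> 		up_read_non_owner(&current->mm->mmap_sem);
> 	}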
>
> Peter, please chime in if you have other suggestion.
>
Hi guys
What are the plans to address the issue, then?
The latest net tree exhibits this trace:
Thanks
[ 546.116982] =====================================
[ 546.121688] WARNING: bad unlock balance detected!
[ 546.126393] 5.0.0-dbg-DEV #559 Not tainted
[ 546.130491] -------------------------------------
[ 546.135196] taskset/15409 is trying to release lock (&mm->mmap_sem) at:
[ 546.141844] [<ffffffffb0233246>] do_up_read+0x16/0x30
[ 546.146919] but there are no more locks to release!
[ 546.151790]
other info that might help us debug this:
[ 546.158325] 1 lock held by taskset/15409:
[ 546.162325] #0: 00000000683db857 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.38+0x13e/0xbc0
[ 546.171978]
stack backtrace:
[ 546.176327] CPU: 0 PID: 15409 Comm: taskset Not tainted 5.0.0-dbg-DEV #559
[ 546.183214] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.54.0 06/07/2018
[ 546.190182] Call Trace:
[ 546.192633] <IRQ>
[ 546.194672] dump_stack+0x67/0x95
[ 546.198006] ? do_up_read+0x16/0x30
[ 546.201491] print_unlock_imbalance_bug.part.33+0xd0/0xd7
[ 546.206914] ? do_up_read+0x16/0x30
[ 546.210406] lock_release+0x213/0x2d0
[ 546.214088] up_read+0x20/0xa0
[ 546.217138] do_up_read+0x16/0x30
[ 546.220457] irq_work_run_list+0x4a/0x70
[ 546.224408] irq_work_run+0x18/0x40
[ 546.227911] smp_irq_work_interrupt+0x54/0x1d0
[ 546.232362] irq_work_interrupt+0xf/0x20
[ 546.236280] </IRQ>
[ 546.238396] RIP: 0010:release_pages+0x69/0x650
[ 546.242839] Code: 01 4c 8b 2f 4c 8d 67 08 48 8d 44 f7 08 45 31 f6 48 89 04 24 eb 04 49 83 c4 08 48 8b 05 20 fe 31 01 49 39 c5 74 26 4d 8b 7d 08 <4d> 8d 47 ff 41 83 e7 01 4d 0f 44 c5 41 8b 40 34 4d 89 c7 85 c0 0f
[ 546.261615] RSP: 0018:ffffac604732bb00 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff09
[ 546.269190] RAX: ffffe4c9c0ca8000 RBX: 0000000000000020 RCX: 0000000000000006
[ 546.276352] RDX: 0000000000000006 RSI: ffff8cf7facb6aa8 RDI: ffff8cf7facb6240
[ 546.283482] RBP: ffffac604732bb70 R08: ffffe4c9c0ba3d80 R09: 0000000000000000
[ 546.290613] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8cf7db3292d8
[ 546.297752] R13: ffffe4c9c0ba3dc0 R14: 0000000000000000 R15: ffffe4c9c0ba3d88
[ 546.304907] free_pages_and_swap_cache+0xe6/0x100
[ 546.309618] tlb_flush_mmu_free+0x36/0x60
[ 546.313625] arch_tlb_finish_mmu+0x28/0x1e0
[ 546.317801] tlb_finish_mmu+0x23/0x30
[ 546.321459] exit_mmap+0xc9/0x1f0
[ 546.324787] ? __mutex_unlock_slowpath+0x11/0x2e0
[ 546.329502] mmput+0x62/0x130
[ 546.332481] flush_old_exec+0x60e/0x810
[ 546.336314] load_elf_binary+0x830/0x18c0
[ 546.340323] ? search_binary_handler+0x8a/0x200
[ 546.344848] search_binary_handler+0x99/0x200
[ 546.349221] __do_execve_file.isra.38+0x76d/0xbc0
[ 546.353918] __x64_sys_execve+0x39/0x50
[ 546.357757] do_syscall_64+0x5a/0x460
[ 546.361440] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 546.366489] RIP: 0033:0x7fd599e3c067
[ 546.370069] Code: Bad RIP value.
[ 546.373306] RSP: 002b:00007ffcf7092e58 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 546.380880] RAX: ffffffffffffffda RBX: 00007ffcf7093058 RCX: 00007fd599e3c067
[ 546.388010] RDX: 00007ffcf7093070 RSI: 00007ffcf7093058 RDI: 00007ffcf70937c0
[ 546.395132] RBP: 00007ffcf7092ec0 R08: 00000000ffffffff R09: 0000000001b1ae40
[ 546.402265] R10: 00007ffcf7092c80 R11: 0000000000000202 R12: 00007ffcf70937c0
[ 546.409385] R13: 00007ffcf7093070 R14: 0000000000000000 R15: 00007ffcf7093048