Message-ID: <20201002224012.kafu4edg2bz6x2x6@kafai-mbp.dhcp.thefacebook.com>
Date: Fri, 2 Oct 2020 15:40:27 -0700
From: Martin KaFai Lau <kafai@...com>
To: Song Liu <songliubraving@...com>
CC: <netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
<kernel-team@...com>, <ast@...nel.org>, <daniel@...earbox.net>,
<john.fastabend@...il.com>, <kpsingh@...omium.org>
Subject: Re: [PATCH bpf-next] bpf: use raw_spin_trylock() for
pcpu_freelist_push/pop in NMI
On Fri, Sep 25, 2020 at 05:07:56PM -0700, Song Liu wrote:
> Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
> pcpu_freelist in NMI:
>
> ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi
>
> [ 18.984807] ================================
> [ 18.984807] WARNING: inconsistent lock state
> [ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
> [ 18.984809] --------------------------------
> [ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
> [ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at:
> __pcpu_freelist_pop+0xe3/0x180
> [ 18.984813] {INITIAL USE} state was registered at:
> [ 18.984814] lock_acquire+0x175/0x7c0
> [ 18.984814] _raw_spin_lock+0x2c/0x40
> [ 18.984815] __pcpu_freelist_pop+0xe3/0x180
> [ 18.984815] pcpu_freelist_pop+0x31/0x40
> [ 18.984816] htab_map_alloc+0xbbf/0xf40
> [ 18.984816] __do_sys_bpf+0x5aa/0x3ed0
> [ 18.984817] do_syscall_64+0x2d/0x40
> [ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 18.984818] irq event stamp: 12
> [ ... ]
> [ 18.984822] other info that might help us debug this:
> [ 18.984823] Possible unsafe locking scenario:
> [ 18.984823]
> [ 18.984824] CPU0
> [ 18.984824] ----
> [ 18.984824] lock(&head->lock);
> [ 18.984826] <Interrupt>
> [ 18.984826] lock(&head->lock);
> [ 18.984827]
> [ 18.984828] *** DEADLOCK ***
> [ 18.984828]
> [ 18.984829] 2 locks held by test_progs/1990:
> [ ... ]
> [ 18.984838] <NMI>
> [ 18.984838] dump_stack+0x9a/0xd0
> [ 18.984839] lock_acquire+0x5c9/0x7c0
> [ 18.984839] ? lock_release+0x6f0/0x6f0
> [ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180
> [ 18.984840] _raw_spin_lock+0x2c/0x40
> [ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180
> [ 18.984841] __pcpu_freelist_pop+0xe3/0x180
> [ 18.984842] pcpu_freelist_pop+0x17/0x40
> [ 18.984842] ? lock_release+0x6f0/0x6f0
> [ 18.984843] __bpf_get_stackid+0x534/0xaf0
> [ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
> [ 18.984844] bpf_overflow_handler+0x12f/0x3f0
>
> This is because pcpu_freelist_head.lock is accessed in both NMI and
> non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.
>
> For systems with only one cpu, there is a trickier scenario with
> pcpu_freelist_push(): if the only pcpu_freelist_head.lock is already
> locked before NMI, raw_spin_trylock() will never succeed. Unlike
> _pop(), where we can fail over and return NULL, failing _push() will leak
> memory. Fix this issue with an extra list, pcpu_freelist.extralist. The
> extralist is primarily used to take _push() when raw_spin_trylock()
> failed on all the per cpu lists. It should be empty most of the time.
It is tricky. LGTM.
Acked-by: Martin KaFai Lau <kafai@...com>
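
For readers following the thread, the trylock-plus-extralist idea described
in the quoted text can be sketched in userspace as below. This is only an
illustration under stated assumptions, not the code from the patch: C11
atomic_flag stands in for raw_spin_trylock(), NR_CPUS is fixed at 2, and all
names (struct freelist, push_nmi(), pop_nmi(), and so on) are hypothetical.
The push retries until some list accepts the node, so it never leaks; the pop
may simply give up and return NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2 /* assumption: fixed CPU count for this sketch */

struct node { struct node *next; };

struct head {
	atomic_flag lock;   /* stand-in for the raw spinlock */
	struct node *first;
};

struct freelist {
	struct head cpu[NR_CPUS]; /* per-CPU lists */
	struct head extralist;    /* fallback list, normally empty */
};

/* Returns true if the lock was taken; never spins. */
static bool head_trylock(struct head *h)
{
	return !atomic_flag_test_and_set_explicit(&h->lock, memory_order_acquire);
}

static void head_unlock(struct head *h)
{
	atomic_flag_clear_explicit(&h->lock, memory_order_release);
}

static void freelist_init(struct freelist *fl)
{
	for (int i = 0; i < NR_CPUS; i++) {
		atomic_flag_clear(&fl->cpu[i].lock);
		fl->cpu[i].first = NULL;
	}
	atomic_flag_clear(&fl->extralist.lock);
	fl->extralist.first = NULL;
}

/* Push onto one list only if its lock is free right now. */
static bool try_push(struct head *h, struct node *n)
{
	if (!head_trylock(h))
		return false;
	n->next = h->first;
	h->first = n;
	head_unlock(h);
	return true;
}

/* Pop from one list only if its lock is free right now. */
static struct node *try_pop(struct head *h)
{
	struct node *n;

	if (!head_trylock(h))
		return NULL;
	n = h->first;
	if (n)
		h->first = n->next;
	head_unlock(h);
	return n;
}

/*
 * NMI-style push: try every per-CPU list starting with this_cpu, then
 * the extralist, and repeat. A push must not fail, or the node leaks.
 */
static void push_nmi(struct freelist *fl, int this_cpu, struct node *n)
{
	assert(n != NULL);
	for (;;) {
		for (int i = 0; i < NR_CPUS; i++) {
			if (try_push(&fl->cpu[(this_cpu + i) % NR_CPUS], n))
				return;
		}
		if (try_push(&fl->extralist, n))
			return;
	}
}

/* NMI-style pop: one pass over all lists; returning NULL is acceptable. */
static struct node *pop_nmi(struct freelist *fl, int this_cpu)
{
	for (int i = 0; i < NR_CPUS; i++) {
		struct node *n = try_pop(&fl->cpu[(this_cpu + i) % NR_CPUS]);

		if (n)
			return n;
	}
	return try_pop(&fl->extralist);
}
```

In the sketch, holding every per-CPU lock before calling push_nmi() models
the single-CPU case from the commit message: the node lands on the extralist
instead of being dropped, and a later pop_nmi() can still find it there.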