[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <08a5981f-fb20-4ff0-92a6-45aaee952b00@virtuozzo.com>
Date: Mon, 9 Jun 2025 12:25:32 +0800
From: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
Boqun Feng <boqun.feng@...il.com>, Waiman Long <longman@...hat.com>,
Kees Cook <kees@...nel.org>, Joel Granados <joel.granados@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Konstantin Khorenko <khorenko@...tuozzo.com>, Denis Lunev
<den@...tuozzo.com>,
Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] locking: detect spin_lock_irq() call with disabled
interrupts
On 6/6/25 18:58, Peter Zijlstra wrote:
> On Fri, Jun 06, 2025 at 05:57:23PM +0800, Pavel Tikhomirov wrote:
>> This is intended to easily detect irq spinlock self-deadlocks like:
>>
>> spin_lock_irq(A);
>> spin_lock_irq(B);
>> spin_unlock_irq(B);
>> IRQ {
>> spin_lock(A); <- deadlocks
>> spin_unlock(A);
>> }
>> spin_unlock_irq(A);
>>
>> Recently we saw this kind of deadlock on our partner's node:
>>
>> PID: 408 TASK: ffff8eee0870ca00 CPU: 36 COMMAND: "kworker/36:1H"
>> #0 [fffffe3861831e60] crash_nmi_callback at ffffffff97269e31
>> #1 [fffffe3861831e68] nmi_handle at ffffffff972300bb
>> #2 [fffffe3861831eb0] default_do_nmi at ffffffff97e9e000
>> #3 [fffffe3861831ed0] exc_nmi at ffffffff97e9e211
>> #4 [fffffe3861831ef0] end_repeat_nmi at ffffffff98001639
>> [exception RIP: native_queued_spin_lock_slowpath+638]
>> RIP: ffffffff97eb31ae RSP: ffffb1c8cd2a4d40 RFLAGS: 00000046
>> RAX: 0000000000000000 RBX: ffff8f2dffb34780 RCX: 0000000000940000
>> RDX: 000000000000002a RSI: 0000000000ac0000 RDI: ffff8eaed4eb81c0
>> RBP: ffff8eaed4eb81c0 R8: 0000000000000000 R9: ffff8f2dffaf3438
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> R13: 0000000000000024 R14: 0000000000000000 R15: ffffd1c8bfb24b80
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #5 [ffffb1c8cd2a4d40] native_queued_spin_lock_slowpath at ffffffff97eb31ae
>> #6 [ffffb1c8cd2a4d60] _raw_spin_lock_irqsave at ffffffff97eb2730
>> #7 [ffffb1c8cd2a4d70] __wake_up at ffffffff9737c02d
>> #8 [ffffb1c8cd2a4da0] sbitmap_queue_wake_up at ffffffff9786c74d
>> #9 [ffffb1c8cd2a4dc8] sbitmap_queue_clear at ffffffff9786cc97
>> --- <IRQ stack> ---
>> [exception RIP: _raw_spin_unlock_irq+20]
>> RIP: ffffffff97eb2e84 RSP: ffffb1c8cd90fd18 RFLAGS: 00000283
>> RAX: 0000000000000001 RBX: ffff8eafb68efb40 RCX: 0000000000000001
>> RDX: 0000000000000008 RSI: 0000000000000061 RDI: ffff8eafb06c3c70
>> RBP: ffff8eee7af43000 R8: ffff8eaed4eb81c8 R9: ffff8eaed4eb81c8
>> R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
>> R13: ffff8eafb06c3bd0 R14: ffff8eafb06c3bc0 R15: ffff8eaed4eb81c0
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>>
>> Luckily it was already fixed in mainstream by:
>> commit b313a8c83551 ("block: Fix lockdep warning in blk_mq_mark_tag_wait")
>>
>> Currently if we are unlucky we may miss such a deadlock on our testing
>> system as it is racy and it depends on the specific interrupt handler
>> appearing at the right place and at the right time. So this patch tries
>> to detect the problem despite the absence of the interrupt.
>>
>> If we see spin_lock_irq under interrupts already disabled we can assume
>> that it has paired spin_unlock_irq which would reenable interrupts where
>> they should not be reenabled. So we report a warning for it.
>>
>> Same thing on spin_unlock_irq even if we were lucky and there was no
>> deadlock let's report if interrupts were enabled.
>>
>> Let's make this functionality catch one problem and then be disabled, to
>> prevent from spamming kernel log with warnings. Also let's add sysctl
>> kernel.debug_spin_lock_irq_with_disabled_interrupts to reenable it if
>> needed. Also let's add a by default enabled configuration option
>> DEBUG_SPINLOCK_IRQ_WITH_DISABLED_INTERRUPTS_BY_DEFAULT, in case we will
>> need this on boot.
>>
>> Yes Lockdep can detect that, if it sees both the interrupt stack and the
>> regular stack where we can get into interrupt with spinlock held. But
>> with this approach we can detect the problem even without ever getting
>> into interrupt stack. And also this functionality seems to be more
>> lightweight then Lockdep as it does not need to maintain lock dependency
>> graph.
>
> So why do we need DEBUG_SPINLOCK code, that's injected into every single
> callsite, if lockdep can already detect this?
Hello Peter,
Thank you for your reply!
We are trying to figure out a way to improve the detection of similar
cases in future, in our tests and even in some cases in production. I
see that in mainstream this issue (commit b313a8c83551) was detected by
Lockdep and fixed in the next release, that is very fast. But, sadly, we
didn't catch it in our QA testing cycles (including runs with debug
kernel with Lockdep), and it led to repeated complete nodes hang in
production of one of our partners.
1. This patch, arguably, is less performance consuming than Lockdep (I
don't have performance results, but my motivations is: we don't need to
save stacks and manage lock topology tree on each lock, thus saving
memory and some cpu cycles). Because Lockdep makes tests slow, we don't
run all tests with it, and hopefully we would be able to run all tests
with this small check, and it will improve detection probability.
2. If on a production node we have a reproduce of such a deadlock on
spinlock and we can't easily find the real spinlock owner in crashdump,
we can probably enable this small feature there and catch the "guilty"
stack directly on next reproduce. Running Lockdep there will definitely
be a no go. The suggested checks are really quite small and lightweight
and we think even release kernels can have those checks compiled-in
(with static key disabled by default surely) and those checks can be
enabled without extra reboot on affected nodes and with very small
performance footprint.
3. Lockdep generates false positives, from my experience roughly <10% of
what it reports is an actual thing, but maybe I'm just unlucky (or not
understanding something). And then Lockdep detects a problem there is no
way to search for the next error (AFAIU lock dependency tree gets a
cycle), you have to reboot, if I don't miss something. So we have to fix
even false positives if we want to get possible new detections later on
the same test. (Or maybe disable Lockdep tracing with lockdep_off, but
that can skip a real problem nearby.)
4. If lock in interrupt code path will not be triggered in test Lockdep
will not detect the deadlock. But with this check we only need to
trigger the non-interrupt code path to detect the problem (having
enabled interrupt brefore spin_unlock_irq* is always a bad sign) and
possible deadlock. So we again improve probability of detecting such
cases by a small fraction.
>
>
--
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.
Powered by blists - more mailing lists