lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aItfC4AjjH-IdBfy@shell.armlinux.org.uk>
Date: Thu, 31 Jul 2025 13:18:19 +0100
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Krzysztof Kozlowski <krzk@...nel.org>, Mark Brown <broonie@...nel.org>
Cc: Chanwoo Choi <cw00.choi@...sung.com>,
	Alexandre Belloni <alexandre.belloni@...tlin.com>,
	Linus Walleij <linus.walleij@...aro.org>,
	Bartosz Golaszewski <brgl@...ev.pl>, linux-gpio@...r.kernel.org,
	linux-rtc@...r.kernel.org, linux-kernel@...r.kernel.org,
	Thierry Reding <treding@...dia.com>
Subject: Re: [BUG] 6.16-rc7: lockdep failure with max77620-gpio/max77686-rtc

On Wed, Jul 30, 2025 at 09:43:02PM +0200, Krzysztof Kozlowski wrote:
> On 30/07/2025 19:58, Russell King (Oracle) wrote:
> > Hi,
> > 
> > First, I'm not sure who is responsible for the max77620-gpio driver
> 
> 77620 is only for nvidia platforms and nvidia was upstreaming it,
> although it shares the RTC driver part with max77686. You should Cc
> nvidia SoC maintainers, maybe Thierry has someone around who could
> investigate it.
> 
> > (it's not in MAINTAINERS) but this bug points towards a problem with
> > one or other of these drivers.
> > 
> > Here is /proc/interrupts which may help debug this:
> > 
> >            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
> >  94:          1          0          0          0          0          0 max77620-
> > top   4 Edge      max77686-rtc
> >  95:          1          0          0          0          0          0 max77686-rtc   1 Edge      rtc-alarm1
> > 
> > While running 6.16-rc7 on the Jetson Xavier NX platform, upon suspend,
> > I receive the following lockdep splat. I've added some instrumentation
> > into irq_set_irq_wake() which appears twice in the calltrace to print
> > the IRQ number and the "on" parameter to locate which interrupts are
> > involved in this splat. This splat is 100% reproducable.
> > 
> > [   46.721367] irq_set_irq_wake: irq=95 on=1
> > [   46.722067] irq_set_irq_wake: irq=94 on=1
> > [   46.722181] ============================================
> > [   46.722578] WARNING: possible recursive locking detected
> > [   46.722852] 6.16.0-rc7-net-next+ #432 Not tainted
> > [   46.722965] --------------------------------------------
> > [   46.723127] rtcwake/3984 is trying to acquire lock:
> > [   46.723235] ffff0000813b2c68 (&d->lock){+.+.}-{4:4}, at: regmap_irq_lock+0x18/0x24
> > [   46.723452]
> >                but task is already holding lock:
> > [   46.723556] ffff00008504dc68 (&d->lock){+.+.}-{4:4}, at: regmap_irq_lock+0x18/0x24
> > [   46.723780]
> >                other info that might help us debug this:
> > [   46.723903]  Possible unsafe locking scenario:
> > 
> > [   46.724015]        CPU0
> > [   46.724067]        ----
> > [   46.724119]   lock(&d->lock);
> > [   46.724212]   lock(&d->lock);
> > [   46.724282]
> >                 *** DEADLOCK ***
> > 
> > [   46.724348]  May be due to missing lock nesting notation
> > 
> > [   46.724492] 6 locks held by rtcwake/3984:
> > [   46.724576]  #0: ffff0000825693f8 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x184/0x350
> > [   46.724902]  #1: ffff00008fd7fa88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x104/0x1c8
> > [   46.725258]  #2: ffff000080a64588 (kn->active#87){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x10c/0x1c8
> > [   46.725609]  #3: ffff8000815d4fb8 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x220/0x300
> > [   46.725897]  #4: ffff00008500a8f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x1d8/0x630
> > [   46.726173]  #5: ffff00008504dc68 (&d->lock){+.+.}-{4:4}, at: regmap_irq_lock+0x18/0x24
> 
> 
> max77686 only disables/enables interrupts in suspend path, but max77620
> is doing also I2C transfers, but above is regmap_irq_lock, not regmap
> lock. Maybe this is not really max77620/77686 related at all? None of
> these do anything weird (or different than last 5 years), so missing
> nesting could be result of changes in other parts...
> 
> 
> > [   46.732435]
> >                stack backtrace:
> > [   46.734019] CPU: 3 UID: 0 PID: 3984 Comm: rtcwake Not tainted 6.16.0-rc7-net-next+ #432 PREEMPT
> > [   46.734029] Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
> > [   46.734033] Call trace:
> > [   46.734036]  show_stack+0x18/0x24 (C)
> > [   46.734070]  dump_stack_lvl+0x90/0xd0
> > [   46.734080]  dump_stack+0x18/0x24
> > [   46.734107]  print_deadlock_bug+0x260/0x350
> > [   46.734114]  __lock_acquire+0xf28/0x2088
> > [   46.734120]  lock_acquire+0x19c/0x33c
> > [   46.734126]  __mutex_lock+0x84/0x530
> > [   46.734135]  mutex_lock_nested+0x24/0x30
> > [   46.734155]  regmap_irq_lock+0x18/0x24
> > [   46.734161]  __irq_get_desc_lock+0x8c/0x9c
> > [   46.734170]  irq_set_irq_wake+0x5c/0x1ac	<== I guess IRQ 94
> 
> ...like changes in irqchip.
> 
> > [   46.734176]  regmap_irq_sync_unlock+0x314/0x4f4
> > [   46.734182]  __irq_put_desc_unlock+0x48/0x4c
> > [   46.734190]  irq_set_irq_wake+0x88/0x1ac	<== I guess IRQ 95
> > [   46.734195]  max77686_rtc_suspend+0x34/0x74
> 
> 
> Because really above part is virtually unchanged since 10 years, except
> my commit d8f090dbeafdcc3d30761aa0062f19d1adf9ef08 (you can try
> reverting it... but it still could be correct/needed and just irqchip
> changed something around locking).

Thanks. I can say that reverting this has no effect on lockdep's splat,
so your change is in the clear.

However, there's also regmap-irq stuff to consider in the backtrace,
and looking at this today, I can't see how regmap-irq can be nested.

drivers/rtc/rtc-max77686.c makes use of regmap-irq since commit
f3937549a975 ("rtc: max77686: move initialisation of rtc regmap, irq
chip locally") in 2016.

drivers/mfd/max77620.c also makes use of regmap-irq since commit
327156c59360 ("mfd: max77620: Add core driver for MAX77620/MAX20024")
also in 2016.

Looking at the regmap-irq code, not much has changed in
regmap_irq_lock() and regmap_irq_sync_unlock(). The same seems true of
the ordering in irq_set_irq_wake(). So, I don't think this is a
regression as such, but a latent bug that either no one has bothered
to report, or no one bothers to test with lockdep enabled anymore.

I think the sequence here is:

irq_set_irq_wake() for IRQ 95 (rtc-alarm1) is called.
 + regmap_irq_lock() is called, which takes d->lock for IRQ 95
 + ... irq_set_irq_wake() does stuff
 ` regmap_irq_sync_unlock() is then called
   + this synchronises the wake state with the parent by calling
   | disable_irq_wake() or enable_irq_wake() as appropriate.
   | This calls irq_set_irq_wake(), causing recursion but on a
   | different IRQ, which also uses regmap-irq.
   | + regmap_irq_lock() is called, which takes d->lock for IRQ 94
   | |    * SPLAT *
   | + ...
   ` d->lock is released

This highlights the problem with "d->lock" - using "d" for a variable
name, while short, doesn't actually tell us what lock it is - is it
the irqdesc lock in kernel/irq ? Is it the regmap_irq_chip_data mutex
called "lock" in regmap_irq? It looks to me like it's the mutex.

So, I'd like to start a campaign against single-letter variables,
especially when it comes to code that takes locks! We should have
something in the kernel coding style which prevents single-letter
variable names when locks are taken!

I can't see that anything has changed in the code with regards to the
locking, so I think this is a bug that's been present ever since these
drivers were introduced, and regmap-irq is deficient in that it causes
the same lockdep lock class to be taken recursively when the IRQ wake
state changes.

>From what I can see, irq wake support for regmap-irq was added in
commit a43fd50dc99a5 ("regmap: Implement support for wake IRQs") and
this is the only operation that is propagated to the parent
interupt(s). Thus, the above splat is unlikely to occur unless one
makes use of wake support on a regmap-irq based interrupt whose
parent is also regmap-irq based.

So, adding Mark Brown.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ