linux-kernel - Re: drivers/nx: Invalid wait context issue when rebooting

Message-ID: <ZwleX01sala8Rc3-@linux.ibm.com>
Date: Fri, 11 Oct 2024 22:50:31 +0530
From: Vishal Chourasia <vishalc@...ux.ibm.com>
To: Michael Ellerman <mpe@...erman.id.au>
Cc: linuxppc-dev@...ts.ozlabs.org, Herbert Xu <herbert@...dor.apana.org.au>,
        "David S. Miller" <davem@...emloft.net>,
        Nicholas Piggin <npiggin@...il.com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Naveen N Rao <naveen@...nel.org>,
        Madhavan Srinivasan <maddy@...ux.ibm.com>,
        linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: drivers/nx: Invalid wait context issue when rebooting

On Fri, Oct 11, 2024 at 05:13:11PM +0530, Vishal Chourasia wrote:
> On Fri, Oct 11, 2024 at 09:37:27PM +1100, Michael Ellerman wrote:
> > Vishal Chourasia <vishalc@...ux.ibm.com> writes:
> > > Hi,
> > > I am getting an "Invalid wait context" warning when rebooting an LPAR.
> > >
> > > kexec/61926 is trying to acquire `of_reconfig_chain.rwsem` while holding
> > > spinlock `devdata_mutex`
> > >
> > > Note: Name of the spinlock is misleading.
> > 
> > Oof, yeah let's rename that to devdata_spinlock at least.
> > 
> > > In my case, I compiled a new vmlinux file and loaded it into the running
> > > kernel using `kexec -l` and then hit `reboot`
> > >
> > > dmesg:
> > > ------
> > >
> > > [ BUG: Invalid wait context ]
> > > 6.11.0-test2-10547-g684a64bf32b6-dirty #79 Not tainted
> > 
> > Is that v6.11 plus ~10,000 patches? O_o
> > 
> > Ah no, 684a64bf32b6 is roughly v6.12-rc1. Maybe if you fetch tags into
> > your tree you will get a more sensible version string?
> > 
> > Could also be good to try v6.12-rc2.
> Sure.
> > 
> > > -----------------------------
> > > kexec/61926 is trying to lock:
> > > c000000002d8b590 ((of_reconfig_chain).rwsem){++++}-{4:4}, at: blocking_notifier_chain_unregister+0x44/0xa0
> > > other info that might help us debug this:
> > > context-{5:5}
> > > 4 locks held by kexec/61926:
> > >  #0: c000000002926c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xf8/0x2e0
> > >  #1: c00000000291af30 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x160/0x310
> > >  #2: c000000051011938 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x174/0x310
> > >  #3: c000000002d88070 (devdata_mutex){....}-{3:3}, at: nx842_remove+0xac/0x1bc
> >   
> > That's pretty conclusive.
> > 
> > I don't understand why you're the first person to see this. I can't see
> > that any of the relevant code has changed recently. Unless something in
> > lockdep itself changed?
> > 
> > Did you just start seeing this on recent kernels? Can you bisect?
> Yes. Sure, I will try bisecting, and get back.
I tested the v6.0, v6.6, and v6.9 kernels, and all of them hit this
bug.

Also, this bug is hit when I load the vmlinux file using `kexec -l`
and do a reboot.
> > 
> > > stack backtrace:
> > > CPU: 2 UID: 0 PID: 61926 Comm: kexec Not tainted 6.11.0-test2-10547-g684a64bf32b6-dirty #79
> > > Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
> > > Call Trace:
> > > [c0000000bb577400] [c000000001239704] dump_stack_lvl+0xc8/0x130 (unreliable)
> > > [c0000000bb577440] [c000000000248398] __lock_acquire+0xb68/0xf00
> > > [c0000000bb577550] [c000000000248820] lock_acquire.part.0+0xf0/0x2a0
> > > [c0000000bb577670] [c00000000127faa0] down_write+0x70/0x1e0
> > > [c0000000bb5776b0] [c0000000001acea4] blocking_notifier_chain_unregister+0x44/0xa0
> > > [c0000000bb5776e0] [c000000000e2312c] of_reconfig_notifier_unregister+0x2c/0x40
> > > [c0000000bb577700] [c000000000ded24c] nx842_remove+0x148/0x1bc
> > > [c0000000bb577790] [c00000000011a114] vio_bus_remove+0x54/0xc0
> > > [c0000000bb5777c0] [c000000000c1a44c] device_shutdown+0x20c/0x310
> > > [c0000000bb577850] [c0000000001b0ab4] kernel_restart_prepare+0x54/0x70
> > > [c0000000bb577870] [c000000000308718] kernel_kexec+0xa8/0x110
> > > [c0000000bb5778e0] [c0000000001b1144] __do_sys_reboot+0x214/0x2e0
> > > [c0000000bb577a40] [c000000000032f98] system_call_exception+0x148/0x310
> > > [c0000000bb577e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> > 
> > I don't see why of_reconfig_notifier_unregister() needs to be called
> > with the devdata_mutex held, but I haven't looked that closely at it.
> > 
> > So the change below might work.
> > 
> > cheers
> > 
> > diff --git a/drivers/crypto/nx/nx-common-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> > index 35f2d0d8507e..a2050c5fb11d 100644
> > --- a/drivers/crypto/nx/nx-common-pseries.c
> > +++ b/drivers/crypto/nx/nx-common-pseries.c
> > @@ -1122,10 +1122,11 @@ static void nx842_remove(struct vio_dev *viodev)
> >  
> >  	crypto_unregister_alg(&nx842_pseries_alg);
> >  
> > +	of_reconfig_notifier_unregister(&nx842_of_nb);
> > +
> >  	spin_lock_irqsave(&devdata_mutex, flags);
> >  	old_devdata = rcu_dereference_check(devdata,
> >  			lockdep_is_held(&devdata_mutex));
> > -	of_reconfig_notifier_unregister(&nx842_of_nb);
> >  	RCU_INIT_POINTER(devdata, NULL);
> >  	spin_unlock_irqrestore(&devdata_mutex, flags);
> >  	synchronize_rcu();
> >