[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z6OdkdI2ss19FyVT@shell.armlinux.org.uk>
Date: Wed, 5 Feb 2025 17:19:13 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: kernel test robot <oliver.sang@...el.com>
Cc: Andrew Lunn <andrew@...n.ch>, oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Jacob Keller <jacob.e.keller@...el.com>, netdev@...r.kernel.org,
Kory Maincent <kory.maincent@...tlin.com>
Subject: Re: [linus:master] [net] 03abf2a7c6: WARNING:suspicious_RCU_usage
On Wed, Feb 05, 2025 at 02:08:04PM +0800, kernel test robot wrote:
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
>
> commit: 03abf2a7c65451e663b078b0ed1bfa648cd9380f ("net: phylink: add EEE management")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
I think there's multiple issues here that need addressing:
1) calling phy_detach() in a context that phy_attach() is allowed
causes this warning, which seems absurd (being able to attach but
not detach on error is a problem.)
This is the root cause of the issue, and others have run into this same
problem. There's already been an effort to address this:
https://lore.kernel.org/r/20250117141446.1076951-1-kory.maincent@bootlin.com
https://lore.kernel.org/r/20250117173645.1107460-1-kory.maincent@bootlin.com
https://lore.kernel.org/r/20250120141926.1290763-1-kory.maincent@bootlin.com
and I think the conclusion is that the RTNL had to be held while calling
phy_detach().
2) phy_modify_mmd() returning -EPERM. Having traced through the code,
this comes from my swphy.c which returns -1 (eww). However, as this
code was extracted from fixed_phy.c, and the emulation is provided
for userspace, this is part of the uAPI of the kernel and can't be
changed.
3) the blamed commit introduces a call to phy_modify_mmd() to set the
clock-stop bit, which ought not be done unless phylink managed EEE
is being used.
(2) and (3) together is what ends up causing:
> [ 19.646149][ T22] dsa-loop fixed-0:1f lan1 (uninitialized): failed to connect to PHY: -EPERM
> [ 19.647542][ T22] dsa-loop fixed-0:1f lan1 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 0
> [ 19.649283][ T22] dsa-loop fixed-0:1f lan2 (uninitialized): PHY [dsa-0.0:01] driver [Generic PHY] (irq=POLL)
> [ 19.650853][ T22] dsa-loop fixed-0:1f lan2 (uninitialized): failed to connect to PHY: -EPERM
> [ 19.652238][ T22] dsa-loop fixed-0:1f lan2 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 1
> [ 19.653856][ T22] dsa-loop fixed-0:1f lan3 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY] (irq=POLL)
> [ 19.655392][ T22] dsa-loop fixed-0:1f lan3 (uninitialized): failed to connect to PHY: -EPERM
> [ 19.656689][ T22] dsa-loop fixed-0:1f lan3 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 2
> [ 19.658308][ T22] dsa-loop fixed-0:1f lan4 (uninitialized): PHY [dsa-0.0:03] driver [Generic PHY] (irq=POLL)
> [ 19.659841][ T22] dsa-loop fixed-0:1f lan4 (uninitialized): failed to connect to PHY: -EPERM
> [ 19.661168][ T22] dsa-loop fixed-0:1f lan4 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 3
> [ 19.663018][ T22] DSA: tree 0 setup
> [ 19.663591][ T22] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f
which then causes phy_detach() to be called, which then triggers the
"suspicious RCU" warning.
This has merely revealed a problem in the error handling since Kory's
commit on the 12th December, and actually has nothing to do with the
blamed commit, other than it revealing the latent problem.
The "hold RTNL" solution isn't trivial to implement here - phylink's
PHY connection functions can be called with RTNL already held, so it
isn't a simple case of throwing locking at phylink (which will cause
a deadlock) - it needs every phylink user to be audited and individual
patches to take the RTNL in the driver generated as necessary. I'm not
sure when I'll be able to do that. It's also a locking change for this
API - going from not needing the RTNL to requiring it.
This is probably going to result in more kernel warnings being
generated when I throw in ASSERT_RTNL() into phylink paths that could
call phy_detach(). Sounds joyful.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Powered by blists - more mailing lists