lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z6OdkdI2ss19FyVT@shell.armlinux.org.uk>
Date: Wed, 5 Feb 2025 17:19:13 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: kernel test robot <oliver.sang@...el.com>
Cc: Andrew Lunn <andrew@...n.ch>, oe-lkp@...ts.linux.dev, lkp@...el.com,
	linux-kernel@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
	Jacob Keller <jacob.e.keller@...el.com>, netdev@...r.kernel.org,
	Kory Maincent <kory.maincent@...tlin.com>
Subject: Re: [linus:master] [net]  03abf2a7c6: WARNING:suspicious_RCU_usage

On Wed, Feb 05, 2025 at 02:08:04PM +0800, kernel test robot wrote:
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
> 
> commit: 03abf2a7c65451e663b078b0ed1bfa648cd9380f ("net: phylink: add EEE management")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

I think there's multiple issues here that need addressing:

1) calling phy_detach() in a context that phy_attach() is allowed
   causes this warning, which seems absurd (being able to attach but
   not detach on error is a problem.)

This is the root cause of the issue, and others have run into this same
problem. There's already been an effort to address this:
   https://lore.kernel.org/r/20250117141446.1076951-1-kory.maincent@bootlin.com
   https://lore.kernel.org/r/20250117173645.1107460-1-kory.maincent@bootlin.com
   https://lore.kernel.org/r/20250120141926.1290763-1-kory.maincent@bootlin.com
and I think the conclusion is that the RTNL had to be held while calling
phy_detach().

2) phy_modify_mmd() returning -EPERM. Having traced through the code,
   this comes from my swphy.c which returns -1 (eww). However, as this
   code was extracted from fixed_phy.c, and the emulation is provided
   for userspace, this is part of the uAPI of the kernel and can't be
   changed.

3) the blamed commit introduces a call to phy_modify_mmd() to set the
   clock-stop bit, which ought not be done unless phylink managed EEE
   is being used.

(2) and (3) together is what ends up causing:

> [   19.646149][   T22] dsa-loop fixed-0:1f lan1 (uninitialized): failed to connect to PHY: -EPERM
> [   19.647542][   T22] dsa-loop fixed-0:1f lan1 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 0
> [   19.649283][   T22] dsa-loop fixed-0:1f lan2 (uninitialized): PHY [dsa-0.0:01] driver [Generic PHY] (irq=POLL)
> [   19.650853][   T22] dsa-loop fixed-0:1f lan2 (uninitialized): failed to connect to PHY: -EPERM
> [   19.652238][   T22] dsa-loop fixed-0:1f lan2 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 1
> [   19.653856][   T22] dsa-loop fixed-0:1f lan3 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY] (irq=POLL)
> [   19.655392][   T22] dsa-loop fixed-0:1f lan3 (uninitialized): failed to connect to PHY: -EPERM
> [   19.656689][   T22] dsa-loop fixed-0:1f lan3 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 2
> [   19.658308][   T22] dsa-loop fixed-0:1f lan4 (uninitialized): PHY [dsa-0.0:03] driver [Generic PHY] (irq=POLL)
> [   19.659841][   T22] dsa-loop fixed-0:1f lan4 (uninitialized): failed to connect to PHY: -EPERM
> [   19.661168][   T22] dsa-loop fixed-0:1f lan4 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 3
> [   19.663018][   T22] DSA: tree 0 setup
> [   19.663591][   T22] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f

which then causes phy_detach() to be called, which then triggers the
"suspicious RCU" warning.

This has merely revealed a problem in the error handling since Kory's
commit on the 12th December, and actually has nothing to do with the
blamed commit, other than it revealing the latent problem.

The "hold RTNL" solution isn't trivial to implement here - phylink's
PHY connection functions can be called with RTNL already held, so it
isn't a simple case of throwing locking at phylink (which will cause
a deadlock) - it needs every phylink user to be audited and individual
patches to take the RTNL in the driver generated as necessary. I'm not
sure when I'll be able to do that. It's also a locking change for this
API - going from not needing the RTNL to requiring it.

This is probably going to result in more kernel warnings being
generated when I throw in ASSERT_RTNL() into phylink paths that could
call phy_detach(). Sounds joyful.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ