lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250327140905.26ab227b@kmaincent-XPS-13-7390>
Date: Thu, 27 Mar 2025 14:09:05 +0100
From: Kory Maincent <kory.maincent@...tlin.com>
To: "Russell King (Oracle)" <linux@...linux.org.uk>
Cc: kernel test robot <oliver.sang@...el.com>, Andrew Lunn <andrew@...n.ch>,
 oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org, Jakub
 Kicinski <kuba@...nel.org>, Jacob Keller <jacob.e.keller@...el.com>,
 netdev@...r.kernel.org
Subject: Re: [linus:master] [net]  03abf2a7c6: WARNING:suspicious_RCU_usage

Hello Russell,

On Wed, 5 Feb 2025 17:19:13 +0000
"Russell King (Oracle)" <linux@...linux.org.uk> wrote:

> On Wed, Feb 05, 2025 at 02:08:04PM +0800, kernel test robot wrote:
> > kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
> > 
> > commit: 03abf2a7c65451e663b078b0ed1bfa648cd9380f ("net: phylink: add EEE
> > management")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master  
> 
> I think there's multiple issues here that need addressing:
> 
> 1) calling phy_detach() in a context that phy_attach() is allowed
>    causes this warning, which seems absurd (being able to attach but
>    not detach on error is a problem.)
> 
> This is the root cause of the issue, and others have run into this same
> problem. There's already been an effort to address this:
>    https://lore.kernel.org/r/20250117141446.1076951-1-kory.maincent@bootlin.com
>    https://lore.kernel.org/r/20250117173645.1107460-1-kory.maincent@bootlin.com
>    https://lore.kernel.org/r/20250120141926.1290763-1-kory.maincent@bootlin.com
> and I think the conclusion is that the RTNL had to be held while calling
> phy_detach().
> 
> 2) phy_modify_mmd() returning -EPERM. Having traced through the code,
>    this comes from my swphy.c which returns -1 (eww). However, as this
>    code was extracted from fixed_phy.c, and the emulation is provided
>    for userspace, this is part of the uAPI of the kernel and can't be
>    changed.
> 
> 3) the blamed commit introduces a call to phy_modify_mmd() to set the
>    clock-stop bit, which ought not be done unless phylink managed EEE
>    is being used.
> 
> (2) and (3) together is what ends up causing:
> 
> > [   19.646149][   T22] dsa-loop fixed-0:1f lan1 (uninitialized): failed to
> > connect to PHY: -EPERM [   19.647542][   T22] dsa-loop fixed-0:1f lan1
> > (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 0 [
> > 19.649283][   T22] dsa-loop fixed-0:1f lan2 (uninitialized): PHY
> > [dsa-0.0:01] driver [Generic PHY] (irq=POLL) [   19.650853][   T22]
> > dsa-loop fixed-0:1f lan2 (uninitialized): failed to connect to PHY: -EPERM
> > [   19.652238][   T22] dsa-loop fixed-0:1f lan2 (uninitialized): error -1
> > setting up PHY for tree 0, switch 0, port 1 [   19.653856][   T22] dsa-loop
> > fixed-0:1f lan3 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY]
> > (irq=POLL) [   19.655392][   T22] dsa-loop fixed-0:1f lan3 (uninitialized):
> > failed to connect to PHY: -EPERM [   19.656689][   T22] dsa-loop fixed-0:1f
> > lan3 (uninitialized): error -1 setting up PHY for tree 0, switch 0, port 2
> > [   19.658308][   T22] dsa-loop fixed-0:1f lan4 (uninitialized): PHY
> > [dsa-0.0:03] driver [Generic PHY] (irq=POLL) [   19.659841][   T22]
> > dsa-loop fixed-0:1f lan4 (uninitialized): failed to connect to PHY: -EPERM
> > [   19.661168][   T22] dsa-loop fixed-0:1f lan4 (uninitialized): error -1
> > setting up PHY for tree 0, switch 0, port 3 [   19.663018][   T22] DSA:
> > tree 0 setup [   19.663591][   T22] dsa-loop fixed-0:1f: DSA mockup driver:
> > 0x1f  
> 
> which then causes phy_detach() to be called, which then triggers the
> "suspicious RCU" warning.
> 
> This has merely revealed a problem in the error handling since Kory's
> commit on the 12th December, and actually has nothing to do with the
> blamed commit, other than it revealing the latent problem.
> 
> The "hold RTNL" solution isn't trivial to implement here - phylink's
> PHY connection functions can be called with RTNL already held, so it
> isn't a simple case of throwing locking at phylink (which will cause
> a deadlock) - it needs every phylink user to be audited and individual
> patches to take the RTNL in the driver generated as necessary. I'm not
> sure when I'll be able to do that. It's also a locking change for this
> API - going from not needing the RTNL to requiring it.
> 
> This is probably going to result in more kernel warnings being
> generated when I throw in ASSERT_RTNL() into phylink paths that could
> call phy_detach(). Sounds joyful.

It is indeed painful! I have began to take a look at it:
https://termbin.com/d9tq

I don't know if there is a better way to do this ...

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ