netdev - Re: [PATCH net-next] net: phy: avoid kernel warning dump when stopping an errored PHY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZPc0kHHMrNr0cgp/@shell.armlinux.org.uk>
Date: Tue, 5 Sep 2023 15:00:48 +0100
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Andrew Lunn <andrew@...n.ch>
Cc: Jijie Shao <shaojijie@...wei.com>, f.fainelli@...il.com,
	davem@...emloft.net, edumazet@...gle.com, hkallweit1@...il.com,
	kuba@...nel.org, netdev@...r.kernel.org, pabeni@...hat.com,
	"shenjian15@...wei.com" <shenjian15@...wei.com>,
	"liuyonglong@...wei.com" <liuyonglong@...wei.com>,
	wangjie125@...wei.com, chenhao418@...wei.com,
	Hao Lan <lanhao@...wei.com>,
	"wangpeiyang1@...wei.com" <wangpeiyang1@...wei.com>
Subject: Re: [PATCH net-next] net: phy: avoid kernel warning dump when
 stopping an errored PHY

On Tue, Sep 05, 2023 at 02:09:29PM +0200, Andrew Lunn wrote:
> > When we do a phy_stop(), hardware might be error and we can't access to
> > mdio.And our process is read/write mdio failed first, then do phy_stop(),
> > reset hardware and call phy_start() finally.
> 
> If the hardware/fimrware is already dead, you have to expect a stack
> trace, because the once a second poll can happen, before you notice
> the hardware/firmware is dead and call phy_stop().
> 
> You might want to also disconnect the PHY and reconnect it after the
> reset.

Andrew,

I think that's what is being tried here, but there's a race between
phy_stop() and phy_state_machine() which is screwing up phydev->state.

Honestly, the locking in phy_state_machine() is insane, allows this
race to happen, and really needs fixing... and I think that the
phydev->lock usage has become really insane over the years. We have
some driver methods now which are always called with the lock held,
others where the lock may or may not be held, and others where the
lock isn't held - and none of this is documented.

Please can you have a look at the four patches I've just posted as
attached to my previous email. I think we need to start sorting out
some of this crazyness and santising the locking.

My four patches address most of it, except the call to phy_suspend().
If that can be solved, then we can improve the locking more, and
eliminate the race entirely.

If we held the lock over the entire state machine function, then the
problem that has been reported here would not exist - phy_stop()
would not be able to "nip in" during the middle of the PHY state
machine running, and thus we would not see the PHY_HALTED state
overwritten by PHY_ERROR unexpectedly.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!