netdev - Re: FEC MDIO read timeout on linkup

Date:   Thu, 5 May 2022 19:54:00 +0200
From:   Francesco Dolcini <francesco.dolcini@...adex.com>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     Francesco Dolcini <francesco.dolcini@...adex.com>,
        Joakim Zhang <qiangqing.zhang@....com>, netdev@...r.kernel.org,
        Andy Duan <fugang.duan@....com>,
        Heiner Kallweit <hkallweit1@...il.com>,
        Russell King <linux@...linux.org.uk>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        "David S. Miller" <davem@...emloft.net>,
        Fabio Estevam <festevam@...il.com>,
        Tim Harvey <tharvey@...eworks.com>,
        Chris Healy <cphealy@...il.com>
Subject: Re: FEC MDIO read timeout on linkup

On Thu, May 05, 2022 at 07:41:00PM +0200, Andrew Lunn wrote:
> On Thu, May 05, 2022 at 10:29:01AM +0200, Francesco Dolcini wrote:
> > Hello Andrew and all, I believe I finally found the problem and I'm
> > preparing a patch for it.
> > 
> > On Wed, May 04, 2022 at 12:17:59AM +0200, Andrew Lunn wrote:
> > > > I'm wondering whether this could be related to
> > > > fec_enet_adjust_link()->fec_restart() running during a
> > > > fec_enet_mdio_read(), with one of the many register writes in
> > > > fec_restart() causing the issue, maybe while resetting the FEC?
> > > > Does this make any sense?
> > > 
> > > phylib is 'single threaded', in that only one thing will be active at
> > > once for a PHY. While fec_enet_adjust_link() is being called, there
> > > will not be any read/writes occurring for that PHY.
> > 
> > I think this is not the whole story here. We can have a PHY interrupt
> > handler that runs in its own context and that could be doing an MDIO
> > transaction, and this is exactly my case.
> > 
> > Thread 1 (phylib WQ)       | Thread 2 (phy interrupt)
> >                            |
> >                            | phy_interrupt()            <-- PHY IRQ
> >                            |  handle_interrupt()
> >                            |   phy_read()
> >                            |   phy_trigger_machine()
> >                            |    --> schedule WQ
> >                            |
> >                            |
> > phy_state_machine()        |
> >  phy_check_link_status()   |
> >   phy_link_change()        |
> >    phydev->adjust_link()   |
> >     fec_enet_adjust_link() |
> >      --> FEC reset         | phy_interrupt()            <-- PHY IRQ
> >                            |  phy_read()
> >                            |
> > 
> > To confirm this, I added a spinlock to detect the race condition,
> > taken only with a trylock, plus a WARN_ON(1) when the trylock fails.
> > Whenever the "MDIO read timeout" occurs, acquiring the spinlock fails.
> > 
> > This also agrees with the fact that polling the PHY, instead of using
> > the interrupt, works just fine.
> 
> Yes, that makes sense.
> 
> But I would fix this differently. The interrupt handler runs as a
> threaded interrupt, so it can take a mutex. It should therefore
> actually take the phy mutex.

I was just about to send a patch with phy_lock_mdio_bus() in
fec_enet_adjust_link(). Anyway, I'll send the version you proposed in a
bit.
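
For anyone reading this in the archive, the two ideas discussed above
(detecting the overlap with a trylock, then serializing with a mutex)
can be sketched in user space roughly as follows. This is only an
illustration under stated assumptions, not the kernel patch: struct
fake_bus and all fake_*() functions are hypothetical names, and a
pthread mutex stands in for the kernel spinlock/mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the shared MDIO/FEC state. */
struct fake_bus {
	pthread_mutex_t lock;	/* default, non-recursive mutex */
	int overlaps;		/* accesses attempted while the bus was busy */
};

static void fake_bus_init(struct fake_bus *bus)
{
	int ret = pthread_mutex_init(&bus->lock, NULL);

	assert(ret == 0);
	(void)ret;
	bus->overlaps = 0;
}

/* Stand-in for the adjust_link path: the "reset" owns the bus meanwhile. */
static void fake_reset_begin(struct fake_bus *bus)
{
	pthread_mutex_lock(&bus->lock);
}

static void fake_reset_end(struct fake_bus *bus)
{
	pthread_mutex_unlock(&bus->lock);
}

/*
 * Detection variant: never wait, just try the lock. A failure means
 * another path is inside its critical section right now, which is the
 * situation the WARN_ON(1) experiment was meant to flag.
 */
static bool fake_mdio_read_checked(struct fake_bus *bus)
{
	if (pthread_mutex_trylock(&bus->lock) != 0) {
		bus->overlaps++;
		return false;
	}
	/* ... the register access would happen here ... */
	pthread_mutex_unlock(&bus->lock);
	return true;
}

/*
 * Serialized variant: wait for the lock instead of racing. This is only
 * acceptable in a context that may sleep, such as a threaded interrupt
 * handler. Do not call it from the thread that already holds the lock.
 */
static bool fake_mdio_read_serialized(struct fake_bus *bus)
{
	pthread_mutex_lock(&bus->lock);
	/* ... the register access would happen here ... */
	pthread_mutex_unlock(&bus->lock);
	return true;
}
```

Calling fake_mdio_read_checked() between fake_reset_begin() and
fake_reset_end() reports the overlap instead of blocking, while the
serialized variant simply waits until the reset has finished.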

Thanks,
Francesco