netdev - Re: Beaglebone Ethernet Probe Failure In 6.8+

Message-ID: <3kpvqcg3twpifzkxkrvhqew3cjuq2imgo4d4b775oypguik55g@npe75wf7rpdr>
Date: Tue, 23 Apr 2024 15:07:15 -0500
From: Andrew Halaney <ahalaney@...hat.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: Colin Foster <colin.foster@...advantage.com>, netdev@...r.kernel.org
Subject: Re: Beaglebone Ethernet Probe Failure In 6.8+

On Tue, Apr 23, 2024 at 03:52:35PM +0200, Andrew Lunn wrote:
> On Mon, Apr 22, 2024 at 11:00:51PM -0500, Colin Foster wrote:
> > Hi Andrew L,
> > 
> > (I CC'd Andrew Halaney, original author, for visibility)
> > 
> > On Wed, Apr 17, 2024 at 09:30:58PM +0200, Andrew Lunn wrote:
> > > On Wed, Apr 17, 2024 at 10:42:02AM -0500, Colin Foster wrote:
> > > > Hello,
> > > > 
> > > > I'm chasing down an issue in recent kernels. My setup is slightly
> > > > unconventional: a BBB with ETH0 as a CPU port to a DSA switch that is
> > > > controlled by SPI. I'll have hardware next week, but think it is worth
> > > > getting a discussion going.
> > > > 
> > > > The commit in question is commit df16c1c51d81 ("net: phy: mdio_device:
> > > > Reset device only when necessary"). This seems to cause a probe error of
> > > > the MDIO device. A dump_stack was added where the reset is skipped.
> > > > 
> > > > SMSC LAN8710/LAN8720: probe of 4a101000.mdio:00 failed with error -5
> > > 
> > > Can you confirm this EIO is this one:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/ti/davinci_mdio.c#L440
> > 
> > Yes, I can confirm this is the EIO.
> > 
> > > 
> > > It would be good to check the value of USERACCESS_ACK, and what the
> > > datasheet says about it.
> > 
> > The register value is 0x0020ffff
> 
> The 0xffff is the value read from the bus. That probably means the PHY
> did not answer, although it could legitimately return 0xffff to a
> read. More important is bit 29: "Acknowledge. This bit is set if the
> PHY acknowledged the read transaction." It is 0, so it thinks the PHY
> did not respond.
> 
> > The patch I threw in:
> > 
> > --- a/drivers/net/ethernet/ti/davinci_mdio.c
> > +++ b/drivers/net/ethernet/ti/davinci_mdio.c
> > @@ -437,7 +437,10 @@ static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
> >                         break;
> > 
> >                 reg = readl(&data->regs->user[0].access);
> > +               printk("davinci mdio reg is 0x%08x\n", reg);
> >                 ret = (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) : -EIO;
> > +               if (ret == -EIO)
> > +                   printk("ret is this EIO\n");
> >                 break;
> >         }
> > 
> > 
> > The print:
> > 
> > [    1.537767] davinci_mdio 4a101000.mdio: davinci mdio revision 1.6, bus freq 1000000
> > [    1.538111] davinci mdio reg is 0x20400007
> 
> This is a read of register 2, and the register has value 0x0007
> 
> > [    1.538372] davinci mdio reg is 0x2060c0f1
> 
> This is a read of register 3, and the register has value 0xc0f1.
> 
> These are the ID registers, and match SMSC LAN8710/LAN8720.
> 
> > [    1.549523] davinci mdio reg is 0x03a0ffff
> 
> Register 0x1d. Not one of the standard registers. I don't know what is
> happening here.
> 
> > [    1.549551] ret is this EIO
> > [    1.549806] davinci mdio reg is 0x0020ffff
> 
> Register 1, basic mode status register.
> 
> > [    1.549821] ret is this EIO
> 
> In these two last transactions, the ACK bit is not set.
> 
> > [    1.550471] SMSC LAN8710/LAN8720: probe of 4a101000.mdio:00 failed with error -5
> > [    1.550592] davinci_mdio 4a101000.mdio: phy[0]: device 4a101000.mdio:00, driver SMSC LAN8710/LAN8720
> > 
> > Without the mdiodev->reset_state patch, I see the following:
> > 
> > [    1.537817] davinci_mdio 4a101000.mdio: davinci mdio revision 1.6, bus freq 1000000
> > [    1.538165] davinci mdio reg is 0x20400007
> > [    1.538426] davinci mdio reg is 0x2060c0f1
> 
> Same as above.
> 
> > [    1.558442] davinci mdio reg is 0x23a00090
> > [    1.558717] davinci mdio reg is 0x20207809
> > [    1.559681] davinci mdio reg is 0x21c0ffff
> 
> In all these cases, we see the ACK bit set. 
> 
> So the PHY is responding to registers 2 and 3, the ID registers. But
> it seems to be failing to respond to other registers. At a guess, i
> would say it is still coming out of reset. Does the datasheet for the
> LAN8710/LAN8720 say anything about how long a reset takes? Can you get
> a logic analyser onto the reset line and MDIO bus and see how
> different the timing is? It might be you need to add some delay values
> to the reset in DT.
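
As an aside, the USERACCESS values quoted above can be decoded by hand
with a small userspace sketch like the one below. The bit positions
(GO = 31, WRITE = 30, ACK = 29, register number in bits 25:21, PHY
address in bits 20:16, data in bits 15:0) are assumed from the
discussion above and from how davinci_mdio.c composes a read command,
so please check them against the TRM. The names useraccess_fields,
useraccess_decode() and useraccess_read_result() are made up for
illustration and are not kernel APIs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the davinci MDIO USERACCESS register. */
#define USERACCESS_GO     (UINT32_C(1) << 31)
#define USERACCESS_WRITE  (UINT32_C(1) << 30)
#define USERACCESS_ACK    (UINT32_C(1) << 29)
#define USERACCESS_DATA   UINT32_C(0xffff)

/* Hypothetical helper type: one decoded USERACCESS value. */
struct useraccess_fields {
	bool go;               /* transaction still in flight */
	bool write;            /* write (1) or read (0) */
	bool ack;              /* PHY acknowledged the read */
	unsigned int phy_reg;  /* register number, bits 25:21 */
	unsigned int phy_addr; /* PHY address, bits 20:16 */
	unsigned int data;     /* 16-bit data field, bits 15:0 */
};

/* Split a raw register value into its fields. */
struct useraccess_fields useraccess_decode(uint32_t reg)
{
	struct useraccess_fields f;

	f.go = (reg & USERACCESS_GO) != 0;
	f.write = (reg & USERACCESS_WRITE) != 0;
	f.ack = (reg & USERACCESS_ACK) != 0;
	f.phy_reg = (reg >> 21) & 0x1f;
	f.phy_addr = (reg >> 16) & 0x1f;
	f.data = reg & USERACCESS_DATA;
	return f;
}

/* Same decision as the line in the quoted patch: data on ACK, else -EIO. */
int useraccess_read_result(uint32_t reg)
{
	return (reg & USERACCESS_ACK) ? (int)(reg & USERACCESS_DATA) : -EIO;
}
```

Feeding 0x03a0ffff through useraccess_decode() gives register 0x1d
with ACK clear, and useraccess_read_result() turns that into -EIO,
which lines up with the failing probe in the log.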

For what it's worth, I think this theory makes sense if reverting the
patch highlighted above makes the failure go away. Before that patch,
you'd see a flow like this (quoting its commit message):

    net: phy: mdio_device: Reset device only when necessary

    Currently the phy reset sequence is as shown below for a
    devicetree described mdio phy on boot:

    1. Assert the phy_device's reset as part of registering
    2. Deassert the phy_device's reset as part of registering
    3. Deassert the phy_device's reset as part of phy_probe
    4. Deassert the phy_device's reset as part of phy_hw_init

Which means that whatever the deassert delay was, it was effectively
tripled before you got around to phy_hw_init() (which, if I understand
correctly, is where the reads above start reporting no ACK).
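
To make the arithmetic concrete, here is a toy userspace model of that
sequence. It assumes each reset call sleeps for the matching
assert/deassert delay after driving the line, and that the patch makes
a call return early when the requested state is already the current
one. This is a simplification for illustration, not the kernel code;
mdio_reset_model, model_reset() and model_boot_deassert_us() are
invented names, and the delay values are whatever the caller passes in.

```c
#include <stdbool.h>

/* Hypothetical model of an MDIO device's reset bookkeeping. */
struct mdio_reset_model {
	int state;                 /* -1 unknown, 0 deasserted, 1 asserted */
	unsigned int assert_us;    /* delay after asserting reset */
	unsigned int deassert_us;  /* delay after deasserting reset */
	unsigned long deassert_slept_us; /* total time slept after deasserts */
	bool skip_redundant;       /* true models the patched behaviour */
};

/* One reset call: optionally skip it if the state would not change. */
void model_reset(struct mdio_reset_model *m, int value)
{
	if (m->skip_redundant && m->state == value)
		return;

	m->state = value;
	if (!value)
		m->deassert_slept_us += m->deassert_us;
}

/* Replay the boot sequence: assert once, then deassert three times. */
unsigned long model_boot_deassert_us(unsigned int deassert_us,
				     bool skip_redundant)
{
	struct mdio_reset_model m = {
		.state = -1,
		.assert_us = 0,
		.deassert_us = deassert_us,
		.deassert_slept_us = 0,
		.skip_redundant = skip_redundant,
	};

	model_reset(&m, 1); /* 1. assert while registering */
	model_reset(&m, 0); /* 2. deassert while registering */
	model_reset(&m, 0); /* 3. deassert in phy_probe */
	model_reset(&m, 0); /* 4. deassert in phy_hw_init */

	return m.deassert_slept_us;
}
```

With a placeholder delay of 1000 us, model_boot_deassert_us(1000, false)
accumulates three deassert delays and model_boot_deassert_us(1000, true)
accumulates one, which is the 3x difference described above.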

I am not sure which upstream devicetree is the one to look at for
your BeagleBone, but Microchip's datasheet for the LAN8720A has a
"TABLE 5-8: POWER-ON NRST & ..." section detailing some reset requirements:

    https://ww1.microchip.com/downloads/en/devicedoc/00002165b.pdf

If I read it right, the assert time needs to be >= 100 us, and the
deassert time... is not so clear to me, unfortunately. Maybe for
starters triple your deassert value and see if things work OK (based
only on the 3 repeated deasserts going down to 1 with the patch
applied)? Hopefully, longer term, the actual deassert timing can be
confirmed.

Thanks,
Andrew