netdev - Re: Beaglebone Ethernet Probe Failure In 6.8+


Message-ID: <Zi7+xqp1GG6Jl/kU@colin-ia-desktop>
Date: Sun, 28 Apr 2024 20:58:30 -0500
From: Colin Foster <colin.foster@...advantage.com>
To: Andrew Halaney <ahalaney@...hat.com>
Cc: Andrew Lunn <andrew@...n.ch>, netdev@...r.kernel.org,
	linux-omap@...r.kernel.org
Subject: Re: Beaglebone Ethernet Probe Failure In 6.8+

Hi Andrew L and Andrew H,

Sorry for the delayed response. I couldn't get to testing anything until
just now.

On Tue, Apr 23, 2024 at 03:07:15PM -0500, Andrew Halaney wrote:
> On Tue, Apr 23, 2024 at 03:52:35PM +0200, Andrew Lunn wrote:
> > On Mon, Apr 22, 2024 at 11:00:51PM -0500, Colin Foster wrote:
> > 
> > In these two last transactions, the ACK bit is not set.
> > 
> > > [    1.550471] SMSC LAN8710/LAN8720: probe of 4a101000.mdio:00 failed with error -5
> > > [    1.550592] davinci_mdio 4a101000.mdio: phy[0]: device 4a101000.mdio:00, driver SMSC LAN8710/LAN8720
> > > 
> > > Without the mdiodev->reset_state patch, I see the following:
> > > 
> > > [    1.537817] davinci_mdio 4a101000.mdio: davinci mdio revision 1.6, bus freq 1000000
> > > [    1.538165] davinci mdio reg is 0x20400007
> > > [    1.538426] davinci mdio reg is 0x2060c0f1
> > 
> > Same as above.
> > 
> > > [    1.558442] davinci mdio reg is 0x23a00090
> > > [    1.558717] davinci mdio reg is 0x20207809
> > > [    1.559681] davinci mdio reg is 0x21c0ffff
> > 
> > In all these cases, we see the ACK bit set. 
> > 
> > So the PHY is responding to registers 2 and 3, the ID registers. But
> > it seems to be failing to respond to other registers. At a guess, I
> > would say it is still coming out of reset. Does the datasheet for the
> > LAN8710/LAN8720 say anything about how long a reset takes? Can you get
> > a logic analyser onto the reset line and MDIO bus and see how
> > different the timing is? It might be you need to add some delay values
> > to the reset in DT.
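
The ACK analysis above can be reproduced by decoding the logged values by
hand. The following is a minimal sketch, assuming the "davinci mdio reg is"
debug print shows the controller's USERACCESS register and that the bit
layout matches the defines in drivers/net/ethernet/ti/davinci_mdio.c (GO in
bit 31, WRITE in bit 30, ACK in bit 29, register address in bits 25-21, PHY
address in bits 20-16, data in bits 15-0). The function name
decode_useraccess is made up for illustration.

```python
# Assumed bit layout of the davinci MDIO USERACCESS register.
USERACCESS_GO = 1 << 31
USERACCESS_WRITE = 1 << 30
USERACCESS_ACK = 1 << 29
USERACCESS_DATA = 0xFFFF


def decode_useraccess(value):
    """Split a raw USERACCESS value into its fields (hypothetical helper)."""
    return {
        "go": bool(value & USERACCESS_GO),
        "write": bool(value & USERACCESS_WRITE),
        "ack": bool(value & USERACCESS_ACK),
        "reg": (value >> 21) & 0x1F,   # PHY register address
        "phy": (value >> 16) & 0x1F,   # PHY address on the bus
        "data": value & USERACCESS_DATA,
    }


# Values copied from the log lines quoted in this thread.
LOGGED = (0x20400007, 0x2060C0F1, 0x23A00090, 0x20207809, 0x21C0FFFF)

for raw in LOGGED:
    f = decode_useraccess(raw)
    print(f"0x{raw:08x}: phy={f['phy']} reg={f['reg']:2d} "
          f"ack={f['ack']} data=0x{f['data']:04x}")
```

Under that assumed layout, the first two values are reads of registers 2 and
3 (the ID registers) with ACK set, which matches the analysis quoted above.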

I don't think I'll be able to get onto those lines. But I do think this
is the right tree to bark up. I also found some kernelci logs that
suggest I'm not the only one seeing this issue:

https://storage.kernelci.org/mainline/master/v6.9-rc5/arm/multi_v7_defconfig/gcc-10/lab-cip/baseline-beaglebone-black.html

There might be ways to navigate the kernelci database that I'm not aware
of, but I couldn't reasonably say "before 6.8 it didn't happen, and
after 6.8 it did." I'm not sure that matters at this point though.

> 
> For what it's worth, I think that this theory makes sense if reverting the patch
> highlighted above makes this go away. Before that patch, you'd see a
> flow like this:
> 
>     net: phy: mdio_device: Reset device only when necessary
> 
>     Currently the phy reset sequence is as shown below for a
>     devicetree described mdio phy on boot:
> 
>     1. Assert the phy_device's reset as part of registering
>     2. Deassert the phy_device's reset as part of registering
>     3. Deassert the phy_device's reset as part of phy_probe
>     4. Deassert the phy_device's reset as part of phy_hw_init
> 
> Which means the deassert delay, whatever it was, was effectively tripled
> before you got around to phy_hw_init() (which, if I understand correctly,
> is where things start reporting no ACK above).
> 
> I am not sure which upstream devicetree would be the one to look at for
> your beaglebone, but Microchip's datasheet for the LAN8720A has a
> "TABLE 5-8: POWER-ON NRST & ..." section detailing some reset requirements:
> 
>     https://ww1.microchip.com/downloads/en/devicedoc/00002165b.pdf
> 
> If I read it right, assert time needs to be >= 100 us, and
> deassert... is not so clear to me unfortunately. Maybe for starters
> triple your value and see if things work ok (just based on the 3
> repeated deasserts going down to 1 with the patch applied)? Hopefully
> longer term the actual deassert timing can be confirmed.
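
The "tripled deassert time" reasoning above can be sketched as a small
model. This is a simplified illustration, not the kernel code: it assumes
every non-skipped deassert call sleeps for the configured deassert delay,
and that the patch skips a call when the requested reset state equals the
last one applied. The function name, the sequence encoding (1 = assert,
0 = deassert) and the 1000 us delay are hypothetical values chosen for the
example.

```python
def time_after_release_us(sequence, deassert_us, skip_redundant):
    """Return the total delay spent after the reset line was last asserted.

    sequence       -- reset requests in call order (1 = assert, 0 = deassert)
    deassert_us    -- delay applied after each deassert that is carried out
    skip_redundant -- True models the "reset only when necessary" behaviour
    """
    reset_state = None  # unknown at boot
    elapsed_us = 0
    for value in sequence:
        if skip_redundant and value == reset_state:
            continue  # redundant request: no GPIO write, no delay
        reset_state = value
        if value:
            elapsed_us = 0  # asserting reset restarts the clock
        else:
            elapsed_us += deassert_us
    return elapsed_us


# Steps 1-4 from the quoted commit message: assert, then three deasserts.
BOOT_SEQUENCE = [1, 0, 0, 0]
DEASSERT_US = 1000  # hypothetical devicetree value, for illustration only

before = time_after_release_us(BOOT_SEQUENCE, DEASSERT_US, skip_redundant=False)
after = time_after_release_us(BOOT_SEQUENCE, DEASSERT_US, skip_redundant=True)
print(f"settle time before patch: {before} us, after patch: {after} us")
```

In this model the PHY gets three deassert delays to settle before the patch
and only one after it, which is why tripling the devicetree value is a
plausible first experiment.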

I went all in and added a 100 ms delay before returning from the
deasserts in steps 3 and 4 of the list above. Sure enough, everything
worked! The delay certainly needs to be understood and optimized. I added
the linux-omap list to this thread (please let me know if there are
others I should've CC'd on any of these emails).

Either way, thank you both for helping me understand this! I hope to be
able to fix the issue, but at the very least I hope it is considered
"reported".


Colin Foster