netdev - Re: Beaglebone Ethernet Probe Failure In 6.8+

Message-ID: <ZiBgvRKbxrVSu6rR@euler>
Date: Wed, 17 Apr 2024 18:52:29 -0500
From: Colin Foster <colin.foster@...advantage.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org
Subject: Re: Beaglebone Ethernet Probe Failure In 6.8+

Hi Andrew,

On Wed, Apr 17, 2024 at 09:30:58PM +0200, Andrew Lunn wrote:
> On Wed, Apr 17, 2024 at 10:42:02AM -0500, Colin Foster wrote:
> > Hello,
> > 
> > I'm chasing down an issue in recent kernels. My setup is slightly
> > unconventional: a BBB with ETH0 as a CPU port to a DSA switch that is
> > controlled by SPI. I'll have hardware next week, but think it is worth
> > getting a discussion going.
> > 
> > The commit in question is commit df16c1c51d81 ("net: phy: mdio_device:
> > Reset device only when necessary"). This seems to cause a probe error of
> > the MDIO device. A dump_stack was added where the reset is skipped.
> > 
> > SMSC LAN8710/LAN8720: probe of 4a101000.mdio:00 failed with error -5
> 
> Can you confirm this EIO is this one:
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/ti/davinci_mdio.c#L440
> 
> It would be good to check the value of USERACCESS_ACK, and what the
> datasheet says about it.
> 
> The MDIO bus itself has no real way of telling whether there is a device
> on the bus at a given address, or whether the device actually transfers
> anything on a read. So if the resets are wrong, the device is still in
> reset, or coming out of reset but not yet ready, you should just read
> 0xffff. Returning EIO would indicate some other issue.

I'll look into this next week when I have hardware again.
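
For reference, the distinction described above (an absent or held-in-reset
device reads back as 0xffff, while a negative errno such as -EIO points at
the bus controller itself) can be sketched roughly as below. This is only
an illustration: the enum and classify_mdio_read() are hypothetical names,
not kernel APIs, and the sketch assumes the usual convention that an MDIO
read returns either a negative errno or a 16-bit register value.

```c
#include <errno.h>

/* Hypothetical classification of an MDIO read result (not a kernel API). */
enum mdio_read_status {
	MDIO_READ_OK,        /* a device drove the bus and returned data */
	MDIO_READ_NO_DEVICE, /* nothing answered, the read came back as 0xffff */
	MDIO_READ_BUS_ERROR, /* the controller reported a failure, e.g. -EIO */
};

/*
 * Assumption: ret is either a negative errno or a 16-bit register value,
 * which is how MDIO read callbacks conventionally report results.
 */
static enum mdio_read_status classify_mdio_read(int ret)
{
	if (ret < 0)
		return MDIO_READ_BUS_ERROR;
	if ((ret & 0xffff) == 0xffff)
		return MDIO_READ_NO_DEVICE;
	return MDIO_READ_OK;
}
```

Under this model, a PHY that is still in reset would be expected to land in
MDIO_READ_NO_DEVICE, so a probe failing with -5 falls into the
MDIO_READ_BUS_ERROR case instead.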

> 
> > Because this failure happens much earlier than DSA, I suspect it isn't
> > isolated to me and my setup - but I'm not positive at the moment.
> > 
> > I suspect one of the following:
> > 
> > 1. There's an issue with my setup / configuration.
> > 
> > 2. This is an issue for every BBB device, but probe failures don't
> > actually break functionality.
> > 
> > 
> > Depending on which of those is the case, I'll either need to:
> > 
> > A. Revert the patch because it is causing probe failures.
> > 
> > B. Determine why the probe is failing in the MDIO driver and try to fix
> > that.
> > 
> > C. Introduce an API to force resets, regardless of the previous state,
> > and apply that to the failure cases.
> > 
> > 
> > I assume the path forward is option B... but if the issue is more
> > widespread, options A or C might be the correct path.
> 
> I would prefer B; at least let's try to understand the
> problem. Depending on what we find, we might need A, but let's decide
> that later.

Agreed.

> 
> > [    1.553623] SMSC LAN8710/LAN8720: probe of 4a101000.mdio:00 failed with error -5
> > [    1.553762] davinci_mdio 4a101000.mdio: phy[0]: device 4a101000.mdio:00, driver SMSC LAN8710/LAN8720
> > [    1.554978] cpsw-switch 4a100000.switch: initialized cpsw ale version 1.4
> > [    1.555011] cpsw-switch 4a100000.switch: ALE Table size 1024
> > [    1.555210] cpsw-switch 4a100000.switch: cpts: overflow check period 500 (jiffies)
> > [    1.555234] cpsw-switch 4a100000.switch: CPTS: ref_clk_freq:250000000 calc_mult:2147483648 calc_shift:29 error:0 nsec/sec
> > [    1.555343] cpsw-switch 4a100000.switch: Detected MACID = 24:76:25:76:35:37
> > [    1.558098] cpsw-switch 4a100000.switch: initialized (regs 0x4a100000, pool size 256) hw_ver:0019010C 1.12 (0)
> 
> So despite the -EIO, it finds the PHY, and the switch seems to probe
> O.K?

Yes. The issue I face is actually down the line when I enable the DSA
ports. I haven't diagnosed it yet, but a separate reset happens from
within phy_init_hw.

Here I've kept the dump_stack() from the patch, but removed the
return, so it is functional.

This is why it seems like it might be a bug that everyone is seeing, but
nobody is noticing... I hope to know more next week.
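
To make the behavior under discussion concrete, here is a rough userspace
model of the "reset only when necessary" idea. It assumes the patch caches
the last requested reset state and returns early when the same state is
requested again; that early return is the spot where the dump_stack() was
placed. The struct, its fields, and the force flag (which models option C)
are illustrative names only, not the actual kernel code.

```c
#include <stdbool.h>

/* Userspace model of an MDIO device's reset line; names are illustrative. */
struct mdio_dev_model {
	int reset_state;  /* -1 = unknown, 0 = deasserted, 1 = asserted */
	int gpio_toggles; /* how many times the reset line was actually driven */
};

/*
 * Drive the reset line to 'value' unless it is already believed to be in
 * that state. 'force' is a hypothetical knob modelling option C: drive the
 * line regardless of the cached state. Returns true if the line was driven.
 */
static bool model_reset(struct mdio_dev_model *dev, int value, bool force)
{
	if (!force && dev->reset_state == value)
		return false; /* skipped: cached state already matches */

	dev->gpio_toggles++;
	dev->reset_state = value;
	return true;
}
```

In this model the second of two identical requests is skipped, which is
harmless only if the cached state matches what the hardware is really
doing. If the two can disagree, a forced reset is the only request that is
guaranteed to reach the line.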

[    8.581463] EXT4-fs (mmcblk0p2): re-mounted 084255e0-9101-48d6-af17-9601fd9c5a1d r/w. Quota mode: disabled.
[   32.500235] cpsw-switch 4a100000.switch: starting ndev. mode: dual_mac
[   32.522610] CPU: 0 PID: 166 Comm: ip Not tainted 6.7.0-rc3-00667-gdf16c1c51d81-dirty #1408
[   32.530962] Hardware name: Generic AM33XX (Flattened Device Tree)
[   32.537090] Backtrace: 
[   32.539561]  dump_backtrace from show_stack+0x20/0x24
[   32.550363]  show_stack from dump_stack_lvl+0x60/0x78
[   32.555461]  dump_stack_lvl from dump_stack+0x18/0x1c
[   32.566238]  dump_stack from mdio_device_reset+0xc4/0x108
[   32.571685]  mdio_device_reset from phy_init_hw+0x20/0xb8
[   32.580713]  phy_init_hw from phy_attach_direct+0x148/0x340
[   32.589911]  phy_attach_direct from phy_connect_direct+0x2c/0x68
[   32.607416]  phy_connect_direct from of_phy_connect+0x54/0x7c
[   32.618889]  of_phy_connect from cpsw_ndo_open+0x30c/0x4e4
[   32.630096]  cpsw_ndo_open from __dev_open+0xfc/0x1b0
[   32.645608]  __dev_open from __dev_change_flags+0x198/0x218
[   32.656909]  __dev_change_flags from dev_change_flags+0x28/0x64
[   32.670656]  dev_change_flags from do_setlink+0x258/0xed4
[   32.681789]  do_setlink from rtnl_newlink+0x544/0x87c
[   32.697294]  rtnl_newlink from rtnetlink_rcv_msg+0x138/0x318
[   32.713408]  rtnetlink_rcv_msg from netlink_rcv_skb+0xc8/0x12c
[   32.729702]  netlink_rcv_skb from rtnetlink_rcv+0x20/0x24
[   32.740825]  rtnetlink_rcv from netlink_unicast+0x1b0/0x2a4
[   32.746435]  netlink_unicast from netlink_sendmsg+0x1a4/0x408
[   32.760001]  netlink_sendmsg from ____sys_sendmsg+0xb8/0x2c4
[   32.776110]  ____sys_sendmsg from ___sys_sendmsg+0x7c/0xb4
[   32.792046]  ___sys_sendmsg from sys_sendmsg+0x60/0xa8
[   32.803952]  sys_sendmsg from ret_fast_syscall+0x0/0x1c
[   32.809212] Exception stack(0xe0c3dfa8 to 0xe0c3dff0)
[   32.814295] dfa0:                   00000002 0054ecc8 00000003 bec65790 00000000 00000000
[   32.822514] dfc0: 00000002 0054ecc8 b6f54880 00000128 00000000 00000001 bec65f32 bec65f35
[   32.830731] dfe0: 00000128 bec65748 b6e4e52f b6dcce06
[   32.835809]  r6:b6f54880 r5:0054ecc8 r4:00000002
[   32.979240] SMSC LAN8710/LAN8720 4a101000.mdio:00: attached PHY driver (mii_bus:phy_addr=4a101000.mdio:00, irq=POLL)
[   32.994721] 8021q: adding VLAN 0 to HW filter on device eth0
[   33.020751] ocelot-ext-switch ocelot-ext-switch.5.auto swp1: configuring for phy/internal link mode
[   33.055444] ocelot-ext-switch ocelot-ext-switch.5.auto swp2: configuring for phy/internal link mode
[   33.089784] ocelot-ext-switch ocelot-ext-switch.5.auto swp3: configuring for phy/internal link mode
[   33.124241] ocelot-ext-switch ocelot-ext-switch.5.auto swp4: configuring for phy/qsgmii link mode
[   33.161283] ocelot-ext-switch ocelot-ext-switch.5.auto swp5: configuring for phy/qsgmii link mode
[   33.198704] ocelot-ext-switch ocelot-ext-switch.5.auto swp6: configuring for phy/qsgmii link mode
[   33.235518] ocelot-ext-switch ocelot-ext-switch.5.auto swp7: configuring for phy/qsgmii link mode


Colin Foster