[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20201106095611.7e5912a0@gollum>
Date: Fri, 6 Nov 2020 09:56:11 +0100
From: Juerg Haefliger <juerg.haefliger@...onical.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: Juerg Haefliger <juerg.haefliger@...onical.com>,
netdev@...r.kernel.org, woojung.huh@...rochip.com
Subject: Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
On Tue, 3 Nov 2020 16:27:23 +0100
Andrew Lunn <andrew@...n.ch> wrote:
> On Tue, Nov 03, 2020 at 01:47:12PM +0100, Juerg Haefliger wrote:
> > On Fri, 23 Oct 2020 15:05:19 +0200
> > Andrew Lunn <andrew@...n.ch> wrote:
> >
> > > On Fri, Oct 23, 2020 at 08:29:59AM +0200, Juerg Haefliger wrote:
> > > > On Wed, 21 Oct 2020 21:35:48 +0200
> > > > Andrew Lunn <andrew@...n.ch> wrote:
> > > >
> > > > > On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:
> > > > > > Hi,
> > > > > >
> > > > > > If the lan78xx driver is compiled into the kernel and the network cable is
> > > > > > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > > > > > toggle if the cable is unplugged and replugged.
> > > > > >
> > > > > > If the network cable is *not* plugged in at boot, all seems to work fine.
> > > > > > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > > > > >
> > > > > > Also, everything seems to work fine if the driver is compiled as a module.
> > > > > >
> > > > > > There's an older ticket for the raspi kernel [1] but I've just tested this
> > > > > > with a 5.8 kernel on a Pi 3B+ and still see that behavior.
> > > > >
> > > > > Hi Jürg
> > > >
> > > > Hi Andrew,
> > > >
> > > >
> > > > > Could you check if a different PHY driver is being used when it is
> > > > > built and broken vs module or built in and working.
> > > > >
> > > > > Look at /sys/class/net/eth0/phydev/driver
> > > >
> > > > There's no such file.
> > >
> > > I _think_ that means it is using genphy, the generic PHY driver, not a
> > > specific vendor PHY driver? What does
> > >
> > > /sys/class/net/eth0/phydev/phy_id contain.
> >
> > There is no directory /sys/class/net/eth0/phydev.
>
> [Goes and looks at the code]
>
> The symbolic link is only created if the PHY is connected to the MAC
> if the MAC has been registered with the core first. lan78xx does it
> the other way around:
>
> ret = lan78xx_phy_init(dev);
> if (ret < 0)
> goto out4;
>
> ret = register_netdev(netdev);
> if (ret != 0) {
> netif_err(dev, probe, netdev, "couldn't register the device\n");
> goto out5;
> }
>
> The register dump you show below indicates an ID of 007c132, which
> fits the drivers drivers/net/phy/microchip.c : "Microchip
> LAN88xx". Any mention of that in dmesg, do you see the module loaded?
It's built-in but is being used (as tracing shows).
> >
> > > > Given that all works fine as long as the cable is unplugged at boot points
> > > > more towards a race at boot or incorrect initialization sequence or something.
> > >
> > > Could be. Could you run
> > >
> > > mii-tool -vv eth0
> >
> > Hrm. Running that command unlocks the carrier flag and it starts toggling on
> > cable unplug/plug. First invocation:
> >
> > $ sudo mii-tool -vv eth0
> > Using SIOCGMIIPHY=0x8947
> > eth0: negotiated 1000baseT-FD flow-control, link ok
> > registers for MII PHY 1:
> > 1040 79ed 0007 c132 05e1 cde1 000f 0000
> > 0000 0200 0800 0000 0000 0000 0000 3000
> > 0000 0000 0088 0000 0000 0000 3200 0004
> > 0040 a000 a000 0000 a035 0000 0000 0000
> > product info: vendor 00:01:f0, model 19 rev 2
> > basic mode: autonegotiation enabled
> > basic status: autonegotiation complete, link ok
> > capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> > advertising: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> > link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> >
> > Subsequent invocation:
> >
> > $ sudo mii-tool -vv eth0
> > Using SIOCGMIIPHY=0x8947
> > eth0: negotiated 1000baseT-FD flow-control, link ok
> > registers for MII PHY 1:
> > 1040 79ed 0007 c132 05e1 cde1 000d 0000
> > 0000 0200 0800 0000 0000 0000 0000 3000
> > 0000 0000 0088 0000 0000 0000 3200 0004
> > 0040 a000 0000 0000 a035 0000 0000 0000
> > product info: vendor 00:01:f0, model 19 rev 2
> > basic mode: autonegotiation enabled
> > basic status: autonegotiation complete, link ok
> > capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> > advertising: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> > link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> >
> > In the first invocation, register 0x1a shows a pending link-change interrupt
> > (0xa000) which wasn't serviced (and cleared) for some reason. Dumping the
> > registers cleared that interrupt bit and things start working correctly
> > afterwards. Nor sure yet why that first interrupt is ignored.
>
> So, 0x1a is the interrupt status, and 0x19 is the interrupt mask.
Yes.
> This should really be interpreted as a level interrupt. But it appears
> the hardware the interrupt is connected to is actually doing edge. And
> the edge has been missed, and so the interrupt is never serviced.
>
> I think the call sequence goes something like this, if i'm reading the
> code correct:
>
> lan78xx_probe() calls lan78xx_bind:
>
> lan78xx_bind() registers an interrupt domain. This allows USB
> status messages indicating an interrupt to be dispatched using the
> normal interrupt mechanism, despite there not being a proper
> interrupt as such. The masking of interrupts seems to be part of
> INT_EP_CTL. This register is read during
> lan78xx_setup_irq_domain(), but it is not written to disable all
> interrupts. So it could be, PHY interrupts in the interrupt
> controller are already enabled at this point.
>
> lan78xx_bind() also registers an MDIO bus. This will cause the bus
> to be probed and the PHY should be found.
>
> lan78xx_probe() calls lan78xx_phy_init:
>
> lan78xx_phy_init gets the phydev, and fills in the interrupt
> number. It then connects the PHY to the MAC using
> phy_connect_direct(). phy_connect_direct() will request the
> interrupt from the kernel, meaning it can then service interrupts,
> and as part of that, it calls into the interrupt domain and
> enables interrupts in the interrupt controller. It clears the
> interrupts in the PHY, which means the PHY interrupt status
> register is read, clearing it. It then enables interrupts in the
> PHY by again reading the status register to clear any pending
> interrupts and then sets the mask to enable interrupts.
>
> So at this point, the PHY is ready to generate interrupts, the
> interrupt controller in the USB device is ready to accept them. The
> kernel itself is read for them, and they should be passed to the PHY
> subsystem. What is not clear to me is if the USB endpoint is correctly
> setup to report interrupts.
>
> Interestingly, the last thing lan78xx_phy_init() does is call
> genphy_config_aneg(). That configures the PHY to start an
> auto-neg. That is wrong, on a number of levels. The PHY drivers
> implementation, lan88xx_config_aneg() sets the mdix before
> starting autoneg. That gets skipped by directly calling
> genphy_config_aneg(). However, the interface is not even
> registered yet, let alone up. It has no business starting an
> auto-neg. But if the cable is connected and the peer is ready,
> about 1.5 seconds later, we expect auto-neg to complete and the
> interrupt to fire.
>
> Sometime later, lan78xx_open() is called when the interface is
> configured up. It calls phy_start() which is the correct way to get
> the PHY going. That will call into the PHY driver to start auto-neg.
> I've no idea what happens in this PHY when two auto-negs are going.
>
> What lan78xx_open() also does is trigger a delayed work for
> EVENT_LINK_RESET. lan78xx_link_reset() gets called as a result, and
> that writes to the interrupt status register to clear it. Why clear
> interrupts when we have just started auto-neg and we expect it to
> cause an interrupt? The whole of lan78xx_link_reset() looks odd, and
> some of it should be moved to lan78xx_link_status_change() which
> phylib will call when the PHY changes status.
>
> O.K. As a start, remove the in from
> lan78xx_phy_init(). I don't know if that is enough, but it is clearly
> wrong.
The interface won't come up with that change. FWIW I recorded some traces
(functions phy_*, lan88xx_*, lan78xx_*):
[1] lan78xx built-in.
[2] lan78xx built-in without genphy_config_aneg() in lan78xx_phy_init().
[3] lan78xx as a module.
I've also added some trace_printks that dump the content of the status and
mask registers at the end of lan78xx_probe() (no interrupt pending) and in
lan78xx_get_drvinfo() which seems to be the next lan78xx function being
called after probing (per the built-in trace). At that time an interrupt
seems to be pending (lan78xx_get_drvinfo LAN88XX_INT_STS = 0xa000).
...Juerg
[1] https://kernel.ubuntu.com/~juergh/lp1890487/trace.builtin
[2] https://kernel.ubuntu.com/~juergh/lp1890487/trace.builtin-no-aneg
[3] https://kernel.ubuntu.com/~juergh/lp1890487/trace.module
> Andrew
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists