lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y4klbgDIuxHXaWrC@lunn.ch>
Date:   Thu, 1 Dec 2022 23:06:38 +0100
From:   Andrew Lunn <andrew@...n.ch>
To:     Michael Walle <michael@...le.cc>
Cc:     Heiner Kallweit <hkallweit1@...il.com>,
        Russell King <linux@...linux.org.uk>,
        Horatiu Vultur <horatiu.vultur@...rochip.com>,
        netdev@...r.kernel.org, Xu Liang <lxu@...linear.com>
Subject: Re: GPY215 PHY interrupt issue

On Thu, Dec 01, 2022 at 10:31:19PM +0100, Michael Walle wrote:
> Am 2022-12-01 16:54, schrieb Andrew Lunn:
> > > So, switching the line to GPIO input doesn't help here, which also
> > > means the interrupt line will be stuck the whole time.
> > 
> > Sounds like they totally messed up the design somehow.
> > 
> > Since we are into horrible hack territory.....
> > 
> > I assume you are using the Link state change interrupt? LSTC?
> 
> Yes, but recently I've found it that it also happens with
> the speed change interrupt (during link-up). By pure luck (or
> bad luck really?) I discovered that when I reduce the MDIO
> frequency I get a similar behavior for the interrupt line
> at link-up with the LSPC interrupt. I don't think it has
> something to do with the frequency but with changed timing.

So at a wild guess, there is some sort of race condition between the
2.5MHz ish MDIO clock and the rest of the system which is probably
clocked at 25Mhz.

We have to hope this is limited to just interrupts! Not any MDIO bus
transaction.

> +	/* The PHY might leave the interrupt line asserted even after PHY_ISTAT
> +	* is read. To avoid interrupt storms, delay the interrupt handling as
> +	* long as the PHY drives the interrupt line. An internal bus read will
> +	* stall as long as the interrupt line is asserted, thus just read a
> +	* random register here.
> +	* Because we cannot access the internal bus at all while the interrupt
> +	* is driven by the PHY, there is no way to make the interrupt line
> +	* unstuck (e.g. by changing the pinmux to GPIO input) during that time
> +	* frame. Therefore, polling is the best we can do and won't do any more
> +	* harm.
> +	* It was observed that this bug happens on link state and link speed
> +	* changes on a GPY215B and GYP215C independent of the firmware version
> +	* (which doesn't mean that this list is exhaustive).
> +	*/
> +	if (needs_mdint_wa && (reg & (PHY_IMASK_LSTC | PHY_IMASK_LSPC)))
> +		gpy_mbox_read(phydev, REG_GPIO0_OUT);
> +
>  	phy_trigger_machine(phydev);
> 
>  	return IRQ_HANDLED;

So this delayed exiting the interrupt handler until the line actually
return to normal. And during that time, the interrupt is disabled. And
hence the storm is avoided.

I'm assuming gpy_mbox_read() has a timeout, so if the PHY completely
jams, it does exit? If that happens, maybe call phy_error() to
indicate the PHY is dead. And don't return IRQ_HANDLED, but
IRQ_NONE. I think the IRQ core should notice the storm for an
interrupt which nobody cares about and disable the interrupt.
Probably not much you can do after that, but at least the machine
won't be totally dead.

> It seems like at least these two are :/ So with the code above
> we could avoid the interrupt storm but we can't do anything about
> the blocked interrupt line. I'm unsure if that is acceptable or
> if we'd have to disable interrupts on this PHY altogether and
> fallback to polling.

It probably depends on your system design. If this is the only PHY and
the storm can be avoided, it is probably O.K. If you have other PHYs
sharing the line, and those PHYs are doing time sensitive stuff like
PTP, maybe polling would be better.

Maybe add a kconfig symbol CONFIG_MAXLINEAR_GPHY_BROKEN_INTERRUPTS and
make all the interrupt code conditional on this? Hopefully distros
won't enable it, but the option is there to buy into the behaviour if
you want?

      Andrew

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ