netdev - Re: GPY215 PHY interrupt issue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1d186774cfa4173955c89e7262b1d1b7@walle.cc>
Date:   Fri, 02 Dec 2022 00:24:01 +0100
From:   Michael Walle <michael@...le.cc>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     Heiner Kallweit <hkallweit1@...il.com>,
        Russell King <linux@...linux.org.uk>,
        Horatiu Vultur <horatiu.vultur@...rochip.com>,
        netdev@...r.kernel.org, Xu Liang <lxu@...linear.com>
Subject: Re: GPY215 PHY interrupt issue

Am 2022-12-01 23:06, schrieb Andrew Lunn:
> On Thu, Dec 01, 2022 at 10:31:19PM +0100, Michael Walle wrote:
>> Am 2022-12-01 16:54, schrieb Andrew Lunn:
>> > > So, switching the line to GPIO input doesn't help here, which also
>> > > means the interrupt line will be stuck the whole time.
>> >
>> > Sounds like they totally messed up the design somehow.
>> >
>> > Since we are into horrible hack territory.....
>> >
>> > I assume you are using the Link state change interrupt? LSTC?
>> 
>> Yes, but recently I've found it that it also happens with
>> the speed change interrupt (during link-up). By pure luck (or
>> bad luck really?) I discovered that when I reduce the MDIO
>> frequency I get a similar behavior for the interrupt line
>> at link-up with the LSPC interrupt. I don't think it has
>> something to do with the frequency but with changed timing.
> 
> So at a wild guess, there is some sort of race condition between the
> 2.5MHz ish MDIO clock and the rest of the system which is probably
> clocked at 25Mhz.
> 
> We have to hope this is limited to just interrupts! Not any MDIO bus
> transaction.
> 
>> +	/* The PHY might leave the interrupt line asserted even after 
>> PHY_ISTAT
>> +	* is read. To avoid interrupt storms, delay the interrupt handling 
>> as
>> +	* long as the PHY drives the interrupt line. An internal bus read 
>> will
>> +	* stall as long as the interrupt line is asserted, thus just read a
>> +	* random register here.
>> +	* Because we cannot access the internal bus at all while the 
>> interrupt
>> +	* is driven by the PHY, there is no way to make the interrupt line
>> +	* unstuck (e.g. by changing the pinmux to GPIO input) during that 
>> time
>> +	* frame. Therefore, polling is the best we can do and won't do any 
>> more
>> +	* harm.
>> +	* It was observed that this bug happens on link state and link speed
>> +	* changes on a GPY215B and GYP215C independent of the firmware 
>> version
>> +	* (which doesn't mean that this list is exhaustive).
>> +	*/
>> +	if (needs_mdint_wa && (reg & (PHY_IMASK_LSTC | PHY_IMASK_LSPC)))
>> +		gpy_mbox_read(phydev, REG_GPIO0_OUT);
>> +
>>  	phy_trigger_machine(phydev);
>> 
>>  	return IRQ_HANDLED;
> 
> So this delayed exiting the interrupt handler until the line actually
> return to normal. And during that time, the interrupt is disabled. And
> hence the storm is avoided.

Exactly. Almost like your idea with polling the interrupt line
in GPIO mode.

> I'm assuming gpy_mbox_read() has a timeout, so if the PHY completely
> jams, it does exit? If that happens, maybe call phy_error() to
> indicate the PHY is dead. And don't return IRQ_HANDLED, but
> IRQ_NONE. I think the IRQ core should notice the storm for an
> interrupt which nobody cares about and disable the interrupt.
> Probably not much you can do after that, but at least the machine
> won't be totally dead.

Ah, yes. In the earlier code (that changed the pin to GPIO input),
I had that error handling. Missed that here, I'll add it.
And yes, there is an timeout of 10ms.

>> It seems like at least these two are :/ So with the code above
>> we could avoid the interrupt storm but we can't do anything about
>> the blocked interrupt line. I'm unsure if that is acceptable or
>> if we'd have to disable interrupts on this PHY altogether and
>> fallback to polling.
> 
> It probably depends on your system design. If this is the only PHY and
> the storm can be avoided, it is probably O.K. If you have other PHYs
> sharing the line, and those PHYs are doing time sensitive stuff like
> PTP, maybe polling would be better.
> 
> Maybe add a kconfig symbol CONFIG_MAXLINEAR_GPHY_BROKEN_INTERRUPTS and
> make all the interrupt code conditional on this? Hopefully distros
> won't enable it, but the option is there to buy into the behaviour if
> you want?

I don't even dare to ask, but wouldn't a device tree property
maxlinear,use-broken-interrupts make more sense than a compile time
option? I'm fine with both.

-michael