netdev - Re: [PATCH] net: usb: lan78xx: Enforce a minimum interrupt polling period

Message-ID: <dc8ef510-8f7d-4c96-9fd8-76b67a22aaf9@gmx.de>
Date: Thu, 10 Apr 2025 16:14:14 +0200
From: Fiona Klute <fiona.klute@....de>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org, Thangaraj Samynathan <Thangaraj.S@...rochip.com>,
 Rengarajan Sundararajan <Rengarajan.S@...rochip.com>,
 UNGLinuxDriver@...rochip.com, Andrew Lunn <andrew+netdev@...n.ch>,
 "David S . Miller" <davem@...emloft.net>, Eric Dumazet
 <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, linux-usb@...r.kernel.org,
 linux-kernel@...r.kernel.org, kernel-list@...pberrypi.com,
 stable@...r.kernel.org
Subject: Re: [PATCH] net: usb: lan78xx: Enforce a minimum interrupt polling
 period

On 11.03.25 at 14:22, Andrew Lunn wrote:
> On Tue, Mar 11, 2025 at 01:30:54PM +0100, Fiona Klute wrote:
>> On 10.03.25 at 22:27, Andrew Lunn wrote:
>>> On Mon, Mar 10, 2025 at 05:59:31PM +0100, Fiona Klute wrote:
>>>> If a new reset event appears before the previous one has been
>>>> processed, the device can get stuck in a reset loop. This happens
>>>> rarely, but blocks the device when it does, and floods the log with
>>>> messages like the following:
>>>>
>>>>     lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>>>>
>>>> The only bit that the driver pays attention to in the interrupt data
>>>> is "link was reset". If there's a flapping status bit in that endpoint
>>>> data (such as if PHY negotiation needs a few tries to get a stable
>>>> link), polling at a slower rate allows the state to settle.
>>>
>>> Could you expand on this a little bit more. What is the issue you are
>>> seeing?
>>
>> What happens is that *sometimes* when the interface is activated (up, in
>> my case via NetworkManager) during boot, the "kevent 4 may have been
>> dropped" message starts to be emitted about every 6 or 7 ms.
>
> This sounds a bit like an interrupt storm. The PHY interrupt is not
> being cleared correctly. PHY interrupts are level interrupts, so if
> you don't clear the interrupt at the source, it will fire again as
> soon as you re-enable it.
>
> So which PHY driver is being used? If you look for the first kernel
> message about the lan78xx it probably tells you.
>
>> [   27.918335] Call trace:
>> [   27.918338]  console_flush_all+0x2b0/0x4f8 (P)
>> [   27.918346]  console_unlock+0x8c/0x170
>> [   27.918352]  vprintk_emit+0x238/0x3b8
>> [   27.918357]  dev_vprintk_emit+0xe4/0x1b8
>> [   27.918364]  dev_printk_emit+0x64/0x98
>> [   27.918368]  __netdev_printk+0xc8/0x228
>> [   27.918376]  netdev_info+0x70/0xa8
>> [   27.918382]  phy_print_status+0xcc/0x138
>> [   27.918386]  lan78xx_link_status_change+0x78/0xb0
>> [   27.918392]  phy_link_change+0x38/0x70
>> [   27.918398]  phy_check_link_status+0xa8/0x110
>> [   27.918405]  _phy_start_aneg+0x5c/0xb8
>> [   27.918409]  lan88xx_link_change_notify+0x5c/0x128
>> [   27.918416]  _phy_state_machine+0x12c/0x2b0
>> [   27.918420]  phy_state_machine+0x34/0x80
>> [   27.918425]  process_one_work+0x150/0x3b8
>> [   27.918432]  worker_thread+0x2a4/0x4b8
>> [   27.918438]  kthread+0xec/0xf8
>> [   27.918442]  ret_from_fork+0x10/0x20
>> [   27.918534] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>> [   27.924985] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>
> Ah, O.K. This tells me the PHY is a lan88xx. And there is a workaround
> involved for an issue in this PHY. Often PHYs are driven by polling
> for status changes once per second. Not all PHYs/boards support
> interrupts. It could be this workaround has only been tested with
> polling, not interrupts, and so is broken when interrupts are used.
>
> As a quick hack test, in lan78xx_phy_init()
>
> 	/* if phyirq is not set, use polling mode in phylib */
> 	if (dev->domain_data.phyirq > 0)
> 		phydev->irq = dev->domain_data.phyirq;
> 	else
> 		phydev->irq = PHY_POLL;
>
> Hard code phydev->irq to PHY_POLL, so interrupts are not used.
>
> See if you can reproduce the issue when interrupts are not used.
It took a while, but I'm now fairly confident that the workaround works:
I've had over 1000 boots on the hardware in question and didn't see the
bug. Someone going by upsampled reported the same in the issue on GitHub
[1], and pointed out that people working with some Nvidia board and a
LAN7800 USB device came to the same conclusion a while ago [2].

That leaves me with the question, what does that mean going forward?
Would it make sense to add a quirk to unconditionally force polling on
lan88xx, at least until/unless the interrupt handling can be fixed?
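
For illustration, the selection logic of such a quirk might boil down to
something like the following. This is a standalone userspace sketch, not
driver code: struct mock_phy, its is_lan88xx flag, and select_phy_irq()
are hypothetical names made up for this example, and only PHY_POLL (-1)
mirrors the kernel's definition. A real quirk would presumably identify
the PHY by its ID inside lan78xx_phy_init() rather than by a flag.

```c
#include <stdbool.h>

/* Same value as the kernel's PHY_POLL; everything else below is a
 * simplified stand-in for the real driver structures. */
#define PHY_POLL (-1)

/* Hypothetical stand-in for the PHY device state. */
struct mock_phy {
	bool is_lan88xx;	/* assumption: set when the attached PHY is a lan88xx */
};

/*
 * Hypothetical helper: choose the IRQ value to assign to phydev->irq.
 * Polling is forced for lan88xx regardless of phyirq; otherwise the
 * existing rule applies (use phyirq if it is set, else poll).
 */
static int select_phy_irq(const struct mock_phy *phy, int phyirq)
{
	if (phy->is_lan88xx)
		return PHY_POLL;

	return phyirq > 0 ? phyirq : PHY_POLL;
}
```

With this, a lan88xx would always end up in polling mode even when a PHY
interrupt is available, while other PHYs keep the current behaviour.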

Best regards,
Fiona

[1] https://github.com/raspberrypi/linux/issues/2447#issuecomment-2772789088
[2] https://forums.developer.nvidia.com/t/jetson-xavier-and-lan7800-problem/142134/11