linux-kernel - Re: [PATCH] net: usb: lan78xx: Enforce a minimum interrupt polling period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4577e7d7-cadc-41c6-b93f-eca7d5a8eb46@gmx.de>
Date: Tue, 11 Mar 2025 13:30:54 +0100
From: Fiona Klute <fiona.klute@....de>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org, Thangaraj Samynathan <Thangaraj.S@...rochip.com>,
 Rengarajan Sundararajan <Rengarajan.S@...rochip.com>,
 UNGLinuxDriver@...rochip.com, Andrew Lunn <andrew+netdev@...n.ch>,
 "David S . Miller" <davem@...emloft.net>, Eric Dumazet
 <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, linux-usb@...r.kernel.org,
 linux-kernel@...r.kernel.org, kernel-list@...pberrypi.com,
 stable@...r.kernel.org
Subject: Re: [PATCH] net: usb: lan78xx: Enforce a minimum interrupt polling
 period

Am 10.03.25 um 22:27 schrieb Andrew Lunn:
> On Mon, Mar 10, 2025 at 05:59:31PM +0100, Fiona Klute wrote:
>> If a new reset event appears before the previous one has been
>> processed, the device can get stuck into a reset loop. This happens
>> rarely, but blocks the device when it does, and floods the log with
>> messages like the following:
>>
>>    lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>>
>> The only bit that the driver pays attention to in the interrupt data
>> is "link was reset". If there's a flapping status bit in that endpoint
>> data (such as if PHY negotiation needs a few tries to get a stable
>> link), polling at a slower rate allows the state to settle.
>
> Could you expand on this a little bit more. What is the issue you are
> seeing?

What happens is that *sometimes* when the interface is activated (up, im
my case via NetworkManager) during boot, the "kevent 4 may have been
dropped" message starts to be emitted about every 6 or 7 ms. The
interface is not usable. Today I also got this trace (and an equivalent
one at about 90s):

[   27.905405] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
[   27.911834] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
[   27.918199] rcu: INFO: rcu_sched self-detected stall on CPU
[   27.918205] rcu:   0-....: (3351 ticks this GP)
idle=e6bc/1/0x4000000000000000 softirq=671/671 fqs=2300
[   27.918214] rcu:   (t=5250 jiffies g=113 q=10467 ncpus=4)
[   27.918224] CPU: 0 UID: 0 PID: 64 Comm: kworker/0:4 Tainted: G
  C         6.13.5 #1
[   27.918232] Tainted: [C]=CRAP
[   27.918234] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[   27.918238] Workqueue: events_power_efficient phy_state_machine
[   27.918252] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[   27.918257] pc : console_flush_all+0x2b0/0x4f8
[   27.918266] lr : console_flush_all+0x3f0/0x4f8
[   27.918271] sp : ffff8000805c3880
[   27.918273] x29: ffff8000805c38d0 x28: 0000000000000001 x27:
ffffcb3e232df2e8
[   27.918280] x26: ffffcb3e231bdaf0 x25: 0000000000000000 x24:
ffffcb3e231d40b8
[   27.918287] x23: 0000000000000000 x22: ffff8000805c3940 x21:
0000000000000000
[   27.918293] x20: ffff8000805c393b x19: ffffcb3e233950c8 x18:
ffff8000805c3560
[   27.918299] x17: 6c6f72746e6f6320 x16: 776f6c66202d206c x15:
ffff8001005c3717
[   27.918305] x14: 0000000000000000 x13: 646570706f726420 x12:
6e65656220657661
[   27.918311] x11: 682079616d203420 x10: 746e6576656b203a x9 :
ffffcb3e22b36c0c
[   27.918317] x8 : ffff8000805c3788 x7 : 0000000000000000 x6 :
ffff8000805c3840
[   27.918323] x5 : ffffcb3e231be000 x4 : 0000000000000000 x3 :
ffffcb3e231be3e8
[   27.918329] x2 : 0000000000000000 x1 : ffffcb3e21f25414 x0 :
ffff9e55d866a000
[   27.918335] Call trace:
[   27.918338]  console_flush_all+0x2b0/0x4f8 (P)
[   27.918346]  console_unlock+0x8c/0x170
[   27.918352]  vprintk_emit+0x238/0x3b8
[   27.918357]  dev_vprintk_emit+0xe4/0x1b8
[   27.918364]  dev_printk_emit+0x64/0x98
[   27.918368]  __netdev_printk+0xc8/0x228
[   27.918376]  netdev_info+0x70/0xa8
[   27.918382]  phy_print_status+0xcc/0x138
[   27.918386]  lan78xx_link_status_change+0x78/0xb0
[   27.918392]  phy_link_change+0x38/0x70
[   27.918398]  phy_check_link_status+0xa8/0x110
[   27.918405]  _phy_start_aneg+0x5c/0xb8
[   27.918409]  lan88xx_link_change_notify+0x5c/0x128
[   27.918416]  _phy_state_machine+0x12c/0x2b0
[   27.918420]  phy_state_machine+0x34/0x80
[   27.918425]  process_one_work+0x150/0x3b8
[   27.918432]  worker_thread+0x2a4/0x4b8
[   27.918438]  kthread+0xec/0xf8
[   27.918442]  ret_from_fork+0x10/0x20
[   27.918534] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
[   27.924985] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped

In this case the interface recovered about 150s after boot (the messages
stopped and it became active), which correlates with when I logged in
over serial to check what is going on. I do not know if that *caused* it
to recover, the log gives no indication, but it's the first time I've
seen the interface get out of the stuck state (before it was in CI runs
where the device simply got reset, and I could only check logs
afterwards). Unfortunately I have no reliable way to reproduce the bug,
it just appears randomly during some boots.

Another problem is that this was *with* the patch, so it's clearly not a
correct solution and at most makes the issue less frequent (I had tested
 >300 boots successfully before sending it).

> I had a quick look at the PHY handling code, and it looks broken. The
> only time a MAC driver should look at members of phydev is during the
> adjust link callback, so lan78xx_link_status_change(). Everything is
> guaranteed to be consistent at this time. However, the current
> lan78xx_link_status_change() only adjusts EEE setting. The PHY code in
> lan78xx_link_reset() looks wrong. MAC drivers should not be reading
> PHY registers, or calling functions like phy_read_status(). Setting
> flow control should be performed in lan78xx_link_status_change() using
> phydev->pause and phydev->asym_pause.
Thanks for looking into this! I'm not familiar with ethernet drivers
yet, do you have a hint on where I can learn what those functions
*should* be doing (and thus how they might be fixed)?
Documentation/driver-api/phy/phy.rst has only a very general overview of
phy drivers.

Best regards,
Fiona