netdev - Re: Tulip 21142 panic on physical link disconnect

Message-ID: <52564e1f-ab05-4347-bd64-b38a69180499@gmail.com>
Date: Thu, 19 Jun 2025 12:46:35 -0700
From: Florian Fainelli <f.fainelli@...il.com>
To: "Maciej W. Rozycki" <macro@...am.me.uk>,
 Greg Chandler <chandleg@...ardsworks.org>
Cc: stable@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: Tulip 21142 panic on physical link disconnect

Hi Maciej,

On 6/19/25 12:36, Maciej W. Rozycki wrote:
> On Thu, 19 Jun 2025, Greg Chandler wrote:
> 
>> So what I know for sure is this:
>> The tulip driver on alpha (generic and DP264) oops/panic on physical
>> disconnect, but only when an IP address is bound.
>> It does not panic when no address is bound to the interface.
>> It does not matter if the driver is compiled in, or if it is compiled as a
>> module.
>> It does not matter if all of the options are set for tulip or if none of them
>> are:
>>      New bus configuration
>>      Use PCI shared mem for NIC registers
>>      Use RX polling (NAPI)
>>      Use Interrupt Mitigation
>> The physical link does not auto-negotiate, and mii-tool does not seem to be
>> able to force it with -F or -A like you would expect it to.
>> The kernel does not drop the "Link is Up/Link is Down" messages when the PHY
>> "links"
>> The switch and interface both show LEDs as if linked at 10-Half-Duplex, and
>> the lights turn off when the link is broken.
>> Subsequently they do relink at 10-Half again if plugged back in.
>> I did also attempt to test the kernel level stack for nfsroot, just to see if
>> it worked prior to init launching everything else, and it did not.
>> I used the same IP configuration for that test as all of the tests in these
>> emails.
>> All of the oops/panics seem to happen at:
>>      kernel/time/timer.c:1657 __timer_delete_sync+0x10c/0x150
> 
>   FYI something's changed a while ago in how `del_timer_sync' is handled
> and I can see a similar warning nowadays with another network driver with
> the MIPS platform.
> 
>   Since I'm the maintainer of said driver I mean to bisect it and figure
> out what's going on here, but haven't found time so far owing to other
> commitments (and the driver otherwise works just fine regardless, so it's
> a minor annoyance).  If you beat me to it, then I'll gladly accept it, but
> otherwise I'm just letting you know you're not alone with this issue and
> that it's not specific to the DEC Tulip driver on your system.
> 
>   For the record:
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/time/timer.c:1563 __timer_delete_sync+0x110/0x118
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Tainted: G        W          6.4.0-rc3-00030-gae62c49c0cef #21
> Stack : 807a0000 80095a8c 00000000 00000004 806a0000 00000009 80c09dac 807d0000
>          807a0000 807056ec 80769fac 807a13f3 807d30c4 1000ec00 80c09d58 80787a18
>          00000000 00000000 807056ec 00000000 00000001 80c09c94 00000077 34633236
>          20202020 00000000 807d7311 20202020 807056ec 1000ec00 00000000 00000000
>          806fcb60 806fcb38 807a0000 00000001 00000000 fffffffe 00000000 807d0000
>          ...
> Call Trace:
> [<80048ecc>] show_stack+0x2c/0xf8
> [<80645c88>] dump_stack_lvl+0x34/0x4c
> [<80641d00>] __warn+0xb4/0xe8
> [<80641d84>] warn_slowpath_fmt+0x50/0x88
> [<800b177c>] __timer_delete_sync+0x110/0x118
> [<8040f4b0>] fza_interrupt+0x904/0x1004
> [<80098d7c>] __handle_irq_event_percpu+0x84/0x188
> [<80098f1c>] handle_irq_event+0x38/0xbc
> [<8009d4e4>] handle_level_irq+0xc8/0x208
> [<80098110>] generic_handle_irq+0x44/0x5c
> [<8064f450>] do_IRQ+0x1c/0x28
> [<80041cf0>] dec_irq_dispatch+0x10/0x20
> [<80043754>] handle_int+0x14c/0x158
> [<8008bf64>] do_idle+0x5c/0x15c
> [<8008c368>] cpu_startup_entry+0x20/0x28
> [<8064657c>] kernel_init+0x0/0x114
> 
> ---[ end trace 0000000000000000 ]---
> 
> -- the arrival of this particular device state change interrupt means the
> timer set up just in case the device gets stuck can be deleted, so I'm not
> sure why calling `del_timer_sync' to discard the timer has become a no-no
> now; this code is 20+ years old now, though I sat on it for a while and
> then it took some time and effort to get it upstream too.  The issue has
> started sometime between 5.18 (clean boot) and 6.4 (quoted above).
> 
>   Maybe it'll ring someone's bell and they'll chime in or otherwise I'll
> bisect it... sometime.  Or feel free to start yourself with 5.18, as it's
> not terribly old, only a bit and certainly not so as 2.6 is.

I am still not sure why I could not see that warning on my Cobalt Qube2 
when trying to reproduce Greg's original issue; that is, with an IP 
address assigned on the interface, yanking the cable did not trigger a 
timer warning. It could be that the machine is orders of magnitude 
slower and has a different CONFIG_HZ value, which just made it less 
likely to be seen?
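
For reference, a minimal userspace sketch of the kind of check that
appears to fire in the trace above. It assumes the warning comes from
__timer_delete_sync() being reached in hard interrupt context on a timer
that is not marked TIMER_IRQSAFE, which would match the fza_interrupt
call path. This is a model only, not kernel code: every model_* name and
the flag value are hypothetical, and it does not explain why 5.18 booted
clean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder flag value for this model only; not the kernel's constant. */
#define MODEL_TIMER_IRQSAFE 0x1u

struct model_timer {
    unsigned int flags;
    bool pending;
};

static bool model_in_hardirq;   /* stands in for the kernel's in_hardirq() */
static int model_warnings;      /* counts how often the check fired */

/* Models a synchronous timer delete: warn if called from (modelled) hard
 * interrupt context on a timer without the IRQSAFE flag, then deactivate.
 * Returns true if the timer was pending. */
static bool model_timer_delete_sync(struct model_timer *t)
{
    assert(t != NULL);
    if (model_in_hardirq && !(t->flags & MODEL_TIMER_IRQSAFE)) {
        model_warnings++;
        fprintf(stderr, "model WARNING: sync timer delete in hardirq\n");
    }
    bool was_pending = t->pending;
    t->pending = false;
    return was_pending;
}

/* Models an interrupt handler that discards its watchdog timer once the
 * expected device state change has arrived. */
static void model_interrupt(struct model_timer *t)
{
    model_in_hardirq = true;
    model_timer_delete_sync(t);
    model_in_hardirq = false;
}

/* Runs the three cases and returns the number of warnings raised. */
static int model_demo(void)
{
    struct model_timer plain = { .flags = 0, .pending = true };
    struct model_timer irqsafe = { .flags = MODEL_TIMER_IRQSAFE,
                                   .pending = true };

    model_warnings = 0;
    model_timer_delete_sync(&plain);  /* process context: no warning */
    plain.pending = true;
    model_interrupt(&plain);          /* hardirq, not IRQSAFE: warns */
    model_interrupt(&irqsafe);        /* hardirq, IRQSAFE: no warning */
    return model_warnings;
}
```

In this model only the middle case raises a warning. Under the same
assumption, whether a given driver hits it depends on the context its
timer delete runs in, not on the hardware behind it.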
-- 
Florian