netdev - Re: Tulip 21142 panic on physical link disconnect

Message-ID: <alpine.DEB.2.21.2506192007440.37405@angie.orcam.me.uk>
Date: Thu, 19 Jun 2025 20:36:44 +0100 (BST)
From: "Maciej W. Rozycki" <macro@...am.me.uk>
To: Greg Chandler <chandleg@...ardsworks.org>
cc: Florian Fainelli <f.fainelli@...il.com>, stable@...r.kernel.org, 
    netdev@...r.kernel.org
Subject: Re: Tulip 21142 panic on physical link disconnect

On Thu, 19 Jun 2025, Greg Chandler wrote:

> So what I know for sure is this:
> The tulip driver on alpha (generic and DP264) oops/panic on physical
> disconnect, but only when an IP address is bound.
> It does not panic when no address is bound to the interface.
> It does not matter if the driver is compiled in, or if it is compiled as a
> module.
> It does not matter if all of the options are set for tulip or if none of them
> are:
>     New bus configuration
>     Use PCI shared mem for NIC registers
>     Use RX polling (NAPI)
>     Use Interrupt Mitigation
> The physical link does not auto-negotiate, and mii-tool does not seem to be
> able to force it with -F or -A like you would expect it to.
> The kernel does not drop the "Link is Up/Link is Down" messages when the PHY
> "links"
> The switch and interface both show LEDs as if linked at 10-Half-Duplex, and
> the lights turn off when the link is broken.
> Subsequently they do relink at 10-Half again if plugged back in.
> I did also attempt to test the kernel level stack for nfsroot, just to see if
> it worked prior to init launching everything else, and it did not.
> I used the same IP configuration for that test as all of the tests in these
> emails.
> All of the oops/panics seem to happen at:
>     kernel/time/timer.c:1657 __timer_delete_sync+0x10c/0x150

 FYI something changed a while ago in how `del_timer_sync' is handled, 
and I now see a similar warning with another network driver on the MIPS 
platform.

 Since I'm the maintainer of said driver I mean to bisect it and figure 
out what's going on here, but haven't found time so far owing to other 
commitments (and the driver otherwise works just fine regardless, so it's 
a minor annoyance).  If you beat me to it, then I'll gladly accept a fix, 
but otherwise I'm just letting you know you're not alone with this issue 
and that it's not specific to the DEC Tulip driver on your system.

 For the record:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/time/timer.c:1563 __timer_delete_sync+0x110/0x118
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Tainted: G        W          6.4.0-rc3-00030-gae62c49c0cef #21
Stack : 807a0000 80095a8c 00000000 00000004 806a0000 00000009 80c09dac 807d0000
        807a0000 807056ec 80769fac 807a13f3 807d30c4 1000ec00 80c09d58 80787a18
        00000000 00000000 807056ec 00000000 00000001 80c09c94 00000077 34633236
        20202020 00000000 807d7311 20202020 807056ec 1000ec00 00000000 00000000
        806fcb60 806fcb38 807a0000 00000001 00000000 fffffffe 00000000 807d0000
        ...
Call Trace:
[<80048ecc>] show_stack+0x2c/0xf8
[<80645c88>] dump_stack_lvl+0x34/0x4c
[<80641d00>] __warn+0xb4/0xe8
[<80641d84>] warn_slowpath_fmt+0x50/0x88
[<800b177c>] __timer_delete_sync+0x110/0x118
[<8040f4b0>] fza_interrupt+0x904/0x1004
[<80098d7c>] __handle_irq_event_percpu+0x84/0x188
[<80098f1c>] handle_irq_event+0x38/0xbc
[<8009d4e4>] handle_level_irq+0xc8/0x208
[<80098110>] generic_handle_irq+0x44/0x5c
[<8064f450>] do_IRQ+0x1c/0x28
[<80041cf0>] dec_irq_dispatch+0x10/0x20
[<80043754>] handle_int+0x14c/0x158
[<8008bf64>] do_idle+0x5c/0x15c
[<8008c368>] cpu_startup_entry+0x20/0x28
[<8064657c>] kernel_init+0x0/0x114

---[ end trace 0000000000000000 ]---

-- the arrival of this particular device state change interrupt means the 
timer set up just in case the device gets stuck can be deleted, so I'm not 
sure why calling `del_timer_sync' to discard the timer has become a no-no; 
this code is 20+ years old, though I sat on it for a while and then it 
took some time and effort to get it upstream too.  The issue started 
sometime between 5.18 (clean boot) and 6.4 (quoted above).

 Maybe it'll ring someone's bell and they'll chime in, or otherwise I'll 
bisect it... sometime.  Or feel free to start yourself from 5.18, as it's 
not terribly old, and certainly not as old as 2.6 is.

  Maciej