netdev - Re: r8169 link up but no traffic, and watchdog error


Message-ID: <87msx1bezt.fsf@mkjws.danelec-net.lan>
Date: Mon, 02 Oct 2023 10:28:28 +0200
From: Martin Kjær Jørgensen <me@...y.org>
To: Heiner Kallweit <hkallweit1@...il.com>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
 nic_swsd@...ltek.com
Subject: Re: r8169 link up but no traffic, and watchdog error


On Mon, Sep 25 2023, Heiner Kallweit <hkallweit1@...il.com> wrote:

> On 25.09.2023 17:41, Martin Kjær Jørgensen wrote:
>>
>> On Mon, Sep 25 2023, Heiner Kallweit <hkallweit1@...il.com> wrote:
>>
>>> On 25.09.2023 13:30, Martin Kjær Jørgensen wrote:
>>>>
>>>> On Mon, Sep 25 2023, Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>
>>>>
>>>> There are no PCI extension cards.
>>>>
>>>
>>> Your BIOS signature indicates that the system is a Thinkstation P350.
>>> According to the Lenovo website it comes with one Intel-based network port.
>>> However, you have 4 additional Realtek-based network ports on the mainboard?
>>>
>>
>> Yes. 2 PCIE cards with two Realtek ethernet controllers each.
>>
>>>>> And does the problem occur with all of your NICs?
>>>>
>>>> No, only the Realtek ones.
>>>>
>>>>> The exact NIC type might provide a hint, best provide a full dmesg log.
>>>> [ 1512.295490] RSP: 0018:ffffbc0240193e88 EFLAGS: 00000246
>>>> [ 1512.295492] RAX: ffff998935680000 RBX: ffffdc023faa8e00 RCX: 000000000000001f
>>>> [ 1512.295493] RDX: 0000000000000002 RSI: ffffffffb544f718 RDI: ffffffffb543bc32
>>>> [ 1512.295494] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000018
>>>> [ 1512.295495] R10: ffff9989356b1dc4 R11: 00000000000058a8 R12: ffffffffb5d981a0
>>>> [ 1512.295496] R13: 000001601bd198ef R14: 0000000000000003 R15: 0000000000000000
>>>> [ 1512.295497]  ? cpuidle_enter_state+0xbd/0x440
>>>> [ 1512.295499]  cpuidle_enter+0x2d/0x40
>>>> [ 1512.295501]  do_idle+0x217/0x270
>>>> [ 1512.295503]  cpu_startup_entry+0x1d/0x20
>>>> [ 1512.295505]  start_secondary+0x11a/0x140
>>>> [ 1512.295508]  secondary_startup_64_no_verify+0x17e/0x18b
>>>> [ 1512.295510]  </TASK>
>>>> [ 1512.295511] ---[ end trace 0000000000000000 ]---
>>>> [ 1512.295526] r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>> [ 1531.322039] r8169 0000:03:00.0 enp3s0: Link is Down
>>>> [ 1534.138489] r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>>> [ 1538.177385] r8169 0000:03:00.0 enp3s0: Link is Down
>>>> [ 1566.174660] r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>>> [ 1567.839082] r8169 0000:03:00.0 enp3s0: Link is Down
>>>> [ 1570.621088] r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>>> [ 1576.294267] r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>
>>> Regarding the following: The issue occurs after a few seconds of link loss.
>>> Was this an intentional link-down event?
>>
>> Yes, I intentionally unplug the cable at the other end for the link to go down.
>>
>>> And is the issue always related to link-up after a link-loss period?
>>>
>>
>> Yes, it happens after the cable is plugged in again, so after a link-loss period.
>>
>>
>>>> [ 1488.643231] r8169 0000:03:00.0 enp3s0: Link is Down
>>>> [ 1506.576941] r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control rx/tx
>>>> [ 1512.295215] ------------[ cut here ]------------
>>>> [ 1512.295219] NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out 5368 ms
>
> Could you please test whether the following helps?
>
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index 6351a2dc1..a2fbfff5a 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4596,7 +4596,9 @@ static void r8169_phylink_handler(struct net_device *ndev)
>  	if (netif_carrier_ok(ndev)) {
>  		rtl_link_chg_patch(tp);
>  		pm_request_resume(d);
> +		netif_wake_queue(tp->dev);
>  	} else {
> +		rtl_reset_work(tp);
>  		pm_runtime_idle(d);
>  	}

This patch seems to help. I have applied it to a vanilla 6.1.55 kernel and
have been running it for a while.

There are no more netdev watchdog errors in the kernel log, and the interface
responds to traffic almost instantly when it comes up again after link loss :-)
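
For anyone who wants to re-check a saved dmesg capture for the same symptom,
here is a minimal sketch in Python. It is not part of the patch or the driver.
The function name find_watchdog_after_link_up and the two regular expressions
are my own assumptions, written against the log lines quoted above ("Link is
Up", "Link is Down", "NETDEV WATCHDOG: ... (r8169)"); other kernels or drivers
may format these lines differently. It reports each watchdog timeout that
follows a link-up on the same interface, with the delay in seconds.

```python
import re

# Assumed patterns, modelled on the r8169 and watchdog lines quoted in this
# thread. They are illustrative, not a stable kernel log format.
LINK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+r8169 \S+ (\S+): Link is (Up|Down)"
)
WATCHDOG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NETDEV WATCHDOG: (\S+) \(r8169\): "
    r"transmit queue \d+ timed out"
)


def find_watchdog_after_link_up(lines):
    """Return (iface, seconds_since_link_up) for each watchdog timeout
    that is logged while the interface's last link event was "Up"."""
    last_up = {}  # iface -> timestamp of the most recent link-up
    hits = []
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            ts, iface, state = float(m.group(1)), m.group(2), m.group(3)
            if state == "Up":
                last_up[iface] = ts
            else:
                # Link went down: forget the earlier link-up.
                last_up.pop(iface, None)
            continue
        m = WATCHDOG_RE.search(line)
        if m:
            ts, iface = float(m.group(1)), m.group(2)
            if iface in last_up:
                hits.append((iface, round(ts - last_up[iface], 6)))
    return hits
```

Feeding it the lines of a file written with "dmesg > dmesg.txt" should give an
empty list on a kernel where the problem no longer occurs, assuming the log
format matches the patterns above.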