Message-ID: <3179622f-7090-4a57-98ba-9042809a0d2a@its-lehmann.de>
Date: Mon, 12 Feb 2024 11:39:08 +0100
From: Arno Lehmann <al@...-lehmann.de>
To: netdev@...r.kernel.org
Cc: linux-kernel@...r.kernel.org
Subject: intel i225 NIC loses PCIe link, network becomes unusable

Hello everybody,

I'm struggling with the problem named in the subject.

I originally reported this to the Debian bug tracker; you'll find the 
history here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1060706

Infrequently, and apparently at random, the PCIe link for the NIC is 
lost. The network then becomes unusable. Unloading and reloading the igc 
module (rmmod / modprobe) does not resolve the problem; a reboot is 
necessary.

I first noticed this when installing the system last year. After a bit 
of searching, I found that the kernel option 'pcie_aspm=off' was supposed 
to help, set it, and have had it enabled ever since.

The problem persists.
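
A minimal sketch of how one might confirm that the option actually reached 
the running kernel, by looking at /proc/cmdline. The function names and the 
sample command line are hypothetical and not taken from the affected 
system. Note that, as far as the kernel-parameters documentation describes 
it, 'pcie_aspm=off' makes the kernel leave the ASPM configuration alone, so 
whatever the firmware configured may remain in effect.

```python
def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if 'pcie_aspm=off' is one of the kernel command line options."""
    return "pcie_aspm=off" in cmdline.split()


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read()


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz-6.5.0-0.deb12.4-amd64 root=/dev/sda1 ro quiet pcie_aspm=off"
print(aspm_disabled_on_cmdline(sample))
```

On a live system, aspm_disabled_on_cmdline(read_running_cmdline()) performs 
the same check against the kernel that is actually running.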

Most recent case is this one:

[So Feb 11 15:47:18 2024] igc 0000:0b:00.0 eno1: NIC Link is Down
[So Feb 11 15:47:21 2024] igc 0000:0b:00.0 eno1: NIC Link is Up 1000 
Mbps Full Duplex, Flow Control: RX
[So Feb 11 16:52:01 2024] igc 0000:0b:00.0 eno1: NIC Link is Down
[So Feb 11 16:52:05 2024] igc 0000:0b:00.0 eno1: NIC Link is Up 1000 
Mbps Full Duplex, Flow Control: RX

(I have no idea whether the above two events have any relevance.)

[So Feb 11 18:47:59 2024] igc 0000:0b:00.0 eno1: PCIe link lost, device 
now detached
[So Feb 11 18:47:59 2024] ------------[ cut here ]------------
[So Feb 11 18:47:59 2024] igc: Failed to read reg 0xc030!
[So Feb 11 18:47:59 2024] WARNING: CPU: 20 PID: 136256 at 
drivers/net/ethernet/intel/igc/igc_main.c:6583 igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] Modules linked in: rfcomm cpufreq_userspace 
cpufreq_powersave cpufreq_ondemand cpufreq_conservative nfsv3 nfs_acl 
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache 
netfs overlay qrtr cmac algif_hash algif_skcipher af_alg bnep sunrpc 
binfmt_misc nls_ascii nls_cp437 vfat fat ext4 mbcache jbd2 
intel_rapl_msr intel_rapl_common btusb btrtl btbcm btintel btmtk 
bluetooth mt7921e snd_hda_codec_hdmi mt7921_common mt76_connac_lib 
edac_mce_amd snd_hda_intel mt76 snd_intel_dspcfg kvm_amd 
snd_intel_sdw_acpi sha3_generic mac80211 jitterentropy_rng snd_usb_audio 
uvcvideo snd_hda_codec drbg libarc4 videobuf2_vmalloc snd_usbmidi_lib 
asus_nb_wmi eeepc_wmi kvm uvc videobuf2_memops snd_rawmidi ansi_cprng 
snd_hda_core asus_wmi videobuf2_v4l2 snd_seq_device snd_hwdep 
ecdh_generic irqbypass battery ecc ledtrig_audio videodev snd_pcm 
sparse_keymap cfg80211 crc16 rapl snd_timer videobuf2_common 
platform_profile wmi_bmof sp5100_tco pcspkr snd ccp mc watchdog k10temp 
soundcore rfkill joydev sg evdev msr
[So Feb 11 18:47:59 2024]  parport_pc ppdev lp parport fuse loop 
efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs libcrc32c 
crc32c_generic sd_mod dm_crypt dm_mod uas usb_storage hid_generic amdgpu 
amdxcp drm_buddy gpu_sched usbhid i2c_algo_bit drm_suballoc_helper hid 
drm_display_helper sr_mod cdrom cec rc_core crc32_pclmul drm_ttm_helper 
crc32c_intel ghash_clmulni_intel ttm ahci sha512_ssse3 sha512_generic 
libahci nvme xhci_pci drm_kms_helper libata xhci_hcd nvme_core drm 
aesni_intel t10_pi usbcore scsi_mod crypto_simd crc64_rocksoft_generic 
igc cryptd crc64_rocksoft crc_t10dif crct10dif_generic i2c_piix4 
crct10dif_pclmul crc64 crct10dif_common scsi_common usb_common video wmi 
gpio_amdpt gpio_generic button
[So Feb 11 18:47:59 2024] CPU: 20 PID: 136256 Comm: kworker/20:0 Not 
tainted 6.5.0-0.deb12.4-amd64 #1  Debian 6.5.10-1~bpo12+1
[So Feb 11 18:47:59 2024] Hardware name: ASUS System Product Name/ROG 
STRIX X670E-A GAMING WIFI, BIOS 1904 01/29/2024
[So Feb 11 18:47:59 2024] Workqueue: events igc_watchdog_task [igc]
[So Feb 11 18:47:59 2024] RIP: 0010:igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] Code: 48 c7 c6 10 76 36 c0 e8 81 6a c1 d5 48 
8b bb 28 ff ff ff e8 05 d2 97 d5 84 c0 74 bc 89 ee 48 c7 c7 38 76 36 c0 
e8 c3 ee 36 d5 <0f> 0b eb aa b8 ff ff ff ff e9 15 cf e7 d5 0f 1f 44 00 
00 90 90 90
[So Feb 11 18:47:59 2024] RSP: 0018:ffffa203cfe8fdd8 EFLAGS: 00010282
[So Feb 11 18:47:59 2024] RAX: 0000000000000000 RBX: ffff961b5c75ccb8 
RCX: 0000000000000027
[So Feb 11 18:47:59 2024] RDX: ffff962a5e7213c8 RSI: 0000000000000001 
RDI: ffff962a5e7213c0
[So Feb 11 18:47:59 2024] RBP: 000000000000c030 R08: 0000000000000000 
R09: ffffa203cfe8fc68
[So Feb 11 18:47:59 2024] R10: 0000000000000003 R11: ffff962a9de3ac28 
R12: ffff961b5c75c000
[So Feb 11 18:47:59 2024] R13: 0000000000000000 R14: ffff961b54c92d40 
R15: 000000000000c030
[So Feb 11 18:47:59 2024] FS:  0000000000000000(0000) 
GS:ffff962a5e700000(0000) knlGS:0000000000000000
[So Feb 11 18:47:59 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[So Feb 11 18:47:59 2024] CR2: 00007fb76de93000 CR3: 00000001153d0000 
CR4: 0000000000750ee0
[So Feb 11 18:47:59 2024] PKRU: 55555554
[So Feb 11 18:47:59 2024] Call Trace:
[So Feb 11 18:47:59 2024]  <TASK>
[So Feb 11 18:47:59 2024]  ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024]  ? __warn+0x81/0x130
[So Feb 11 18:47:59 2024]  ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024]  ? report_bug+0x171/0x1a0
[So Feb 11 18:47:59 2024]  ? srso_alias_return_thunk+0x5/0x7f
[So Feb 11 18:47:59 2024]  ? prb_read_valid+0x1b/0x30
[So Feb 11 18:47:59 2024]  ? handle_bug+0x41/0x70
[So Feb 11 18:47:59 2024]  ? exc_invalid_op+0x17/0x70
[So Feb 11 18:47:59 2024]  ? asm_exc_invalid_op+0x1a/0x20
[So Feb 11 18:47:59 2024]  ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024]  ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024]  igc_update_stats+0x8a/0x6d0 [igc]
[So Feb 11 18:47:59 2024]  igc_watchdog_task+0x9d/0x4a0 [igc]
[So Feb 11 18:47:59 2024]  process_one_work+0x1df/0x3e0
[So Feb 11 18:47:59 2024]  worker_thread+0x51/0x390
[So Feb 11 18:47:59 2024]  ? __pfx_worker_thread+0x10/0x10
[So Feb 11 18:47:59 2024]  kthread+0xe5/0x120
[So Feb 11 18:47:59 2024]  ? __pfx_kthread+0x10/0x10
[So Feb 11 18:47:59 2024]  ret_from_fork+0x31/0x50
[So Feb 11 18:47:59 2024]  ? __pfx_kthread+0x10/0x10
[So Feb 11 18:47:59 2024]  ret_from_fork_asm+0x1b/0x30
[So Feb 11 18:47:59 2024]  </TASK>
[So Feb 11 18:47:59 2024] ---[ end trace 0000000000000000 ]---
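
To judge whether the earlier link flaps are related to the PCIe link loss, 
it can help to compute how much time passed between the last "NIC Link is 
Up" message and the "PCIe link lost" message. The following is a rough 
sketch, assuming unwrapped log lines with bracketed timestamps in the style 
shown above; the function names and the month table are assumptions made 
for illustration.

```python
import re
from datetime import datetime

# Month abbreviations that may appear in the timestamps. Only "Feb" occurs in
# the excerpt; the English names plus German "Okt"/"Dez" are an assumption.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Okt": 10, "Nov": 11,
    "Dec": 12, "Dez": 12,
}

STAMP_RE = re.compile(r"^\[\w+ (\w+) +(\d+) (\d+):(\d+):(\d+) (\d{4})\] (.*)$")


def parse_line(line):
    """Split a timestamped log line into (datetime, message), or return None."""
    match = STAMP_RE.match(line)
    if match is None or match.group(1) not in MONTHS:
        return None
    month, day, hour, minute, second, year, message = match.groups()
    stamp = datetime(int(year), MONTHS[month], int(day),
                     int(hour), int(minute), int(second))
    return stamp, message


def gap_before_loss(lines):
    """Time from the last 'NIC Link is Up' to the first later 'PCIe link lost'."""
    last_up = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if "NIC Link is Up" in message:
            last_up = stamp
        elif "PCIe link lost" in message and last_up is not None:
            return stamp - last_up
    return None


# Lines copied from the log above, with the mail's line wrapping undone.
excerpt = [
    "[So Feb 11 16:52:01 2024] igc 0000:0b:00.0 eno1: NIC Link is Down",
    "[So Feb 11 16:52:05 2024] igc 0000:0b:00.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX",
    "[So Feb 11 18:47:59 2024] igc 0000:0b:00.0 eno1: PCIe link lost, device now detached",
]
print(gap_before_loss(excerpt))  # prints 1:55:54
```

For this excerpt the gap is just under two hours, which by itself neither 
confirms nor rules out a connection between the two kinds of events.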


With guidance from the friendly folks at the Debian bug tracker, we 
found that this happens with many kernel versions, as can be seen from 
the following (condensed list below):

# journalctl  --grep '(Linux version|PCIe link lost)' --quiet | cat
Aug 30 18:16:18 Zwerg kernel: Linux version 6.1.0-11-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.38-4 (2023-08-08)
Sep 20 14:21:17 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Sep 20 19:47:06 Zwerg kernel: Linux version 6.1.0-11-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.38-4 (2023-08-08)
Okt 04 17:16:08 Zwerg kernel: Linux version 6.1.0-12-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.52-1 (2023-09-07)
Okt 06 05:44:20 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Okt 07 16:39:10 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Okt 07 16:43:41 Zwerg kernel: Linux version 6.1.0-12-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.52-1 (2023-09-07)
Okt 23 18:23:54 Zwerg kernel: Linux version 6.1.0-12-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.52-1 (2023-09-07)
Okt 23 18:31:25 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Okt 23 18:48:58 Zwerg kernel: Linux version 6.1.0-13-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.55-1 (2023-09-29)
Okt 30 11:16:06 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Okt 31 13:50:06 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Okt 31 13:52:01 Zwerg kernel: Linux version 6.1.0-13-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.55-1 (2023-09-29)
Nov 22 18:59:11 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Nov 23 12:18:19 Zwerg kernel: Linux version 6.1.0-13-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.55-1 (2023-09-29)
Nov 23 15:45:49 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Nov 23 15:52:51 Zwerg kernel: Linux version 6.1.0-13-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.55-1 (2023-09-29)
Dez 06 19:06:18 Zwerg kernel: Linux version 6.1.0-13-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.55-1 (2023-09-29)
Dez 09 15:12:13 Zwerg kernel: Linux version 6.1.0-14-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.64-1 (2023-11-30)
Dez 19 07:33:02 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Dez 20 10:29:21 Zwerg kernel: Linux version 6.1.0-15-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.66-1 (2023-12-09)
Jan 01 09:57:40 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Jan 02 13:41:33 Zwerg kernel: Linux version 6.1.0-15-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.66-1 (2023-12-09)
Jan 10 16:15:20 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Jan 13 11:02:41 Zwerg kernel: Linux version 6.1.0-17-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.69-1 (2023-12-30)
Jan 13 11:16:31 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Jan 13 11:18:13 Zwerg kernel: Linux version 6.1.0-17-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.69-1 (2023-12-30)
Jan 19 14:25:08 Zwerg kernel: Linux version 6.1.0-1-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-13) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.39.90.20221231) #1 SMP PREEMPT_DYNAMIC 
Debian 6.1.4-1 (2023-01-07)
Jan 27 09:41:16 Zwerg kernel: Linux version 6.1.0-17-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.69-1 (2023-12-30)
Jan 27 09:44:53 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Jan 27 09:48:05 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Jan 27 09:52:16 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Jan 27 09:58:46 Zwerg kernel: Linux version 6.1.0-1-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-13) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.39.90.20221231) #1 SMP PREEMPT_DYNAMIC 
Debian 6.1.4-1 (2023-01-07)
Feb 01 04:19:17 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Feb 01 14:43:03 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Feb 01 14:50:04 Zwerg kernel: Linux version 6.1.0-17-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.69-1 (2023-12-30)
Feb 01 15:28:42 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 18:26:31 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 18:33:38 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, 
device now detached
Feb 08 18:58:25 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 19:00:32 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, 
device now detached
Feb 08 19:02:38 Zwerg kernel: igc 0000:0b:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Feb 08 19:05:30 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23)
Feb 09 13:25:08 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, 
device now detached
Feb 09 13:27:17 Zwerg kernel: igc 0000:0b:00.0 (unnamed net_device) 
(uninitialized): PCIe link lost, device now detached
Feb 09 13:30:42 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23)
Feb 11 18:47:57 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, 
device now detached
Feb 12 10:55:30 Zwerg kernel: Linux version 6.1.0-17-amd64 
(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.69-1 (2023-12-30)

The kernel versions I used were:

Debian 6.1.4-1 (2023-01-07)
Debian 6.1.38-4 (2023-08-08)
Debian 6.1.52-1 (2023-09-07)
Debian 6.1.55-1 (2023-09-29)
Debian 6.1.64-1 (2023-11-30)
Debian 6.1.66-1 (2023-12-09)
Debian 6.1.69-1 (2023-12-30)
Debian 6.5.10-1~bpo12+1 (2023-11-23)
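
The journal output above can also be summarised mechanically. Below is a 
rough sketch, assuming the journalctl output is fed in with one entry per 
line (not wrapped as in this mail); the function name is made up for 
illustration. It attributes every "PCIe link lost" message to the kernel 
that was booted most recently before it, so the repeated "(unnamed 
net_device) (uninitialized)" messages are counted as well.

```python
import re
from collections import Counter

# Matches a boot banner and captures the Debian package version near its end,
# e.g. "6.1.38-4" in "... Debian 6.1.38-4 (2023-08-08)".
BOOT_RE = re.compile(r"Linux version \S+ .*Debian (\S+) \(\d{4}-\d{2}-\d{2}\)")
LOSS_MARKER = "PCIe link lost"


def tally_link_losses(lines):
    """Count 'PCIe link lost' messages per most recently booted kernel version."""
    counts = Counter()
    current = None
    for line in lines:
        boot = BOOT_RE.search(line)
        if boot:
            current = boot.group(1)
            counts.setdefault(current, 0)  # list kernels with zero losses too
        elif LOSS_MARKER in line and current is not None:
            counts[current] += 1
    return dict(counts)


# Selected entries from the output above, with the line wrapping undone.
BANNER = ("(debian-kernel@...ts.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, "
          "GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC ")
sample = [
    "Aug 30 18:16:18 Zwerg kernel: Linux version 6.1.0-11-amd64 " + BANNER
    + "Debian 6.1.38-4 (2023-08-08)",
    "Sep 20 14:21:17 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached",
    "Feb 08 18:58:25 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 " + BANNER
    + "Debian 6.5.10-1~bpo12+1 (2023-11-23)",
    "Feb 08 19:00:32 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, device now detached",
    "Feb 08 19:02:38 Zwerg kernel: igc 0000:0b:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached",
]
for version, losses in tally_link_losses(sample).items():
    print(version, losses)
```

Run against the complete journal output, the same function would give one 
count per kernel version listed above.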


At this point, it looks like at least one person with a bit of insight 
is convinced this is an upstream issue.

Of course I'll try to provide whatever other information may be needed.

Most important, I think, is the hardware surrounding the NIC:
This is an ASUSTeK COMPUTER INC. ROG STRIX X670E-A GAMING WIFI, i.e. AMD 
X670 chipset with freshly updated BIOS: 1904 01/29/2024. CPU is an AMD 
Ryzen 9 7900X.

I have not set any particular overclocking or performance options, just 
tried to have all firmware settings on "conservative".


Mass storage is a Western Digital SN850X NVMe device.

I have experienced two cases where the storage device apparently 
"vanished" from the PCIe bus, which resulted in a flood of journald 
messages that it could not log anything to persistent storage. I never 
saw the first few lines of those occurrences, and obviously, I have 
no logs.

I did notice, however, that the system still responded to pings on the 
network.

All of this seems to indicate that this might be related to PCIe power 
management. I suspect that my gut feeling is not the best starting point 
to decide how to proceed here.

So, if you see any way to improve this situation and make the system 
reliably usable, I'm willing to help in any way I can, but you'll have 
to tell me what to do!

Cheers,

Arno

-- 
Arno Lehmann

IT-Service Lehmann
Sandstr. 6, 49080 Osnabrück
