Date:   Wed, 9 Sep 2020 09:53:55 +0200
From:   Marc Leeman <marc.leeman@...il.com>
To:     netdev@...r.kernel.org
Subject: (pch_gbe): transmit queue 0 timed out

Hi,

I'd like to get some feedback on an issue that has popped up on newer
systems (with increased load).

The system uses an older Atom CPU with an integrated MAC. When
flooding the NIC with multicast traffic (with multiple listeners), we
get the following:

-----

Aug 16 01:21:55 dss kernel: [ 1357.210634] NETDEV WATCHDOG: eth0 (pch_gbe): transmit queue 0 timed out
Aug 16 01:21:55 dss kernel: [ 1357.210680] WARNING: CPU: 1 PID: 1187 at net/sched/sch_generic.c:466 dev_watchdog+0x1b6/0x1c0
Aug 16 01:21:55 dss kernel: [ 1357.210683] Modules linked in: 8021q garp stp mrp llc rfkill nft_chain_nat_ipv4 nf_nat_ipv4 xt_REDIRECT nf_nat nf_log_ipv4 nf_log_common nft_counter xt_LOG i2c_dev ie6xx_wdt lpc_sch xt_multiport i2c_i801 xt_pkttype xt_recent xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_tcpudp nft_compat nf_tables nfnetlink coretemp kvm irqbypass serio_raw pcspkr gma500_gfx pch_can can_dev drm_kms_helper drm pch_uart sg pch_dma pch_udc i2c_algo_bit udc_core fb_sys_fops syscopyarea pch_phub sysfillrect evdev sysimgblt video pcc_cpufreq button acpi_cpufreq ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd aes_i586 aufs(OE) sd_mod i2c_isch psmouse mfd_core e1000e spi_topcliff_pch ahci ohci_pci libahci ohci_hcd ehci_pci libata ehci_hcd sdhci_pci
Aug 16 01:21:55 dss kernel: [ 1357.210802]  usbcore cqhci pch_gbe sdhci scsi_mod ptp_pch mmc_core mii ptp pps_core gpio_pch usb_common [last unloaded: lpc_sch]
Aug 16 01:21:55 dss kernel: [ 1357.210831] CPU: 1 PID: 1187 Comm: mysqld Tainted: G           OE     4.19.0-9-686 #1 Debian 4.19.118-2+deb10u1
Aug 16 01:21:55 dss kernel: [ 1357.210835] Hardware name: EKF Elektronik GmbH PC2-LIMBO/PC2-LIMBO, BIOS 094 2017-02-01
Aug 16 01:21:55 dss kernel: [ 1357.210844] EIP: dev_watchdog+0x1b6/0x1c0
Aug 16 01:21:55 dss kernel: [ 1357.210853] Code: 8b 50 3c 89 f8 e8 ca cd 10 00 8b 7e f0 eb a3 89 f8 c6 05 eb 4e 90 d7 01 e8 b7 dc fc ff 53 50 57 68 44 f7 82 d7 e8 4e ee ae ff <0f> 0b 83 c4 10 eb c9 8d 76 00 3e 8d 74 26 00 55 89 e5 57 56 89 d6
Aug 16 01:21:55 dss kernel: [ 1357.210859] EAX: 0000003b EBX: 00000000 ECX: f473ccac EDX: 00000007
Aug 16 01:21:55 dss kernel: [ 1357.210864] ESI: f41fc2e8 EDI: f41fc000 EBP: f417df68 ESP: f417df40
Aug 16 01:21:55 dss kernel: [ 1357.210871] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010292
Aug 16 01:21:55 dss kernel: [ 1357.210876] CR0: 80050033 CR2: b78e1010 CR3: 1bbd7000 CR4: 000006d0
Aug 16 01:21:55 dss kernel: [ 1357.210880] Call Trace:
Aug 16 01:21:55 dss kernel: [ 1357.210887]  <SOFTIRQ>
Aug 16 01:21:55 dss kernel: [ 1357.210903]  ? pfifo_fast_enqueue+0xf0/0xf0
Aug 16 01:21:55 dss kernel: [ 1357.210913]  call_timer_fn+0x2f/0x130
Aug 16 01:21:55 dss kernel: [ 1357.210921]  ? pfifo_fast_enqueue+0xf0/0xf0
Aug 16 01:21:55 dss kernel: [ 1357.210930]  run_timer_softirq+0x1bd/0x3f0
Aug 16 01:21:55 dss kernel: [ 1357.210944]  __do_softirq+0xb2/0x275
Aug 16 01:21:55 dss kernel: [ 1357.210955]  ? __softirqentry_text_start+0x8/0x8
Aug 16 01:21:55 dss kernel: [ 1357.210964]  call_on_stack+0x12/0x50
Aug 16 01:21:55 dss kernel: [ 1357.210969]  </SOFTIRQ>
Aug 16 01:21:55 dss kernel: [ 1357.210977]  ? irq_exit+0xc5/0xd0
Aug 16 01:21:55 dss kernel: [ 1357.210986]  ? smp_apic_timer_interrupt+0x6c/0x130
Aug 16 01:21:55 dss kernel: [ 1357.210996]  ? apic_timer_interrupt+0xd5/0xdc
Aug 16 01:21:55 dss kernel: [ 1357.211007]  ? nmi+0x8b/0x198
----

The watchdog trace looks eerily similar to the issue reported on this
mailing list 8 years ago:
https://www.spinics.net/lists/netdev/msg198234.html

where locking was tweaked to compensate.

When I compare the different kernels (4.19.132, 5.8.7), the driver code
has changed little between them. Relative to the patch that was confirmed
to be a fix at the time, however, the locking has changed a bit:

1. netif_tx_lock is used instead of spin_lock(&tx_ring->tx_lock);

2. the locking has been removed from pch_gbe_xmit_frame.
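
To make the question concrete, here is a toy user-space model of the
locking scheme those two points describe: one lock, standing in for
netif_tx_lock, serializes the transmit path and the TX clean-up path on
a shared descriptor ring. This is only a sketch and not driver code. The
class TxRing and its methods are made up for illustration, and as far as
I understand the core stack normally takes the TX queue lock around the
driver's xmit callback rather than the driver taking it itself, whereas
the model takes it inside xmit() to stay short.

```python
import threading


class TxRing:
    """Toy model of a TX descriptor ring (hypothetical, not pch_gbe code)."""

    def __init__(self, size):
        self.size = size
        self.next_to_use = 0      # next descriptor the transmit path fills
        self.next_to_clean = 0    # next descriptor the clean-up path reclaims
        self.tx_lock = threading.Lock()  # stands in for netif_tx_lock

    def unused(self):
        # One slot is always left empty so "full" and "empty" differ.
        return (self.next_to_clean - self.next_to_use - 1) % self.size

    def xmit(self):
        """Transmit path: claim one descriptor, or return False if full."""
        with self.tx_lock:
            if self.unused() == 0:
                return False  # ring full, caller would stop the queue
            self.next_to_use = (self.next_to_use + 1) % self.size
            return True

    def clean(self):
        """Clean-up path: reclaim all in-flight descriptors, return count."""
        with self.tx_lock:
            cleaned = (self.next_to_use - self.next_to_clean) % self.size
            self.next_to_clean = self.next_to_use
            return cleaned
```

In this model both paths read and update the two indices only while
holding the same lock, so the number of reclaimed descriptors always
matches the number of accepted frames. If either path touched the
indices without it, that accounting could go wrong under load.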

Is this again an issue with missing locks?

Since it has been quite some time since I last did any kernel work, I
thought it better to check first.
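
For anyone who wants to try to reproduce the timeout, the multicast load
described at the top can be approximated with a small UDP sender like
the one below. It is a sketch under assumptions: the group address,
port, payload size and packet count are placeholders and not the values
from the affected setup, and the helper names make_sender and flood are
made up here.

```python
import socket

GROUP = "239.1.2.3"    # placeholder multicast group (assumption)
PORT = 5004            # placeholder UDP port (assumption)
PAYLOAD_SIZE = 1400    # placeholder payload size in bytes (assumption)


def make_sender(ttl=1):
    """Create a UDP socket whose multicast packets stay within `ttl` hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def flood(sock, group=GROUP, port=PORT, size=PAYLOAD_SIZE, count=100000):
    """Send `count` datagrams of `size` zero bytes to the multicast group.

    `sock` only needs a sendto(data, address) method. Returns the number
    of datagrams handed to the socket.
    """
    payload = bytes(size)
    sent = 0
    for _ in range(count):
        sock.sendto(payload, (group, port))
        sent += 1
    return sent
```

Calling flood(make_sender()) from one or more processes, with several
listeners joined to the same group, should produce a comparable kind of
load. Whether it is enough to trigger the watchdog on a given board is
not something this sketch can promise.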


-- 
g. Marc
