netdev - Re: (pch_gbe): transmit queue 0 timed out

Message-ID: <CAH+fs=-d=RUAgbCSn-OasVwVERVMP+v-Y0vvU7Y74p8c8TihCw@mail.gmail.com>
Date:   Thu, 10 Sep 2020 16:05:48 +0200
From:   Marc Leeman <marc.leeman@...il.com>
To:     netdev@...r.kernel.org
Subject: Re: (pch_gbe): transmit queue 0 timed out

A small update:

We started reproducing the problem by flooding the NIC with multicast
packets in a directly connected setup (no switch, just the device and a
test device on a single cable). At some point the driver stops receiving
packets, and we've noticed that the link-level LEDs are dead too.
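
A multicast flood of this kind could be sketched roughly as follows. This
is only an illustrative sketch, not necessarily the test tool in use here:
the group address 239.1.1.1, port 5000, packet size and count are
assumptions, as are the helper names make_payload() and flood().

```python
import socket
import struct


def make_payload(seq: int, size: int) -> bytes:
    """Return a 4-byte big-endian sequence number, zero-padded to `size` bytes."""
    if size < 4:
        raise ValueError("size must be at least 4 bytes")
    return struct.pack("!I", seq & 0xFFFFFFFF) + bytes(size - 4)


def flood(group: str, port: int, count: int, size: int = 1400, ttl: int = 1) -> int:
    """Send `count` UDP datagrams to a multicast group as fast as possible.

    Returns the number of datagrams handed to the kernel.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the traffic on the directly connected segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sent = 0
    try:
        for seq in range(count):
            sock.sendto(make_payload(seq, size), (group, port))
            sent += 1
    finally:
        sock.close()
    return sent


if __name__ == "__main__":
    # Hypothetical group and port; adjust to whatever the listeners join.
    print(flood("239.1.1.1", 5000, count=1_000_000))
```

The sequence number in each payload makes it possible to see on the
receiving side at which packet reception stops.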

Reloading the driver does not re-establish the link (it stays in
no-carrier); only a reboot, when the CPU comes out of reset, does.
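
The no-carrier state can also be polled from userspace through the
standard sysfs attributes of the interface. A minimal sketch, assuming the
interface is eth0 (as in the log below); read_link_state() is a
hypothetical helper name:

```python
from pathlib import Path


def read_link_state(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read operstate and carrier for `iface` from sysfs.

    An attribute that cannot be read (for example `carrier` while the
    interface is administratively down) is reported as None.
    """
    base = Path(sysfs_root) / iface
    state = {}
    for attr in ("operstate", "carrier"):
        try:
            state[attr] = (base / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state


if __name__ == "__main__":
    print(read_link_state("eth0"))
```

Logging this once per second during the flood would show when the link
drops relative to the watchdog message.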

It's not related to the race conditions mentioned earlier.


On Wed, 9 Sep 2020 at 09:53, Marc Leeman <marc.leeman@...il.com> wrote:
>
> Hi
> I'd like to get some feedback on an issue that has popped up on newer
> systems (with increased load).
>
> The system uses an older CPU (Atom) that uses an integrated MAC. When
> flooding the NIC with multicast traffic (and multiple listeners), we
> get the following:
>
> -----
>
> Aug 16 01:21:55 dss kernel: [ 1357.210634] NETDEV WATCHDOG: eth0
> (pch_gbe): transmit queue 0 timed out
> Aug 16 01:21:55 dss kernel: [ 1357.210680] WARNING: CPU: 1 PID: 1187
> at net/sched/sch_generic.c:466 dev_watchdog+0x1b6/0x1c0
> Aug 16 01:21:55 dss kernel: [ 1357.210683] Modules linked in: 8021q
> garp stp mrp llc rfkill nft_chain_nat_ipv4 nf_nat_ipv4 xt_REDIRECT
> nf_nat nf_log_ipv4 nf_log_common nft_counter xt_LOG i2c_dev ie6xx_wdt
> lpc_sch xt_multiport i2c_i801 xt_pkttype xt_recent xt_state
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> xt_tcpudp nft_compat nf_tables nfnetlink coretemp kvm irqbypass
> serio_raw pcspkr gma500_gfx pch_can can_dev drm_kms_helper drm
> pch_uart sg pch_dma pch_udc i2c_algo_bit udc_core fb_sys_fops
> syscopyarea pch_phub sysfillrect evdev sysimgblt video pcc_cpufreq
> button acpi_cpufreq ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2
> crc32c_generic fscrypto ecb crypto_simd cryptd aes_i586 aufs(OE)
> sd_mod i2c_isch psmouse mfd_core e1000e spi_topcliff_pch ahci ohci_pci
> libahci ohci_hcd ehci_pci libata ehci_hcd sdhci_pci
> Aug 16 01:21:55 dss kernel: [ 1357.210802]  usbcore cqhci pch_gbe
> sdhci scsi_mod ptp_pch mmc_core mii ptp pps_core gpio_pch usb_common
> [last unloaded: lpc_sch]
> Aug 16 01:21:55 dss kernel: [ 1357.210831] CPU: 1 PID: 1187 Comm:
> mysqld Tainted: G           OE     4.19.0-9-686 #1 Debian
> 4.19.118-2+deb10u1
> Aug 16 01:21:55 dss kernel: [ 1357.210835] Hardware name: EKF
> Elektronik GmbH PC2-LIMBO/PC2-LIMBO, BIOS 094 2017-02-01
> Aug 16 01:21:55 dss kernel: [ 1357.210844] EIP: dev_watchdog+0x1b6/0x1c0
> Aug 16 01:21:55 dss kernel: [ 1357.210853] Code: 8b 50 3c 89 f8 e8 ca
> cd 10 00 8b 7e f0 eb a3 89 f8 c6 05 eb 4e 90 d7 01 e8 b7 dc fc ff 53
> 50 57 68 44 f7 82 d7 e8 4e ee ae ff <0f> 0b 83 c4 10 eb c9 8d 76 00 3e
> 8d 74 26 00 55 89 e5 57 56 89 d6
> Aug 16 01:21:55 dss kernel: [ 1357.210859] EAX: 0000003b EBX: 00000000
> ECX: f473ccac EDX: 00000007
> Aug 16 01:21:55 dss kernel: [ 1357.210864] ESI: f41fc2e8 EDI: f41fc000
> EBP: f417df68 ESP: f417df40
> Aug 16 01:21:55 dss kernel: [ 1357.210871] DS: 007b ES: 007b FS: 00d8
> GS: 00e0 SS: 0068 EFLAGS: 00010292
> Aug 16 01:21:55 dss kernel: [ 1357.210876] CR0: 80050033 CR2: b78e1010
> CR3: 1bbd7000 CR4: 000006d0
> Aug 16 01:21:55 dss kernel: [ 1357.210880] Call Trace:
> Aug 16 01:21:55 dss kernel: [ 1357.210887]  <SOFTIRQ>
> Aug 16 01:21:55 dss kernel: [ 1357.210903]  ? pfifo_fast_enqueue+0xf0/0xf0
> Aug 16 01:21:55 dss kernel: [ 1357.210913]  call_timer_fn+0x2f/0x130
> Aug 16 01:21:55 dss kernel: [ 1357.210921]  ? pfifo_fast_enqueue+0xf0/0xf0
> Aug 16 01:21:55 dss kernel: [ 1357.210930]  run_timer_softirq+0x1bd/0x3f0
> Aug 16 01:21:55 dss kernel: [ 1357.210944]  __do_softirq+0xb2/0x275
> Aug 16 01:21:55 dss kernel: [ 1357.210955]  ? __softirqentry_text_start+0x8/0x8
> Aug 16 01:21:55 dss kernel: [ 1357.210964]  call_on_stack+0x12/0x50
> Aug 16 01:21:55 dss kernel: [ 1357.210969]  </SOFTIRQ>
> Aug 16 01:21:55 dss kernel: [ 1357.210977]  ? irq_exit+0xc5/0xd0
> Aug 16 01:21:55 dss kernel: [ 1357.210986]  ?
> smp_apic_timer_interrupt+0x6c/0x130
> Aug 16 01:21:55 dss kernel: [ 1357.210996]  ? apic_timer_interrupt+0xd5/0xdc
> Aug 16 01:21:55 dss kernel: [ 1357.211007]  ? nmi+0x8b/0x198
> ----
>
> It looks eerily similar to the issue reported on this mailing list 8 years ago:
> https://www.spinics.net/lists/netdev/msg198234.html
>
> where locking was tweaked to compensate.
>
> When I compare the different kernels (4.19.132, 5.8.7), the driver's
> code base has changed little. The locking was changed a bit (relative
> to the patch that was confirmed to be a fix):
>
> 1. netif_tx_lock is used instead of
> spin_lock(&tx_ring->tx_lock);
>
> 2. locking has been removed in pch_gbe_xmit_frame
>
> Is this again an issue with missing locks?
>
> Since it has been quite some time since I did some kernel work, I
> thought it better to check first.
>
>
> --
> g. Marc



-- 
g. Marc