netdev - Re: r8169 driver from kernel 5.0 crashing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8f910b1339a741cdc780f3948c11a082a8a51b9e.camel@linux.intel.com>
Date:   Fri, 15 Mar 2019 15:09:08 -0700
From:   Alexander Duyck <alexander.h.duyck@...ux.intel.com>
To:     Heiner Kallweit <hkallweit1@...il.com>,
        VDR User <user.vdr@...il.com>
Cc:     netdev@...r.kernel.org
Subject: Re: r8169 driver from kernel 5.0 crashing - napi_consume_skb

On Fri, 2019-03-15 at 21:46 +0100, Heiner Kallweit wrote:
> On 15.03.2019 21:40, Alexander Duyck wrote:
> > On Fri, 2019-03-15 at 21:26 +0100, Heiner Kallweit wrote:
> > > On 15.03.2019 21:09, VDR User wrote:
> > > > > > > > Thanks for the additional info and for testing 4.20.15.
> > > > > > > > To rule out that the issue is caused by a regression in network or
> > > > > > > > some other subsystem: Can you take the r8169.c from 4.20.15 and test
> > > > > > > > it on top of 5.0?
> > > > > > > > Meanwhile I'll look at the changes in the driver between 4.20 and 5.0.
> > > > > > > 
> > > > > > > Sure, no problem! I'll copy the driver & recompile now actually.
> > > > > > > Hopefully there aren't a ton of changes to r8169.c to sift through and
> > > > > > > the cause isn't good at hiding itself!
> > > > > > > 
> > > > > > I checked the driver changes new in 5.0 and there are very few
> > > > > > functional changes. You could try to revert the following:
> > > > > > 
> > > > > > 5317d5c6d47e ("r8169: use napi_consume_skb where possible")
> > > > > 
> > > > > Will do, and fwiw, while I haven't been able to do tons of testing
> > > > > today, I haven't been able to trigger the crash after replacing
> > > > > 5.0.0's r8169.c with 4.20.15's r8169.c this morning. I'll restore the
> > > > > file and revert the change you mentioned, and report back my findings.
> > > > 
> > > > Heiner,
> > > > 
> > > > After going back to vanilla kernel 5.0 and then reverting 5317d5c6d47e
> > > > ("r8169: use napi_consume_skb where possible"), I so far have not had
> > > > any crashes after transferring roughly 30GB back & forth. I'm not
> > > > completely confident yet the crash is resolve with that revert and
> > > > will continue to do further testing throughout the weekend as well.
> > > > What confidence level do you have that 5317d5c6d47e is the culprit at
> > > > this point?
> > > > 
> > > Good, thanks for testing. I simply see no other change since 4.20 that
> > > could cause these symptoms.
> > > Using napi_consume_skb() at this place in r8169.c looks safe to me.
> > > Option 1 is that I miss something, option 2 is that there's an issue
> > > in the NAPI subsystem. However in the latter case I assume at least
> > > the Mellanox and/or Intel guys would have observed the same issue
> > > on their respective CI systems.
> > > Let me add Alexander, maybe he can provide a hint before we go and
> > > revert the change.
> > 
> > Do you have the crash log? I'd be curious what the issue is we are
> > seeing.
> > 
> > I agree I can't see anything obvious, but it is possible that we may be
> > running into something we hadn't seen with the Intel and Mellanox
> > parts.
> > 
> > - Alex
> > 
> > 
> Crash may have been the wrong word, network becomes unresponsive
> and the user sees the generic transmit queue timeout.
> Let me quote his original mail:
> 
> ------------------------------------------------------------------------
> 
> Hi, after updating to kernel 5.0, the nic driver (r8169) has been
> crashing whenever I start using heavy traffic on it (for example,
> xferring large files to the box across my lan). The destination
> harddrive may be sleeping and need to spin-up, or not, but the box
> itself does not suspend/hibernate. The nic becomes completely
> unresponsive and all connections to the box drop. After what I think
> is several minutes, the connection comes back to life. The problem
> happens consistently but seemingly not consistently at the same point.
> For example, I can xfer a few 4gb files and it will crash at around
> 2-3gb on the first file. The next time it might not crash until 2-3gb
> on the second file.Prior to kernel 5.0 I was using 4.19.12 and this
> problem didn't occur. I have since downgraded back to 4.19.12 pending
> what response this post gets.
> 
> Thanks for any help or assistance on how to proceed!
> -Derek
> 
> NOTE: I'm not subscribed to this mailing list so please CC me on any replies!
> 
> The nic is on-board ECS A75F-A:
> 04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
>         Subsystem: Elitegroup Computer Systems RTL8111/8168/8411 PCI
> Express Gigabit Ethernet Controller
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 17
>         Region 0: I/O ports at c000 [size=256]
>         Region 2: Memory at d0004000 (64-bit, prefetchable) [size=4K]
>         Region 4: Memory at d0000000 (64-bit, prefetchable) [size=16K]
>         Capabilities: <access denied>
>         Kernel driver in use: r8169
> 
> I see the following in the syslog:
> 
> [164572.785517] ------------[ cut here ]------------
> [164572.785595] NETDEV WATCHDOG: enp4s0 (r8169): transmit queue 0 timed out
> [164572.785677] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461
> dev_watchdog+0x1bb/0x1e0
> [164572.785730] Modules linked in: snd_hda_codec_realtek
> snd_hda_codec_generic snd_hda_codec_hdmi ohci_pci snd_hda_intel
> snd_hda_codec snd_hwdep xhci_pci ehci_pci ohci_hcd xhci_hcd ehci_hcd
> usbcore usb_common snd_hda_core snd_pcm snd_timer snd soundcore nfsd
> auth_rpcgss oid_registry lockd grace sunrpc ip_tables x_tables ipv6
> [164572.785807] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-amd #2
> [164572.785842] Hardware name: ECS A75F-A/A75F-A, BIOS 4.6.5 09/14/2011
> [164572.785878] EIP: dev_watchdog+0x1bb/0x1e0
> [164572.785912] Code: 8b 50 3c 89 f8 e8 3d aa 0a 00 8b 7e f4 eb a4 89
> f8 c6 05 e7 1c 6d c1 01 e8 72 4f fd ff 53 50 57 68 78 05 66 c1 e8 25
> ad ba ff <0f> 0b 83 c4 10 eb c9 eb 1c 8d b4 26 00 00 00 00 8d b4 26 00
> 00 00
> [164572.785963] EAX: 0000003b EBX: 00000000 ECX: 00000800 EDX: 00000103
> [164572.785998] ESI: f4cbc264 EDI: f4cbc000 EBP: f4c99f74 ESP: f4c99f4c
> [164572.786033] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210296
> [164572.786067] CR0: 80050033 CR2: 020dc000 CR3: 2b80b000 CR4: 00000690
> [164572.786102] Call Trace:
> [164572.786135]  <SOFTIRQ>
> [164572.786168]  ? qdisc_put_unlocked+0x40/0x40
> [164572.786203]  call_timer_fn+0x19/0xa0
> [164572.786237]  run_timer_softirq+0x337/0x380
> [164572.786270]  ? qdisc_put_unlocked+0x40/0x40
> [164572.786302]  ? rcu_process_callbacks+0xcb/0x380
> [164572.786337]  __do_softirq+0xd6/0x21c
> [164572.786370]  ? __irqentry_text_end+0x18/0x18
> [164572.786404]  call_on_stack+0x10/0x60
> [164572.786435]  </SOFTIRQ>
> [164572.786467]  ? irq_exit+0x91/0xc0
> [164572.786495]  ? smp_apic_timer_interrupt+0x56/0xa0
> [164572.786525]  ? apic_timer_interrupt+0xd5/0xdc
> [164572.786555]  ? acpi_idle_enter_s2idle+0x60/0x60
> [164572.786584]  ? cpuidle_enter_state+0x122/0x360
> [164572.786614]  ? tick_nohz_idle_stop_tick+0x27b/0x2a0
> [164572.786644]  ? cpuidle_enter+0xf/0x20
> [164572.786673]  ? call_cpuidle+0x1c/0x40
> [164572.786701]  ? do_idle+0x1e6/0x220
> [164572.786730]  ? cpu_startup_entry+0x25/0x40
> [164572.786758]  ? start_secondary+0x1a5/0x220
> [164572.786787]  ? startup_32_smp+0x15f/0x164
> [164572.786816] ---[ end trace 662a5195537dbad8 ]---
> 

Okay, the patch suggested doesn't make much sense for the issue
described. I was expecting to see something that at least pointed to
the NAPI logic, and I don't see anything doing as such.

The issue looks like a problem of the ring being stopped at the QDISC
layer for an extended period of time. Basically somebody called
netif_stop_queue and then never got around to calling netif_wake_queue
to let the ring resume transmitting.

My first thoughts would be to look at an interrupt that are getting
masked and lost, or some ring accounting issue where you are flagging
the ring to stop and then not waking it up due to a math or logic error
somewhere. It looks like there were multiple patches messing with these
sort of items between 4.20 and 5.0 so I would recommend looking there.

- Alex