netdev - RE: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi

Message-ID: <90A7E81AE28BAE4CBDDB3B35F187D26440752710@CHN-SV-EXMX02.mchp-main.com>
Date:   Tue, 20 Nov 2018 21:39:44 +0000
From:   <Bryan.Whitehead@...rochip.com>
To:     <andrew@...n.ch>
CC:     <davem@...emloft.net>, <netdev@...r.kernel.org>,
        <UNGLinuxDriver@...rochip.com>
Subject: RE: [PATCH v1 net] lan743x: fix return value for
 lan743x_tx_napi_poll

> -----Original Message-----
> From: Andrew Lunn <andrew@...n.ch>
> Sent: Tuesday, November 20, 2018 2:31 PM
> To: Bryan Whitehead - C21958 <Bryan.Whitehead@...rochip.com>
> Cc: davem@...emloft.net; netdev@...r.kernel.org; UNGLinuxDriver
> <UNGLinuxDriver@...rochip.com>
> Subject: Re: [PATCH v1 net] lan743x: fix return value for
> lan743x_tx_napi_poll
> 
> On Tue, Nov 20, 2018 at 01:26:43PM -0500, Bryan Whitehead wrote:
> > It has been noticed that under stress the lan743x driver will
> > sometimes hang or cause a kernel panic. It has been noticed that
> > returning '0' instead of 'weight' fixes this issue.
> >
> > fixes: rare kernel panic under heavy traffic load.
> > Signed-off-by: Bryan Whitehead <Bryan.Whitehead@...rochip.com>
> 
> Hi Bryan
> 
> This sounds like a band aid over something which is broken, not a real fix.
> 
> Can you show us the stack trace from the panic?
> 
>     Andrew

Andrew,

Admittedly, my knowledge of what the kernel is doing behind the scenes is limited.

But the documentation at
https://wiki.linuxfoundation.org/networking/napi

states the following:
"The poll() function may also process TX completions, in which case if it processes
the entire TX ring then it should count that work as the rest of the budget.
Otherwise, TX completions are not counted."

So based on that, the original driver was returning the full budget. But I was having
issues with it. And the above documentation seems to suggest that I could return 0,
as in "not counted" from above.

I tried it, and my lock up issues disappeared.

Regarding the kernel panic stack trace: so far it's very hard to replicate that on the
latest kernel. I've seen it more frequently when backporting to older kernels such
as 4.14 and 4.9. This same fix caused those kernel panics to disappear.
Are you interested in seeing a stack dump from older kernels?

In the latest kernel the issue manifests as a kernel message which states
"[  945.021101] enp48s0: Budget exhausted after napi rescheduled"

I'm not sure what that means. The driver does not lock up immediately after that
message, but it usually locks up within a minute of it.

And sometimes I get the following warning:
[ 1240.425020] ------------[ cut here ]------------
[ 1240.426014] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
[ 1240.430027] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x1ef/0x200
[ 1240.430027] Modules linked in: lan743x
[ 1240.430027] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I       4.19.2 #1
[ 1240.430027] Hardware name: Hewlett-Packard HP Compaq dc7900 Convertible Minitower/3032h, BIOS 786G1 v01.16 03/05/2009
[ 1240.430027] RIP: 0010:dev_watchdog+0x1ef/0x200
[ 1240.430027] Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 68 30 b3 00 01 e8 25 3d fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 98 92 48 ab e8 f1 28 87 ff <0f> 0b eb c0 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 08 00
[ 1240.430027] RSP: 0018:ffff98490be03e90 EFLAGS: 00010282
[ 1240.430027] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1240.497168] RDX: 0000000000040400 RSI: 00000000000000f6 RDI: 0000000000000300
[ 1240.497168] RBP: ffff984908574440 R08: 0000000000000000 R09: 00000000000003a4
[ 1240.497168] R10: 0000000000000020 R11: ffffffffabc928ed R12: ffff984908574000
[ 1240.497168] R13: 0000000000000000 R14: 0000000000000000 R15: ffff98490be195b0
[ 1240.497168] FS:  0000000000000000(0000) GS:ffff98490be00000(0000) knlGS:0000000000000000
[ 1240.497168] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1240.497168] CR2: 00007f31cd4c0000 CR3: 0000000109bca000 CR4: 00000000000406f0
[ 1240.497168] Call Trace:
[ 1240.497168]  <IRQ>
[ 1240.497168]  ? qdisc_reset+0xe0/0xe0
[ 1240.497168]  call_timer_fn+0x26/0x130
[ 1240.497168]  run_timer_softirq+0x1cd/0x400
[ 1240.497168]  ? hpet_interrupt_handler+0x10/0x30
[ 1240.497168]  __do_softirq+0xed/0x2aa
[ 1240.497168]  irq_exit+0xb7/0xc0
[ 1240.497168]  do_IRQ+0x45/0xd0
[ 1240.497168]  common_interrupt+0xf/0xf
[ 1240.497168]  </IRQ>
[ 1240.497168] RIP: 0010:cpuidle_enter_state+0xa6/0x330
[ 1240.497168] Code: 65 8b 3d 1d b0 4d 55 e8 58 6a 95 ff 48 89 c3 66 66 66 66 90 31 ff e8 59 73 95 ff 80 7c 24 0b 00 0f 85 25 02 00 00 fb 4c 29 eb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
[ 1240.497168] RSP: 0018:ffffffffab603e60 EFLAGS: 00000216 ORIG_RAX: ffffffffffffffde
[ 1240.497168] RAX: ffff98490be20a80 RBX: 000000000081035c RCX: 00000120cf178c49
[ 1240.497168] RDX: 00000120cf178ca0 RSI: 00000120cf178ca0 RDI: 0000000000000000
[ 1240.497168] RBP: ffff984908fbd000 R08: fffffffb58ea5f9e R09: 000001208e0b48df
[ 1240.497168] R10: 00000000000018c4 R11: 0000000000002468 R12: 0000000000000002
[ 1240.497168] R13: 00000120ce968944 R14: ffffffffab6a68a0 R15: ffffffffab611740
[ 1240.497168]  do_idle+0x1da/0x230
[ 1240.497168]  cpu_startup_entry+0x6a/0x70
[ 1240.497168]  start_kernel+0x4a2/0x4c2
[ 1240.497168]  secondary_startup_64+0xa4/0xb0
[ 1240.497168] ---[ end trace c6f3be34c214db4e ]---

Notice the warning refers to a different adapter (enp0s25, e1000e). So I suspect that
whatever happened froze all network adapters.

If you have suggestions let me know.

Or if you would like to see the kernel panics from older kernels let me know.

Regards,
Bryan