Message-ID: <fef628e51694ab7bae2098b762328830@nuclearcat.com>
Date: Wed, 21 Feb 2018 21:30:49 +0200
From: Denys Fedoryshchenko <nuclearcat@...learcat.com>
To: Guillaume Nault <g.nault@...halink.fr>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: ppp/pppoe, still panic 4.15.3 in ppp_push
On 2018-02-21 20:55, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
>> It seems that even rebuilding a seemingly stable version triggers
>> crashes too (but different ones)
> Different ones? The trace following your message looks very similar to
> your first KASAN report. Or are you referring to the lockup you posted
> on Sun, 18 Feb 2018?
>
> Also, which stable versions are you referring to?
The trace I sent in the previous email is from the latest vanilla kernel,
just with more debug options enabled and a few options disabled.
One of the disabled options was CONFIG_XFRM, which was spitting errors in
nf_xfrm_me_harder (obviously a bug; I reported it).
I also disabled namespaces, as they are often a source of trouble.
Today I will try reverting just these three commits:
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t
I suspect that previously, after reverting these patches, I got a
different kernel panic (I didn't notice it at the time, and now it is too
late to tell it apart from the other crashes); it seems it was not a
KASAN report.
I will report results after testing; unfortunately I can't test more than
once per day.
"Stable" for me was 4.14.2, but it looks like I am now getting a
different issue on that kernel.
I will paste it below.
Another observation: just an hour ago, on another server where I am
testing 4.15 and 4.14.20 (it was running 4.14.20 at that moment, without
debug options), I killed accel-pppd (the PPPoE server software) with 8k
sessions online and got some weird behaviour. The accel-pppd process got
stuck, as did ifconfig and "ip link"; even kexec -e didn't work (it got
stuck too) unless I ran kexec -e -x (so that it won't try to bring the
interfaces down on kexec).
I will try to reproduce this bug as well, with debugging enabled (lockdep
and so on). I hope it is not related.
>
> I'm interested in the ppp_generic.o file that produced the following
> trace. Just to be sure that the differences come from the new debugging
> options.
Kernel config and ppp_generic.o:
https://nuclearcat.com/bughunting/config.txt
https://nuclearcat.com/bughunting/ppp_generic.o
This is from 4.14.2, which was seemingly stable before:
[50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out
[50401.389014] ------------[ cut here ]------------
[50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320
dev_watchdog+0x15c/0x1b9
[50401.389925] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
4.14.2-build-0134 #4
[50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50401.392513] task: ffff880434d72640 task.stack: ffffc90001914000
[50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9
[50401.393155] RSP: 0018:ffff8804364c3e90 EFLAGS: 00010286
[50401.393470] RAX: 0000000000000039 RBX: ffff88042f6e0000 RCX:
0000000000000000
[50401.393787] RDX: 0000000000000001 RSI: 0000000000000002 RDI:
ffffffff828dbc64
[50401.394103] RBP: ffff8804364c3eb0 R08: 0000000000000001 R09:
0000000000000000
[50401.394420] R10: 0000000000000002 R11: ffff8803fa075c00 R12:
0000000000000001
[50401.394739] R13: 0000000000000040 R14: 0000000000000003 R15:
ffffffff81e05108
[50401.395064] FS: 0000000000000000(0000) GS:ffff8804364c0000(0000)
knlGS:0000000000000000
[50401.395645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50401.395970] CR2: 00007fff25fc20a8 CR3: 0000000001e09005 CR4:
00000000001606e0
[50401.396294] Call Trace:
[50401.396613] <IRQ>
[50401.396934] ? qdisc_rcu_free+0x3f/0x3f
[50401.397255] call_timer_fn.isra.4+0x17/0x7b
[50401.397576] expire_timers+0x6f/0x7e
[50401.397899] run_timer_softirq+0x6d/0x8f
[50401.398219] ? ktime_get+0x3b/0x8c
[50401.398540] ? lapic_next_event+0x18/0x1c
[50401.398862] ? clockevents_program_event+0xa3/0xbb
[50401.399186] __do_softirq+0xbc/0x1ab
[50401.399510] irq_exit+0x4d/0x8e
[50401.399832] smp_apic_timer_interrupt+0x73/0x80
[50401.400157] apic_timer_interrupt+0x8d/0xa0
[50401.400480] </IRQ>
[50401.400801] RIP: 0010:mwait_idle+0x4e/0x61
[50401.401123] RSP: 0018:ffffc90001917ec0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50401.401714] RAX: 0000000000000000 RBX: ffff880434d72640 RCX:
0000000000000000
[50401.402037] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
[50401.402362] RBP: ffffc90001917ec0 R08: 0000000000000000 R09:
0000000000000001
[50401.402685] R10: ffffc90001917e58 R11: 000000000000037a R12:
0000000000000000
[50401.403008] R13: 0000000000000000 R14: ffff880434d72640 R15:
ffff880434d72640
[50401.403336] arch_cpu_idle+0xa/0xc
[50401.403660] default_idle_call+0x20/0x22
[50401.403983] do_idle+0xb3/0x13b
[50401.404303] cpu_startup_entry+0x1a/0x1c
[50401.404627] start_secondary+0x102/0x105
[50401.404949] secondary_startup_64+0xa5/0xa5
[50401.405271] Code: d9 54 58 00 00 75 38 48 89 df c6 05 cd 54 58 00 01
e8 dc 24 fe ff 44 89 e1 48 89 c2 48 89 de 48 c7 c7 51 df d6 81 e8 db f6
7c ff <0f> ff eb 10 41 ff c4 48 05 40 01 00 00 41 39 cc 75 9a eb 0d 48
[50401.406154] ---[ end trace 433f9a49b32a3b4f ]---
[50401.406489] ixgbe 0000:04:00.1 eth1: initiating reset due to tx
timeout
[50401.406828] ixgbe 0000:04:00.1 eth1: Reset adapter
[50416.686709] watchdog: BUG: soft lockup - CPU#5 stuck for 23s!
[tc:24389]
[50416.687042] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50416.687707] watchdog: BUG: soft lockup - CPU#7 stuck for 23s!
[accel-pppd:24377]
[50416.687708] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50416.687720] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W
4.14.2-build-0134 #4
[50416.687721] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.687722] task: ffff8803f278d940 task.stack: ffffc90011a38000
[50416.687727] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687728] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50416.687729] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX:
ffff8804365d9a40
[50416.687730] RDX: ffff880436419a60 RSI: 0000000000200000 RDI:
ffff8804035abfa0
[50416.687730] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09:
00000000ffffffff
[50416.687731] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12:
ffff88041a5fde00
[50416.687731] R13: 0000000000008021 R14: ffff8803fc5778c0 R15:
ffff8803fc577840
[50416.687732] FS: 00007f7b5726e700(0000) GS:ffff8804365c0000(0000)
knlGS:0000000000000000
[50416.687733] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.687733] CR2: 0000000001d5f408 CR3: 000000042bf7c006 CR4:
00000000001606e0
[50416.687734] Call Trace:
[50416.687738] _raw_spin_lock+0x1a/0x1c
[50416.687741] ppp_push+0x56/0x49b [ppp_generic]
[50416.687743] ? __kmalloc_node_track_caller+0xba/0xdf
[50416.687746] __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
[50416.687747] ppp_xmit_process+0x39/0x8d [ppp_generic]
[50416.687749] ppp_write+0xbb/0xc7 [ppp_generic]
[50416.687750] __vfs_write+0x21/0x111
[50416.687752] ? handle_mm_fault+0xb1/0xc0
[50416.687753] vfs_write+0xcd/0x176
[50416.687754] SyS_write+0x49/0x83
[50416.687755] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.687756] RIP: 0033:0x7f7b5ba4e14d
[50416.687756] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[50416.687757] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX:
00007f7b5ba4e14d
[50416.687758] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI:
0000000000004e9e
[50416.687758] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09:
0000000000633e00
[50416.687758] R10: 0000000000000000 R11: 0000000000000293 R12:
0000000000000000
[50416.687759] R13: 00007f7b5716c4ff R14: 0000000000000003 R15:
00007f7b5716c590
[50416.687759] Code: 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48
81 c2 40 9a 01 00 48 03 14 c5 a0 73 d7 81 48 89 0a 8b 41 08 85 c0 75 04
f3 90 <eb> f5 48 8b 11 48 85 d2 74 03 0f 18 0a 44 8b 07 66 45 85 c0 0f
[50416.687771] Kernel panic - not syncing: softlockup: hung tasks
[50416.687772] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W
L 4.14.2-build-0134 #4
[50416.687773] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.687773] Call Trace:
[50416.687774] <IRQ>
[50416.687776] dump_stack+0x4d/0x63
[50416.687778] panic+0xcd/0x215
[50416.687780] watchdog_timer_fn+0x1a4/0x1c6
[50416.687782] __hrtimer_run_queues+0x76/0xc0
[50416.687783] hrtimer_interrupt+0xa0/0x162
[50416.687785] smp_apic_timer_interrupt+0x6e/0x80
[50416.687786] apic_timer_interrupt+0x8d/0xa0
[50416.687786] </IRQ>
[50416.687787] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687788] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff10
[50416.687789] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX:
ffff8804365d9a40
[50416.687789] RDX: ffff880436419a60 RSI: 0000000000200000 RDI:
ffff8804035abfa0
[50416.687789] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09:
00000000ffffffff
[50416.687790] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12:
ffff88041a5fde00
[50416.687790] R13: 0000000000008021 R14: ffff8803fc5778c0 R15:
ffff8803fc577840
[50416.687792] _raw_spin_lock+0x1a/0x1c
[50416.687793] ppp_push+0x56/0x49b [ppp_generic]
[50416.687794] ? __kmalloc_node_track_caller+0xba/0xdf
[50416.687795] __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
[50416.687797] ppp_xmit_process+0x39/0x8d [ppp_generic]
[50416.687798] ppp_write+0xbb/0xc7 [ppp_generic]
[50416.687798] __vfs_write+0x21/0x111
[50416.687799] ? handle_mm_fault+0xb1/0xc0
[50416.687800] vfs_write+0xcd/0x176
[50416.687801] SyS_write+0x49/0x83
[50416.687802] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.687802] RIP: 0033:0x7f7b5ba4e14d
[50416.687803] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[50416.687803] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX:
00007f7b5ba4e14d
[50416.687804] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI:
0000000000004e9e
[50416.687804] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09:
0000000000633e00
[50416.687805] R10: 0000000000000000 R11: 0000000000000293 R12:
0000000000000000
[50416.687805] R13: 00007f7b5716c4ff R14: 0000000000000003 R15:
00007f7b5716c590
[50416.723251] CPU: 5 PID: 24389 Comm: tc Tainted: G W L
4.14.2-build-0134 #4
[50416.723831] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
04/02/2015
[50416.724151] task: ffff8803f0f19980 task.stack: ffffc90011a70000
[50416.724477] RIP: 0010:queued_spin_lock_slowpath+0x54/0x16e
[50416.724799] RSP: 0018:ffffc90011a73940 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff10
[50416.725384] RAX: 0000000000000101 RBX: ffff8803fc577000 RCX:
0000000000000101
[50416.725709] RDX: 0000000000000100 RSI: 0000000000000001 RDI:
ffff8803fc5778c0
[50416.726034] RBP: ffffc90011a73940 R08: 0000000000000001 R09:
0000000000002e88
[50416.726355] R10: 0000000000000024 R11: 0000000000000000 R12:
ffff8803eea1d060
[50416.726678] R13: ffff8803fc5778c0 R14: ffffc90011a739f0 R15:
0000000000000000
[50416.727001] FS: 00007f72228f2700(0000) GS:ffff880436540000(0000)
knlGS:0000000000000000
[50416.727582] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.727904] CR2: 0000000001aea038 CR3: 00000003fb09a001 CR4:
00000000001606e0
[50416.728230] Call Trace:
[50416.728550] _raw_spin_lock_bh+0x25/0x27
[50416.728872] ppp_get_stats64+0x4c/0xaf [ppp_generic]
[50416.729195] dev_get_stats+0x33/0x84
[50416.729513] rtnl_fill_stats+0x41/0x121
[50416.729831] rtnl_fill_ifinfo+0x567/0xdf0
[50416.730148] rtnl_dump_ifinfo+0x232/0x2e0
[50416.730470] ? __alloc_pages_nodemask+0xd2/0xa99
[50416.730790] ? handle_mm_fault+0xb1/0xc0
[50416.731109] ? __do_page_fault+0x31c/0x3b6
[50416.731431] netlink_dump+0xfc/0x271
[50416.731749] netlink_recvmsg+0x1af/0x2d9
[50416.732068] sock_recvmsg+0x14/0x16
[50416.732388] ___sys_recvmsg+0xea/0x1a1
[50416.732708] ? lru_deactivate_file_fn+0x29f/0x29f
[50416.733028] ? pagevec_lru_move_fn+0xad/0xc1
[50416.733348] ? alloc_pages_vma+0x152/0x19c
[50416.733669] ? page_add_new_anon_rmap+0xa0/0xa9
[50416.733988] ? __handle_mm_fault+0xda4/0xf40
[50416.734310] ? sock_sendmsg+0x12/0x1d
[50416.734628] __sys_recvmsg+0x3d/0x5e
[50416.734945] ? __sys_recvmsg+0x3d/0x5e
[50416.735264] SyS_recvmsg+0xd/0x17
[50416.735576] entry_SYSCALL_64_fastpath+0x17/0x98
[50416.735890] RIP: 0033:0x7f7221afdac0
[50416.736202] RSP: 002b:00007ffcbbc1f328 EFLAGS: 00000246 ORIG_RAX:
000000000000002f
[50416.736772] RAX: ffffffffffffffda RBX: 00007f7221da8b98 RCX:
00007f7221afdac0
[50416.737092] RDX: 0000000000000000 RSI: 00007ffcbbc1f390 RDI:
0000000000000003
[50416.737411] RBP: 00007f7221da8b98 R08: 0000000000007d7c R09:
0000000000000000
[50416.737734] R10: 00007ffcbbc1f3d0 R11: 0000000000000246 R12:
000000000066a520
[50416.738055] R13: 0000000000000040 R14: 00007f7221da8b40 R15:
0000000000002710
[50416.738377] Code: ff ff 75 33 83 fe 01 89 ca 89 f0 41 0f 45 d0 f0 0f
b1 17 39 c6 74 04 89 c6 eb e1 ff ca 0f 84 20 01 00 00 8b 07 84 c0 74 04
f3 90 <eb> f6 66 c7 07 01 00 e9 0c 01 00 00 48 c7 c0 40 9a 01 00 65 48
Feb 21 00:04:31 10.0.252.9 [50417.724420] Shutting down cpus with NMI
Feb 21 00:04:31 10.0.252.9 [50417.724748] Kernel Offset: disabled
Feb 21 00:04:31 10.0.252.9 [50417.728891] Rebooting in 5 seconds..
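
For anyone skimming a dump like the one above, here is a minimal Python
sketch that lists which CPUs reported a soft lockup and which trusted call
frames (those without a leading "?") belong to a given module. The helper
names (rejoin, lockups, module_frames) and the regular expressions are
illustrative assumptions, tuned only to the line shapes visible in this
log; it is not a general oops parser.

```python
import re

# Kernel timestamp such as "[50416.687707]"; netconsole may prefix it.
TS_RE = re.compile(r"\[\s*\d+\.\d+\]")
# "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [tc:24389]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
    r"\s*\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# "ppp_push+0x56/0x49b [ppp_generic]", optionally preceded by "? "
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?\s*$"
)


def rejoin(lines):
    """Glue mail-wrapped continuation lines onto the previous log line.

    Pieces are joined with a single space, so a token that was split
    mid-word by the mailer stays split.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not out or TS_RE.search(line):
            out.append(line)
        else:
            out[-1] += " " + line.strip()
    return out


def lockups(lines):
    """Return (cpu, comm, pid, seconds) for each soft-lockup report."""
    found = []
    for line in rejoin(lines):
        m = LOCKUP_RE.search(line)
        if m:
            found.append((int(m.group("cpu")), m.group("comm"),
                          int(m.group("pid")), int(m.group("secs"))))
    return found


def module_frames(lines, module):
    """Return symbols of trusted (non-'?') frames that live in `module`."""
    syms = []
    for line in rejoin(lines):
        m = FRAME_RE.search(line)
        if m and m.group("mod") == module and not m.group("guess"):
            syms.append(m.group("sym"))
    return syms


# A few lines copied from the log above, wrapped the way the mail shows them.
SAMPLE = """\
[50416.686709] watchdog: BUG: soft lockup - CPU#5 stuck for 23s!
[tc:24389]
[50416.687707] watchdog: BUG: soft lockup - CPU#7 stuck for 23s!
[accel-pppd:24377]
[50416.687738] _raw_spin_lock+0x1a/0x1c
[50416.687741] ppp_push+0x56/0x49b [ppp_generic]
[50416.687743] ? __kmalloc_node_track_caller+0xba/0xdf
[50416.728872] ppp_get_stats64+0x4c/0xaf [ppp_generic]
""".splitlines()

if __name__ == "__main__":
    for cpu, comm, pid, secs in lockups(SAMPLE):
        print("CPU", cpu, comm, pid, "stuck for", secs, "s")
    print(module_frames(SAMPLE, "ppp_generic"))
```

Run against the full log, this only groups what the kernel already printed
(two CPUs spinning on locks under ppp_push and ppp_get_stats64); it does
not by itself say which lock is held or by whom.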