netdev - Re: probably serious conntrack/netfilter panic, 4.8.14, timers and intel turbo

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0abc1c5d671644e9efc63ad8a1171f9a@nuclearcat.com>
Date:   Wed, 11 Jan 2017 19:31:28 +0200
From:   Denys Fedoryshchenko <nuclearcat@...learcat.com>
To:     Guillaume Nault <g.nault@...halink.fr>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        linux-kernel@...r.kernel.org,
        Pablo Neira Ayuso <pablo@...filter.org>,
        netfilter-devel@...r.kernel.org
Subject: Re: probably serious conntrack/netfilter panic, 4.8.14, timers and
 intel turbo

On 2017-01-11 19:22, Guillaume Nault wrote:
> Cc: netfilter-devel@...r.kernel.org, I'm afraid I'll need some help
> for this case.
> 
> On Sat, Dec 17, 2016 at 09:48:13PM +0200, Denys Fedoryshchenko wrote:
>> Hi,
>> 
>> I posted recently several netfilter related crashes, didn't got any 
>> answers,
>> one of them started to happen quite often on loaded NAT (17Gbps),
>> so after trying endless ways to make it stable, i found out that in
>> backtrace i can often see timers, and this bug probably appearing on 
>> older
>> releases,
>> i've seen such backtrace with timer fired for conntrack on them.
>> I disabled Intel turbo for cpus on this loaded NAT, and voila, panic
>> disappeared for 2nd day!
>> * by wrmsr -a 0x1a0 0x4000850089
>> I am not sure timers is the reason, but probably turbo creating some
>> condition for bug.
>> 
> 
> Re-formatting the stack-trace for easier reference:
> 
> [28904.162607] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000008
> [28904.163210] IP: [<ffffffffa00ab07d>]
> nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
> [28904.163745] PGD 0
> [28904.164058] Oops: 0002 [#1] SMP
> [28904.164323] Modules linked in: nf_nat_pptp nf_nat_proto_gre
> xt_TCPMSS xt_connmark ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat
> xt_rateest xt_RATEEST nf_conntrack_pptp nf_conntrack_proto_gre xt_CT
> xt_set xt_hl xt_tcpudp ip_set_hash_net ip_set nfnetlink iptable_raw
> iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables
> netconsole configfs 8021q garp mrp stp llc bonding ixgbe dca
> [28904.168132] CPU: 27 PID: 0 Comm: swapper/27 Not tainted 
> 4.8.14-build-0124 #2
> [28904.168398] Hardware name: Intel Corporation S2600WTT/S2600WTT,
> BIOS SE5C610.86B.01.01.1008.031920151331 03/19/2015
> [28904.168853] task: ffff885fa42e8c40 task.stack: ffff885fa42f0000
> [28904.169114] RIP: 0010:[<ffffffffa00ab07d>] [<ffffffffa00ab07d>]
> nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
> [28904.169643] RSP: 0018:ffff885fbccc3dd8 EFLAGS: 00010246
> [28904.169901] RAX: 0000000000000000 RBX: ffff885fbccc0000 RCX: 
> ffff885fbccc0010
> [28904.170169] RDX: ffff885f87a1c150 RSI: 0000000000000142 RDI: 
> ffff885fbccc0000
> [28904.170437] RBP: ffff885fbccc3de8 R08: 00000000cbdee177 R09: 
> 0000000000000100
> [28904.170704] R10: ffff885fbccc3dd0 R11: ffffffff820050c0 R12: 
> ffff885f87a1c140
> [28904.170971] R13: 000000000005d948 R14: 00000000000ea942 R15: 
> ffff885f87a1c160
> [28904.171237] FS: 0000000000000000(0000) GS:ffff885fbccc0000(0000)
> knlGS:0000000000000000
> [28904.171688] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [28904.171964] CR2: 0000000000000008 CR3: 000000607f006000 CR4: 
> 00000000001406e0
> [28904.172231] Stack:
> [28904.172482] ffff885f87a1c140 ffffffff820a1405 ffff885fbccc3e28
> ffffffffa00abb30
> [28904.173182] 00000002820a1405 ffff885f87a1c140 ffff885f99a28201
> 0000000000000000
> [28904.173884] 0000000000000000 ffffffff820050c8 ffff885fbccc3e58
> ffffffffa00abc62
> [28904.174585] Call Trace:
> [28904.174835] <IRQ>
> [28904.174912] [<ffffffffa00abb30>] nf_ct_delete_from_lists+0xc9/0xf2
> [nf_conntrack]
> [28904.175613] [<ffffffffa00abc62>] nf_ct_delete+0x109/0x12c 
> [nf_conntrack]
> [28904.175894] [<ffffffffa00abc85>] ? nf_ct_delete+0x12c/0x12c 
> [nf_conntrack]
> [28904.176169] [<ffffffffa00abc92>] death_by_timeout+0xd/0xf 
> [nf_conntrack]
> [28904.176443] [<ffffffff81109922>] call_timer_fn.isra.5+0x17/0x6b
> [28904.176714] [<ffffffff811099e5>] expire_timers+0x6f/0x7e
> [28904.176975] [<ffffffff81109add>] run_timer_softirq+0x69/0x8b
> [28904.177238] [<ffffffff811141bb>] ? 
> clockevents_program_event+0xd0/0xe8
> [28904.177504] [<ffffffff810d000c>] __do_softirq+0xbd/0x1aa
> [28904.177765] [<ffffffff810d0240>] irq_exit+0x37/0x7c
> [28904.178026] [<ffffffff8102c519>] 
> smp_trace_apic_timer_interrupt+0x7b/0x88
> [28904.178300] [<ffffffff8102c52f>] smp_apic_timer_interrupt+0x9/0xb
> [28904.178565] [<ffffffff818f4f1c>] apic_timer_interrupt+0x7c/0x90
> [28904.178835] <EOI>
> [28904.178907] [<ffffffff8101b929>] ? mwait_idle+0x64/0x7a
> [28904.179436] [<ffffffff810e3d8f>] ? 
> atomic_notifier_call_chain+0x13/0x15
> [28904.179712] [<ffffffff8101bd04>] arch_cpu_idle+0xa/0xc
> [28904.179976] [<ffffffff810f7c3d>] default_idle_call+0x27/0x29
> [28904.180244] [<ffffffff810f7d5c>] cpu_startup_entry+0x11d/0x1c7
> [28904.180508] [<ffffffff8102af13>] start_secondary+0xe8/0xeb
> [28904.180767] Code: 80 2f 0b 82 48 89 df e8 da 90 84 e1 48 8b 43 10
> 49 8d 54 24 10 48 8d 4b 10 49 89 4c 24 18 a8 01 49 89 44 24 10 48 89
> 53 10 75 04 <89> 50 08 c6 03 00 5b 41 5c 5d c3 48 8b 05 10 be 00 00 89
> f6
> [28904.185546] RIP [<ffffffffa00ab07d>]
> nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
> [28904.186065] RSP <ffff885fbccc3dd8>
> [28904.186319] CR2: 0000000000000008
> [28904.186593] ---[ end trace 35cbc6c885a5c2d8 ]---
> [28904.186860] Kernel panic - not syncing: Fatal exception in interrupt
> [28904.187155] Kernel Offset: disabled
> [28904.187419] Rebooting in 5 seconds..
> [28909.193662] ACPI MEMORY or I/O RESET_REG.
> 
> And here's decodecode's output:
> 
> All code
> ========
>    0:   80 2f 0b                subb   $0xb,(%rdi)
>    3:   82                      (bad)
>    4:   48 89 df                mov    %rbx,%rdi
>    7:   e8 da 90 84 e1          callq  0xffffffffe18490e6
>    c:   48 8b 43 10             mov    0x10(%rbx),%rax
>   10:   49 8d 54 24 10          lea    0x10(%r12),%rdx
>   15:   48 8d 4b 10             lea    0x10(%rbx),%rcx
>   19:   49 89 4c 24 18          mov    %rcx,0x18(%r12)
>   1e:   a8 01                   test   $0x1,%al
>   20:   49 89 44 24 10          mov    %rax,0x10(%r12)
>   25:   48 89 53 10             mov    %rdx,0x10(%rbx)
>   29:   75 04                   jne    0x2f
>   2b:*  89 50 08                mov    %edx,0x8(%rax)           <--
> trapping instruction
>   2e:   c6 03 00                movb   $0x0,(%rbx)
>   31:   5b                      pop    %rbx
>   32:   41 5c                   pop    %r12
>   34:   5d                      pop    %rbp
>   35:   c3                      retq
>   36:   48 8b 05 10 be 00 00    mov    0xbe10(%rip),%rax        # 
> 0xbe4d
>   3d:   89 f6                   mov    %esi,%esi
> 
> 
> Fortunately, nf_ct_add_to_dying_list() is small:
> 
> static void nf_ct_add_to_dying_list(struct nf_conn *ct)
> {
> 	struct ct_pcpu *pcpu;
> 
> 	/* add this conntrack to the (per cpu) dying list */
> 	ct->cpu = smp_processor_id();
> 	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
> 
> 	spin_lock(&pcpu->lock);
> 	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
> 			     &pcpu->dying);
> 	spin_unlock(&pcpu->lock);
> }
> 
> 
> So "callq  0xffffffffe18490e6" probably refers to the
> "spin_lock(&pcpu->lock)" call, which means RBX == pcpu.
> Considering the offsets used, we certainly have R12 == ct.
> 
> From there, we can interpret the assembly code as follow:
>   Original register values:
>     rbx:        pcpu
>     0x10(%rbx): &pcpu->dying.first
> 
>     r12:        ct
>     0x10(%r12): &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
>     0x18(%r12): &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev
> 
>   Code:
>     mov    %rbx,%rdi          ; RBX == pcpu
>     callq  0xffffffffe18490e6 ; spin_lock(&pcpu->lock)
> 
>     mov    0x10(%rbx),%rax    ; rax = pcpu->dying.first
>     lea    0x10(%r12),%rdx    ; rdx =
> &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
>     lea    0x10(%rbx),%rcx    ; rcx = &pcpu->dying.first
> 
>     mov    %rcx,0x18(%r12)    ;
> ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = &pcpu->dying.first
> 
>     test   $0x1,%al           ; test: is_a_nulls(pcpu->dying.first)
> 
>     mov    %rax,0x10(%r12)    ;
> &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next = pcpu->dying.first
>     mov    %rdx,0x10(%rbx)    ; pcpu->dying.first =
> &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode
> 
>     jne    0x2f               ; jump if is_a_nulls(pcpu->dying.first)
> 
>     mov    %edx,0x8(%rax)     <-- trapping instruction, RAX == 0
> 
> 
> So RAX, which has been set to pcpu->dying.first, is NULL. The
> pcpu->dying list is thus corrupted.
> 
> I'm not familiar with nf_conntrack_core.c and I still can't figure out
> how we could end up with pcpu->dying.first being a NULL pointer.
> 
> Probably there's a race somewhere and activating turbo mode makes it
> more likely to happen. But still, I'd expect a locking or RCU issue to
> give us garbage here, not a plain NULL pointer.
> 
> Does anyone have a clue?

Maybe it will give any hints, and why not that many people facing
This server is NUMA (2xE5 cpu), with dual ixgbe cards (2x10G), with rps 
enabled
I tried to change timers from TSC to HPET, didnt helped at all.
I noticed similar (but not same place) error happen on this load on 
older kernel as well, latest stable 4.4 on that moment.
If i will gain a bit stability for few days, i will try to turn off one 
of CPU to make it single NUMA node, for testing, if NUMA affects anyhow, 
but it might just increase chance of race condition as turbo.