Date:   Wed, 24 Jan 2018 15:59:11 -0800
From:   Ben Greear <greearb@...delatech.com>
To:     David Ahern <dsahern@...il.com>, Michal Kubecek <mkubecek@...e.cz>
Cc:     Cong Wang <xiyou.wangcong@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+

On 06/20/2017 08:03 PM, David Ahern wrote:
> On 6/20/17 5:41 PM, Ben Greear wrote:
>> On 06/20/2017 11:05 AM, Michal Kubecek wrote:
>>> On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:
>>>> On 06/14/2017 03:25 PM, David Ahern wrote:
>>>>> On 6/14/17 4:23 PM, Ben Greear wrote:
>>>>>> On 06/13/2017 07:27 PM, David Ahern wrote:
>>>>>>
>>>>>>> Let's try a targeted debug patch. See attached
>>>>>>
>>>>>> I had to change it to pr_err so it would go to our serial console,
>>>>>> since the system locked hard on crash, and that appears to be enough
>>>>>> to change the timing where we can no longer reproduce the problem.
>>>>>
>>>>>
>>>>> ok, let's figure out which one is doing that. There are 3 debug
>>>>> statements. I suspect fib6_del_route is the one setting the state to
>>>>> FWS_U. Can you remove the debug prints in fib6_repair_tree and
>>>>> fib6_walk_continue and try again?
>>>>
>>>> We cannot reproduce with just that one printf in the kernel either.  It
>>>> must change the timing too much to trigger the bug.
>>>
>>> You might try trace_printk() which should have less impact (don't forget
>>> to enable /proc/sys/kernel/ftrace_dump_on_oops).
>>
>> We cannot reproduce with trace_printk() either.
>
> I think that suggests the walker state is set to FWS_U in
> fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
> triggers the fault -- the null parent (pn = fn->parent). So we have the
> 2 areas of code that are interacting.
>
> I'm on a road trip through the end of this week with little time to
> focus on this problem. I'll get back to you with another suggestion when
> I can.

So, though I don't know the right way to fix it, the patch below appears
to keep the system from crashing.


diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 68b9cc7..bf19a14 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
                         pn = fn->parent;
                         w->node = pn;
  #ifdef CONFIG_IPV6_SUBTREES
+                       if (WARN_ON_ONCE(!pn)) {
+                               pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
+                                      w, fn, pn);
+                               /* Attempt to work around crash that has been here forever. --Ben */
+                               return 0;
+                       }
                         if (FIB6_SUBTREE(pn) == fn) {
                                 WARN_ON(!(fn->fn_flags & RTN_ROOT));
                                 w->state = FWS_L;
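
For anyone following along without the source open, here is a rough
user-space sketch of the failure mode discussed above. It is not the kernel
code. Only the FWS_L/FWS_R/FWS_C/FWS_U state names and the
"pn = fn->parent" step come from this thread; struct node, struct walker,
walk_continue(), has_leaf, visited and orphan_hits are hypothetical names
made up for illustration, and subtrees, pruning and locking are left out.
The idea is that a walker resumed in the "up" state on a node whose parent
pointer has been cleared would dereference NULL unless it bails out first.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the walker states named in this thread. */
enum walk_state { FWS_L, FWS_R, FWS_C, FWS_U };

/* Hypothetical, stripped-down tree node (not struct fib6_node). */
struct node {
    struct node *parent;
    struct node *left;
    struct node *right;
    int has_leaf;               /* stand-in for "node carries routes" */
};

/* Hypothetical, stripped-down walker (not struct fib6_walker). */
struct walker {
    struct node *root;
    struct node *node;          /* node the walker is currently on */
    enum walk_state state;
    int visited;                /* nodes with has_leaf set that were visited */
    int orphan_hits;            /* times FWS_U found a NULL parent */
};

/*
 * Resume the walk from w->node in w->state.
 * Returns 0 when the walk ends (normally or via the NULL-parent guard),
 * -1 if the current node is not a child of its parent.
 */
int walk_continue(struct walker *w)
{
    assert(w != NULL);

    for (;;) {
        struct node *fn = w->node;
        struct node *pn;

        if (!fn)
            return 0;

        switch (w->state) {
        case FWS_L:             /* descend left if possible */
            if (fn->left) {
                w->node = fn->left;
                w->state = FWS_L;
                continue;
            }
            w->state = FWS_R;
            /* fall through */
        case FWS_R:             /* descend right if possible */
            if (fn->right) {
                w->node = fn->right;
                w->state = FWS_L;
                continue;
            }
            w->state = FWS_C;
            /* fall through */
        case FWS_C:             /* process the current node */
            if (fn->has_leaf)
                w->visited++;
            w->state = FWS_U;
            /* fall through */
        case FWS_U:             /* climb to the parent */
            if (fn == w->root)
                return 0;
            pn = fn->parent;
            w->node = pn;
            if (!pn) {
                /* Guard: without it, pn->left below would fault. */
                w->orphan_hits++;
                return 0;
            }
            if (pn->left == fn) {
                w->state = FWS_R;
                continue;
            }
            if (pn->right == fn) {
                w->state = FWS_C;
                continue;
            }
            return -1;          /* fn is not a child of pn */
        }
    }
}
```

Starting a walker on a small three-node tree in FWS_L visits both children
and ends at the root. Parking a second walker on a child in FWS_U and then
clearing that child's parent pointer makes walk_continue() take the guard
path and stop instead of touching pn. Like the patch, this only ends the
walk early; it says nothing about why the parent was cleared under the
walker in the first place.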



The printout looks like this (seen while adding 4000 mac-vlans, so it is pretty rare).  pn is definitely NULL sometimes:

[root@...6n ~]# journalctl -f|grep FWS
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea121ba0  fn: ffff880856a09260  pn:           (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: ffff8807e3963de0  fn: ffff880856a09260  pn:           (null)
Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: ffff88081ac22de0  fn: ffff880856a09260  pn:           (null)
Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: ffff8808290c69c0  fn: ffff8807e369f920  pn:           (null)
Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea3156c0  fn: ffff88082d1eeb60  pn:           (null)



Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1006
Jan 24 15:48:05 2u-6n kernel: ------------[ cut here ]------------
Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+0x154/0x1b0 [ipv6]
Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath i2c_i801 mac80211 joydev lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core ipv6 crc_ccitt
Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G           O    4.13.16+ #22
Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
Jan 24 15:48:05 2u-6n kernel: task: ffff8807e9ef1dc0 task.stack: ffffc9002083c000
Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 [ipv6]
Jan 24 15:48:05 2u-6n kernel: RSP: 0018:ffffc9002083fbc0 EFLAGS: 00010246
Jan 24 15:48:05 2u-6n kernel: RAX: 0000000000000000 RBX: ffff8807ea121ba0 RCX: 0000000000000000
Jan 24 15:48:05 2u-6n kernel: RDX: ffff880856a09260 RSI: ffffc9002083fc00 RDI: ffffffff81ef2140
Jan 24 15:48:05 2u-6n kernel: RBP: ffffc9002083fbc8 R08: 0000000000000008 R09: ffff8807e36f6b25
Jan 24 15:48:05 2u-6n kernel: R10: ffffc9002083fb70 R11: 0000000000000000 R12: 0000000000000002
Jan 24 15:48:05 2u-6n kernel: R13: 0000000000000002 R14: ffff8807ea121ba0 R15: ffff8807ebcc8d80
Jan 24 15:48:05 2u-6n kernel: FS:  00007f77a5d0f700(0000) GS:ffff88087fd40000(0000) knlGS:0000000000000000
Jan 24 15:48:05 2u-6n kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 15:48:05 2u-6n kernel: CR2: 0000000003d56c88 CR3: 00000007f3106000 CR4: 00000000003406e0
Jan 24 15:48:05 2u-6n kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 24 15:48:05 2u-6n kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 24 15:48:05 2u-6n kernel: Call Trace:
Jan 24 15:48:05 2u-6n kernel:  inet6_dump_fib+0x1ab/0x2a0 [ipv6]
Jan 24 15:48:05 2u-6n kernel:  netlink_dump+0x22c/0x2b0
Jan 24 15:48:05 2u-6n kernel:  netlink_recvmsg+0x260/0x3f0
Jan 24 15:48:05 2u-6n kernel:  sock_recvmsg+0x38/0x40
Jan 24 15:48:05 2u-6n kernel:  ___sys_recvmsg+0xe9/0x230
Jan 24 15:48:05 2u-6n kernel:  ? alloc_pages_vma+0x83/0x1e0
Jan 24 15:48:05 2u-6n kernel:  ? page_add_new_anon_rmap+0x88/0xc0
Jan 24 15:48:05 2u-6n kernel:  ? lru_cache_add_active_or_unevictable+0x31/0xb0
Jan 24 15:48:05 2u-6n kernel:  ? __handle_mm_fault+0x5e5/0xfa0
Jan 24 15:48:05 2u-6n kernel:  __sys_recvmsg+0x3d/0x70
Jan 24 15:48:05 2u-6n kernel:  ? __sys_recvmsg+0x3d/0x70
Jan 24 15:48:05 2u-6n kernel:  SyS_recvmsg+0xd/0x20
Jan 24 15:48:05 2u-6n kernel:  do_syscall_64+0x56/0xc0
Jan 24 15:48:05 2u-6n kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Jan 24 15:48:05 2u-6n kernel: RIP: 0033:0x7f77a5644030
Jan 24 15:48:05 2u-6n kernel: RSP: 002b:00007ffc3e783e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
Jan 24 15:48:05 2u-6n kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f77a5644030
Jan 24 15:48:05 2u-6n kernel: RDX: 0000000000000000 RSI: 00007ffc3e783ed0 RDI: 0000000000000004
Jan 24 15:48:05 2u-6n kernel: RBP: 00007ffc3e787ef4 R08: 0000000000003fe4 R09: 0
Jan 24 15:48:05 2u-6n kernel: R10: 00007ffc3e783f10 R11: 0000000000000246 R12: 000000000064f360
Jan 24 15:48:05 2u-6n kernel: R13: 00007ffc3e787f60 R14: 0000000000003fe4 R15: 0000000000000000
Jan 24 15:48:05 2u-6n kernel: Code: ff 24 c5 a8 e5 04 a0 f6 42 2a 02 74 68 c7 43 28 01 00 00 00 48 89 c2 e9 c7 fe ff ff c7 43 28 02 00 00 00 48 89 c2 e9 b8 fe ff ff <0f> ff 31 c9 48 89 de 48 c7 c7 78 36 05 a0 e8 65 e4 14 e1 31 c0
Jan 24 15:48:05 2u-6n kernel: ---[ end trace 1d1c7028c9dec459 ]---
Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea121ba0  fn: ffff880856a09260  pn:           (null)
Jan 24 15:48:05 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1008
Jan 24 15:48:05 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1009
....

Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
