linux-kernel - Re: Which came first, hard kernel lockup or SATA errors?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1507614301-56408-1-git-send-email-eswierk@skyportsystems.com>
Date:   Mon,  9 Oct 2017 22:45:01 -0700
From:   Ed Swierk <eswierk@...portsystems.com>
To:     eswierk@...portsystems.com, linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: Which came first, hard kernel lockup or SATA errors?

On Fri, Oct 6, 2017 at 6:25 PM, Ed Swierk <eswierk@...portsystems.com> wrote:
> I'm trying to untangle a series of problems that suddenly occurred on
> a dual-socket Xeon server system that had been running a bunch of KVM
> workloads just fine for over 6 weeks (4.4.52-grsec kernel,
> Debian-derived userspace).

I think I've narrowed the hard lockup to a race between
task_numa_migrate() and idle_balance().

Here are the two lockup messages again, this time with approximate
code references.

[3851435.777762] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
[3851435.784582] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
[3851435.874884] irq event stamp: 3828060294
[3851435.879398] hardirqs last  enabled at (3828060293): [<ffffffff810db86c>] ffffffff810db86c
[3851435.888810] hardirqs last disabled at (3828060294): [<ffffffff8170ff74>] ffffffff8170ff74
[3851435.898254] softirqs last  enabled at (3828060156): [<ffffffff8107f31e>] ffffffff8107f31e
[3851435.907666] softirqs last disabled at (3828060139): [<ffffffff8107f648>] ffffffff8107f648
[3851435.917094] CPU: 13 PID: 2146 Comm: pocviexg Not tainted 4.4.52-grsec #1
[3851435.925252] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
[3851435.937473] task: ffff881c9ff35c00 ti: ffff881c9ff372b0 task.ti: ffff881c9ff372b0
[3851435.946104] RIP: 0010:[<ffffffff810ad854>]  [<ffffffff810ad854>] try_to_wake_up+0xe4/0x500
[3851435.955614] RSP: 0000:ffffc90016beba40  EFLAGS: 00000002
[3851435.961780] RAX: 0000000000000000 RBX: ffff88102cc8e3d8 RCX: 0000000000000001
[3851435.970016] RDX: 0000000000000001 RSI: 00000000000001e4 RDI: 0000000000000000
[3851435.978259] RBP: ffffc90016beba80 R08: 0000000000000000 R09: 0000000000000001
[3851435.986493] R10: 00000000000000e0 R11: ffff881c9ff35c00 R12: 0000000000000082
[3851435.994727] R13: ffff88102cc8dc00 R14: 00000000001d3f00 R15: ffffffff81f25060
[3851436.002962] FS:  00006e79227fc700(0000) GS:ffff882034000000(0000) knlGS:0000000000000000
[3851436.012266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3851436.018920] CR2: 00006e79227fba38 CR3: 000000201d2fe000 CR4: 00000000003626f0
[3851436.027155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3851436.035391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3851436.043626] Stack:
[3851436.046098]  ffff8820341ce620 ffff8820341ce608 0000000000000000 ffff8820341ce600
[3851436.054695]  ffffc90016bebb48 000000000000000d 0000000000000000 ffff881033fce600
[3851436.063259]  ffffc90016beba90 ffffffff810adc80 ffffc90016bebbe8 ffffffff8112c830
[3851436.071868] Call Trace:
[3851436.074862]  [<ffffffff810adc80>] wake_up_process+0x10/0x20                https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1997
[3851436.081317]  [<ffffffff8112c830>] stop_two_cpus+0x1b0/0x280                https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L235
[3851436.087778]  [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
[3851436.094237]  [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
[3851436.100715]  [<ffffffff810ad560>] ? __migrate_swap_task.part.94+0x70/0x70
[3851436.108539]  [<ffffffff810ade01>] migrate_swap+0xf1/0x1f0                  https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1408
[3851436.114800]  [<ffffffff810b9d55>] task_numa_migrate+0x345/0xb60            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1572
[3851436.121656]  [<ffffffff811092e3>] ? futex_wait_queue_me+0xf3/0x160
[3851436.128801]  [<ffffffff810ba5e3>] numa_migrate_preferred+0x73/0x80         https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1597
[3851436.135936]  [<ffffffff810bb944>] task_numa_fault+0x7a4/0xee0              https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L2137
[3851436.142604]  [<ffffffff810bafe0>] ? should_numa_migrate_memory+0x50/0x130
[3851436.150437]  [<ffffffff811b2db6>] handle_mm_fault+0xfd6/0x1e10             https://github.com/skyportsystems/linux/blob/4.4.52-grsec/mm/memory.c#L3503
[3851436.157189]  [<ffffffff811b1e28>] ? handle_mm_fault+0x48/0x1e10
[3851436.164051]  [<ffffffff810a85aa>] ? ___might_sleep+0x1ea/0x220
[3851436.170804]  [<ffffffff8106405f>] __do_page_fault+0x1bf/0x460
[3851436.177475]  [<ffffffff8110ca44>] ? sys_futex+0x74/0x230
[3851436.183642]  [<ffffffff8106432c>] do_page_fault+0x2c/0x40
[3851436.189909]  [<ffffffff817128b8>] page_fault+0x28/0x30
[3851436.195892] Code: 0f 85 e0 02 00 00 41 8b 85 b8 16 00 00 89 45 d0 41 8b 7d 4c 49 c7 c6 00 3f 1d 00 85 ff 0f 85 68 01 00 00 41 8b 4d 30 85 c9 74 0a <f3> 90 41 8b 55 30 85 d2 75 f6 49 8b 45 00 a8 02 b8 00 00 00 00

[3851438.906435] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[3851438.913158] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
[3851439.003452] irq event stamp: 25935634
[3851439.007769] hardirqs last  enabled at (25935633): [<ffffffff81710341>] ffffffff81710341
[3851439.017010] hardirqs last disabled at (25935634): [<ffffffff8170ff74>] ffffffff8170ff74
[3851439.026226] softirqs last  enabled at (25935520): [<ffffffff8107f31e>] ffffffff8107f31e
[3851439.035441] softirqs last disabled at (25935445): [<ffffffff8107f648>] ffffffff8107f648
[3851439.044657] CPU: 0 PID: 10 Comm: migration/0 Not tainted 4.4.52-grsec #1
[3851439.052892] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
[3851439.065111] task: ffff88102cc8dc00 ti: ffff88102cc8f2b0 task.ti: ffff88102cc8f2b0
[3851439.073747] RIP: 0010:[<ffffffff813c4935>]  [<ffffffff813c4935>] delay_tsc+0x25/0x70
[3851439.082679] RSP: 0000:ffffc900006ebb40  EFLAGS: 00000002
[3851439.088852] RAX: 000000002591f96b RBX: ffff8820341ce608 RCX: 0000000000000018
[3851439.101462] RDX: 001cab2e00000000 RSI: 001cab2e2591f96b RDI: 0000000000000001
[3851439.109695] RBP: ffffc900006ebb40 R08: 0000000000000000 R09: 0000000000000000
[3851439.117950] R10: 0000000000000038 R11: 0000000000000000 R12: 0000000021d37081
[3851439.126198] R13: 0000000000000001 R14: 000000007d234b28 R15: 0000000000000000
[3851439.134440] FS:  0000000000000000(0000) GS:ffff881033e00000(0000) knlGS:0000000000000000
[3851439.143741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3851439.150393] CR2: 000000c420148008 CR3: 0000000001f36000 CR4: 00000000003626f0
[3851439.158627] Stack:
[3851439.161096]  ffffc900006ebb50 ffffffff813c487a ffffc900006ebb80 ffffffff810dba87
[3851439.169669]  ffff8820341ce608 ffff8820341ce620 0000000000000086 00000000000006ae
[3851439.178228]  ffffc900006ebbb0 ffffffff817101a2 ffffffff8112c1cb ffff8820341ce600
[3851439.186825] Call Trace:
[3851439.189784]  [<ffffffff813c487a>] __delay+0xa/0x10
[3851439.195385]  [<ffffffff810dba87>] do_raw_spin_lock+0x87/0x160
[3851439.202062]  [<ffffffff817101a2>] _raw_spin_lock_irqsave+0xa2/0xb0
[3851439.209238]  [<ffffffff8112c1cb>] ? cpu_stop_queue_work+0x2b/0x70
[3851439.216287]  [<ffffffff8112c1cb>] cpu_stop_queue_work+0x2b/0x70            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L89
[3851439.223149]  [<ffffffff8112c92b>] stop_one_cpu_nowait+0x2b/0x30            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L310
[3851439.229999]  [<ffffffff810c1127>] load_balance+0x827/0x960                 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L7184
[3851439.236371]  [<ffffffff810eae3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[3851439.243995]  [<ffffffff810c1608>] pick_next_task_fair+0x3a8/0x750          https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L5348
[3851439.251062]  [<ffffffff810c14db>] ? pick_next_task_fair+0x27b/0x750
[3851439.258289]  [<ffffffff81708fdd>] __schedule+0x1ad/0xbd0                   https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3196
[3851439.264456]  [<ffffffff810d461d>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[3851439.272204]  [<ffffffff81709a37>] schedule+0x37/0x80
[3851439.277984]  [<ffffffff810a51cb>] smpboot_thread_fn+0x16b/0x1e0
[3851439.284822]  [<ffffffff810a5060>] ? sort_range+0x20/0x20
[3851439.291001]  [<ffffffff810a124c>] kthread+0xfc/0x120
[3851439.296796]  [<ffffffff8170adf1>] ? wait_for_completion+0xe1/0x110
[3851439.303936]  [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
[3851439.311464]  [<ffffffff81710aae>] ret_from_fork+0x3e/0x70
[3851439.317738]  [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
[3851439.325281] Code: 00 5d c3 0f 1f 00 55 65 ff 05 b8 65 c4 7e 48 89 e5 65 44 8b 05 35 48 c4 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 <0f> 31 48 c1 e2 20 48 09 d0 48 89 c1 48 29 f1 48 39 cf 76 27 65

Based on the addresses in the stack and registers, here's what I think
happened.

On cpu 13:

- task_numa_fault() calls task_numa_migrate(), which selects the task
  on cpu 0 as the dst_task.
- migrate_swap() calls stop_two_cpus(), which acquires the cpu_stopper
  locks for the dst_cpu (cpu 0, at 0xffff881033fce600) and src_cpu
  (cpu X, at 0xffff8820341ce600).
- stop_two_cpus() calls wake_up_process() on the lower-numbered cpu
  first, which has to be cpu 0.
- wake_up_process() spins until the cpu 0 task (at 0xffff88102cc8dc00)
  is no longer on_cpu.

On cpu 0:

- pick_next_task_fair() calls idle_balance(). According to the "This
  is OK" comment, current is on_cpu at this point.
- idle_balance() calls load_balance() for dst_cpu 0.
- load_balance() decides to move a task from cpu X, so calls
  stop_one_cpu_nowait() on cpu X.
- stop_one_cpu_nowait() spins trying to acquire the cpu_stopper lock
  for cpu X (at 0xffff8820341ce600).

So idle_balance() on cpu 0 is stuck waiting for task_numa_fault() to
move a task to cpu 0, which is blocked on idle_balance() completing.

Meanwhile, the SATA controller interrupt handler, which is placed on
cpu 0, doesn't get a chance to run, causing the command failure
messages from ata7 and ata8.

Is this a plausible theory?

Any hints on reproducing such a scenario, by adding delays or other
hacks to the code?

--Ed