linux-kernel - Re: sched/fair: Hard lockup from idle_balance()/task_numa

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1508992510-116718-1-git-send-email-eswierk@skyportsystems.com>
Date:   Wed, 25 Oct 2017 21:35:10 -0700
From:   Ed Swierk <eswierk@...portsystems.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Ed Swierk <eswierk@...portsystems.com>
Subject: Re: sched/fair: Hard lockup from idle_balance()/task_numa_migrate() race

Ping?

On Mon, Oct 16, 2017 at 4:11 PM, Ed Swierk <eswierk@...portsystems.com> wrote:
> 
> Ping for Peter, Ingo and other sched maintainers:
> 
> I'd appreciate any feedback on this hard lockup issue, which occurred
> on a system running kernel 4.4.52-grsec.
> 
> To recap: a dual-socket Xeon (E5 v4) server system had been running a
> bunch of KVM workloads just fine for over 6 weeks. Suddenly hard
> lockups occurred on cpu 13 in task_numa_migrate(), and cpu 0 in
> idle_balance(). That condition persisted until the system was rebooted
> after a couple of hours, and meanwhile all manner of soft lockups and
> even SATA errors appeared.
> 
> [3851435.777762] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
> [3851435.784582] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> [3851435.874884] irq event stamp: 3828060294
> [3851435.879398] hardirqs last  enabled at (3828060293): [<ffffffff810db86c>] ffffffff810db86c
> [3851435.888810] hardirqs last disabled at (3828060294): [<ffffffff8170ff74>] ffffffff8170ff74
> [3851435.898254] softirqs last  enabled at (3828060156): [<ffffffff8107f31e>] ffffffff8107f31e
> [3851435.907666] softirqs last disabled at (3828060139): [<ffffffff8107f648>] ffffffff8107f648
> [3851435.917094] CPU: 13 PID: 2146 Comm: pocviexg Not tainted 4.4.52-grsec #1
> [3851435.925252] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> [3851435.937473] task: ffff881c9ff35c00 ti: ffff881c9ff372b0 task.ti: ffff881c9ff372b0
> [3851435.946104] RIP: 0010:[<ffffffff810ad854>]  [<ffffffff810ad854>] try_to_wake_up+0xe4/0x500
> [3851435.955614] RSP: 0000:ffffc90016beba40  EFLAGS: 00000002
> [3851435.961780] RAX: 0000000000000000 RBX: ffff88102cc8e3d8 RCX: 0000000000000001
> [3851435.970016] RDX: 0000000000000001 RSI: 00000000000001e4 RDI: 0000000000000000
> [3851435.978259] RBP: ffffc90016beba80 R08: 0000000000000000 R09: 0000000000000001
> [3851435.986493] R10: 00000000000000e0 R11: ffff881c9ff35c00 R12: 0000000000000082
> [3851435.994727] R13: ffff88102cc8dc00 R14: 00000000001d3f00 R15: ffffffff81f25060
> [3851436.002962] FS:  00006e79227fc700(0000) GS:ffff882034000000(0000) knlGS:0000000000000000
> [3851436.012266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [3851436.018920] CR2: 00006e79227fba38 CR3: 000000201d2fe000 CR4: 00000000003626f0
> [3851436.027155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [3851436.035391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [3851436.043626] Stack:
> [3851436.046098]  ffff8820341ce620 ffff8820341ce608 0000000000000000 ffff8820341ce600
> [3851436.054695]  ffffc90016bebb48 000000000000000d 0000000000000000 ffff881033fce600
> [3851436.063259]  ffffc90016beba90 ffffffff810adc80 ffffc90016bebbe8 ffffffff8112c830
> [3851436.071868] Call Trace:
> [3851436.074862]  [<ffffffff810adc80>] wake_up_process+0x10/0x20                https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1997
> [3851436.081317]  [<ffffffff8112c830>] stop_two_cpus+0x1b0/0x280                https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L235
> [3851436.087778]  [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> [3851436.094237]  [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> [3851436.100715]  [<ffffffff810ad560>] ? __migrate_swap_task.part.94+0x70/0x70
> [3851436.108539]  [<ffffffff810ade01>] migrate_swap+0xf1/0x1f0                  https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1408
> [3851436.114800]  [<ffffffff810b9d55>] task_numa_migrate+0x345/0xb60            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1572
> [3851436.121656]  [<ffffffff811092e3>] ? futex_wait_queue_me+0xf3/0x160
> [3851436.128801]  [<ffffffff810ba5e3>] numa_migrate_preferred+0x73/0x80         https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1597
> [3851436.135936]  [<ffffffff810bb944>] task_numa_fault+0x7a4/0xee0              https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L2137
> [3851436.142604]  [<ffffffff810bafe0>] ? should_numa_migrate_memory+0x50/0x130
> [3851436.150437]  [<ffffffff811b2db6>] handle_mm_fault+0xfd6/0x1e10             https://github.com/skyportsystems/linux/blob/4.4.52-grsec/mm/memory.c#L3503
> [3851436.157189]  [<ffffffff811b1e28>] ? handle_mm_fault+0x48/0x1e10
> [3851436.164051]  [<ffffffff810a85aa>] ? ___might_sleep+0x1ea/0x220
> [3851436.170804]  [<ffffffff8106405f>] __do_page_fault+0x1bf/0x460
> [3851436.177475]  [<ffffffff8110ca44>] ? sys_futex+0x74/0x230
> [3851436.183642]  [<ffffffff8106432c>] do_page_fault+0x2c/0x40
> [3851436.189909]  [<ffffffff817128b8>] page_fault+0x28/0x30
> [3851436.195892] Code: 0f 85 e0 02 00 00 41 8b 85 b8 16 00 00 89 45 d0 41 8b 7d 4c 49 c7 c6 00 3f 1d 00 85 ff 0f 85 68 01 00 00 41 8b 4d 30 85 c9 74 0a <f3> 90 41 8b 55 30 85 d2 75 f6 49 8b 45 00 a8 02 b8 00 00 00 00
> 
> [3851438.906435] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
> [3851438.913158] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> [3851439.003452] irq event stamp: 25935634
> [3851439.007769] hardirqs last  enabled at (25935633): [<ffffffff81710341>] ffffffff81710341
> [3851439.017010] hardirqs last disabled at (25935634): [<ffffffff8170ff74>] ffffffff8170ff74
> [3851439.026226] softirqs last  enabled at (25935520): [<ffffffff8107f31e>] ffffffff8107f31e
> [3851439.035441] softirqs last disabled at (25935445): [<ffffffff8107f648>] ffffffff8107f648
> [3851439.044657] CPU: 0 PID: 10 Comm: migration/0 Not tainted 4.4.52-grsec #1
> [3851439.052892] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> [3851439.065111] task: ffff88102cc8dc00 ti: ffff88102cc8f2b0 task.ti: ffff88102cc8f2b0
> [3851439.073747] RIP: 0010:[<ffffffff813c4935>]  [<ffffffff813c4935>] delay_tsc+0x25/0x70
> [3851439.082679] RSP: 0000:ffffc900006ebb40  EFLAGS: 00000002
> [3851439.088852] RAX: 000000002591f96b RBX: ffff8820341ce608 RCX: 0000000000000018
> [3851439.101462] RDX: 001cab2e00000000 RSI: 001cab2e2591f96b RDI: 0000000000000001
> [3851439.109695] RBP: ffffc900006ebb40 R08: 0000000000000000 R09: 0000000000000000
> [3851439.117950] R10: 0000000000000038 R11: 0000000000000000 R12: 0000000021d37081
> [3851439.126198] R13: 0000000000000001 R14: 000000007d234b28 R15: 0000000000000000
> [3851439.134440] FS:  0000000000000000(0000) GS:ffff881033e00000(0000) knlGS:0000000000000000
> [3851439.143741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [3851439.150393] CR2: 000000c420148008 CR3: 0000000001f36000 CR4: 00000000003626f0
> [3851439.158627] Stack:
> [3851439.161096]  ffffc900006ebb50 ffffffff813c487a ffffc900006ebb80 ffffffff810dba87
> [3851439.169669]  ffff8820341ce608 ffff8820341ce620 0000000000000086 00000000000006ae
> [3851439.178228]  ffffc900006ebbb0 ffffffff817101a2 ffffffff8112c1cb ffff8820341ce600
> [3851439.186825] Call Trace:
> [3851439.189784]  [<ffffffff813c487a>] __delay+0xa/0x10
> [3851439.195385]  [<ffffffff810dba87>] do_raw_spin_lock+0x87/0x160
> [3851439.202062]  [<ffffffff817101a2>] _raw_spin_lock_irqsave+0xa2/0xb0
> [3851439.209238]  [<ffffffff8112c1cb>] ? cpu_stop_queue_work+0x2b/0x70
> [3851439.216287]  [<ffffffff8112c1cb>] cpu_stop_queue_work+0x2b/0x70            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L89
> [3851439.223149]  [<ffffffff8112c92b>] stop_one_cpu_nowait+0x2b/0x30            https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L310
> [3851439.229999]  [<ffffffff810c1127>] load_balance+0x827/0x960                 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L7184
> [3851439.236371]  [<ffffffff810eae3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [3851439.243995]  [<ffffffff810c1608>] pick_next_task_fair+0x3a8/0x750          https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L5348
> [3851439.251062]  [<ffffffff810c14db>] ? pick_next_task_fair+0x27b/0x750
> [3851439.258289]  [<ffffffff81708fdd>] __schedule+0x1ad/0xbd0                   https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3196
> [3851439.264456]  [<ffffffff810d461d>] ? trace_hardirqs_on_caller+0x13d/0x1d0
> [3851439.272204]  [<ffffffff81709a37>] schedule+0x37/0x80
> [3851439.277984]  [<ffffffff810a51cb>] smpboot_thread_fn+0x16b/0x1e0
> [3851439.284822]  [<ffffffff810a5060>] ? sort_range+0x20/0x20
> [3851439.291001]  [<ffffffff810a124c>] kthread+0xfc/0x120
> [3851439.296796]  [<ffffffff8170adf1>] ? wait_for_completion+0xe1/0x110
> [3851439.303936]  [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> [3851439.311464]  [<ffffffff81710aae>] ret_from_fork+0x3e/0x70
> [3851439.317738]  [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> [3851439.325281] Code: 00 5d c3 0f 1f 00 55 65 ff 05 b8 65 c4 7e 48 89 e5 65 44 8b 05 35 48 c4 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 <0f> 31 48 c1 e2 20 48 09 d0 48 89 c1 48 29 f1 48 39 cf 76 27 65
> 
> [3851450] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> [3851450] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> [3851450] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
> 
> [3851478] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> [3851478] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> [3851478] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
> 
> [3851482] BUG: spinlock lockup suspected on CPU#0, migration/0/10
> 
> [3851482] Sending NMI to all CPUs:
> [3851482] NMI backtrace for cpu 0
> [3851483] NMI backtrace for cpu 1
> [3851483] NMI backtrace for cpu 2
> [3851483] NMI backtrace for cpu 3
> [3851483] NMI backtrace for cpu 4
> 
> [3851484] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
> [3851484] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
> 
> [3851484] NMI backtrace for cpu 5
> [3851484] NMI backtrace for cpu 6
> ...
> [3851489] NMI backtrace for cpu 24
> 
> [3851489] ata8.00: qc timeout (cmd 0xec) **
> [3851489] ata7.00: qc timeout (cmd 0xec) **
> 
> [3851489] NMI backtrace for cpu 25
> ...
> [3851490] NMI backtrace for cpu 31
> 
> [3851499] ata8.00: qc timeout (cmd 0xec) **
> [3851499] ata7.00: qc timeout (cmd 0xec) **
> 
> [3851510] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> ...and so on for 2 more hours, until the system was finally rebooted
> 
> Based on the addresses in the stack and registers, here's what I think
> happened.
> 
> On cpu 13:
> 
> - task_numa_fault() calls task_numa_migrate(), which selects the task
>   on cpu 0 as the dst_task.
> - migrate_swap() calls stop_two_cpus(), which acquires the cpu_stopper
>   locks for the dst_cpu (cpu 0, at 0xffff881033fce600) and src_cpu
>   (cpu X, at 0xffff8820341ce600).
> - stop_two_cpus() calls wake_up_process() on the lower-numbered cpu
>   first, which has to be cpu 0.
> - wake_up_process() spins until the cpu 0 task (at 0xffff88102cc8dc00)
>   is no longer on_cpu.
> 
> On cpu 0:
> 
> - pick_next_task_fair() calls idle_balance(). According to the "This
>   is OK" comment, current is on_cpu at this point.
> - idle_balance() calls load_balance() for dst_cpu 0.
> - load_balance() decides to move a task from cpu X, so calls
>   stop_one_cpu_nowait() on cpu X.
> - stop_one_cpu_nowait() spins trying to acquire the cpu_stopper lock
>   for cpu X (at 0xffff8820341ce600).
> 
> So idle_balance() on cpu 0 is stuck waiting for task_numa_fault() to
> move a task to cpu 0, which is blocked on idle_balance() completing.
> 
> Meanwhile, the SATA controller interrupt handler, which is placed on
> cpu 0, doesn't get a chance to run, causing the command failure
> messages from ata7 and ata8.
> 
> Also, it appears that task_numa_fault() tries to migrate current, so
> the src_cpu X used by task_numa_migrate() is cpu 13 in this
> case. Though the key issue is that both task_numa_migrate() and
> idle_balance() are trying to stop the same cpu, regardless of whether
> it's the cpu task_numa_migrate() is running on.
> 
> Is this a plausible theory?
> 
> This is the only instance of a hard lockup among a number of identical
> systems running the same kernel, so I'm not seeing an easy way to
> reproduce this issue, but any suggestions would be welcome.

--Ed