Message-ID: <20150416110423.GA15760@gmail.com>
Date:	Thu, 16 Apr 2015 13:04:23 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Chris J Arges <chris.j.arges@...onical.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <chris.j.arges@...onical.com> wrote:

> Ingo,
> 
> Below are the patches and data I've gathered from the reproducer. My 
> methodology was as described previously; however, I used gdb on the 
> qemu process to set a breakpoint in L1 once the hang was detected. 
> This made dumping the kvm_lapic structures on L0 more reliable.

Thanks!

So I have trouble interpreting the L1 backtrace, because it shows 
something entirely new (to me).

First, let's clarify the terminology, to make sure I have the workload 
right:

 - L0 is the host kernel, running native Linux. It's not locking up.

 - L1 is the guest kernel, running virtualized Linux. This is the one 
   that is locking up.

 - L2 is the nested guest kernel, running whatever test workload you 
   were running - this is obviously locking up together with L1.

Right?

So with that cleared up, the backtrace on L1 looks like this:

> * Crash dump backtrace from L1:
> 
> crash> bt -a
> PID: 26     TASK: ffff88013a4f1400  CPU: 0   COMMAND: "ksmd"
>  #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
>  #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
>  #2 [ffff88013a503b20] panic at ffffffff818068e0
>  #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
>  #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
>  #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
>  #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
>  #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
>  #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
>  #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98

So that task, running on VCPU0, is trying to send an IPI to VCPU1 - 
and here is what VCPU1 is doing:

> PID: 1674   TASK: ffff8800ba4a9e00  CPU: 1   COMMAND: "qemu-system-x86"
>  #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
>  #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
>  #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
>  #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
>  #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
>     [exception RIP: vmx_vcpu_run+992]
>     RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082
>     RAX: 0000000080000202  RBX: ffff880139b30000  RCX: ffff880139b30000
>     RDX: 0000000000000200  RSI: ffff880139b30000  RDI: ffff880139b30000
>     RBP: ffff88013940bd28   R8: 00007fe192b71110   R9: 00007fe192b71140
>     R10: 00007fff66407d00  R11: 00007fe1927f0060  R12: 0000000000000000
>     R13: 0000000000000001  R14: 0000000000000001  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
>  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
>  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
>  #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
>  #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d

So the problem, as far as I can see, is that L1's VCPU1 appears to be 
looping with interrupts disabled:

>     RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082

Note how RFLAGS does not have bit 0x200 (IF, the interrupt flag) set - 
so this vCPU is executing with interrupts disabled.
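
(To make the bit arithmetic explicit - X86_EFLAGS_IF is the kernel's 
name for bit 9 of RFLAGS, and the value below is taken straight from 
the dump; this is just an illustration, not kernel code:)

	/* Does the dumped RFLAGS value have the interrupt flag set? */
	#include <stdio.h>

	#define X86_EFLAGS_IF	0x0200UL	/* bit 9: interrupts enabled */

	int main(void)
	{
		unsigned long rflags = 0x00000082;	/* RFLAGS from the L1 dump */

		/* 0x82 & 0x200 == 0, so this prints "IF clear" */
		printf("IF %s\n", (rflags & X86_EFLAGS_IF) ? "set" : "clear");
		return 0;
	}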

That is why the IPI does not get through to it, while kdump's NMI - 
which is not masked by the interrupt flag - had no problem getting 
through.
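
For reference, the sender side of that IPI is a synchronous spin on 
the CSD lock - a simplified sketch of the kernel/smp.c path of that 
era, not the literal code:

	/* Simplified sketch of the synchronous smp_call path (kernel/smp.c) */
	static void csd_lock_wait(struct call_single_data *csd)
	{
		/* Spin until the target CPU has run the callback and unlocked. */
		while (csd->flags & CSD_FLAG_LOCK)
			cpu_relax();
	}

	/*
	 * generic_exec_single() queues the csd on the target CPU, kicks it
	 * with an IPI and, for synchronous calls, ends up in csd_lock_wait().
	 * If the target never takes the IPI - interrupts off forever - the
	 * sender spins here indefinitely: that is frame #3 of the ksmd trace.
	 */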

This (assuming all backtraces are exact!):

>  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
>  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
>  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad

suggests that we called vmx_vcpu_run() from 
vmx_handle_external_intr(), and that we are executing L2 guest code 
with interrupts disabled.
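
(For context, the normal entry path runs with interrupts disabled 
across the VM entry - schematically something like the sketch below, 
which follows the shape of vcpu_enter_guest() rather than the literal 
source:)

	/* Schematic of the KVM entry path; names simplified. */
	static int vcpu_enter_guest_sketch(struct kvm_vcpu *vcpu)
	{
		local_irq_disable();		/* IRQs off before VM entry */

		kvm_x86_ops->run(vcpu);		/* vmx_vcpu_run(): enter the guest */

		/*
		 * With "acknowledge interrupt on exit", the host interrupt
		 * that caused the exit is dispatched by
		 * vmx_handle_external_intr(), which re-enables interrupts
		 * when done - so an NMI backtrace taken in this window shows
		 * VMX code running with IF clear.
		 */
		kvm_x86_ops->handle_external_intr(vcpu);
		return 0;
	}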

How is this supposed to work? What mechanism does KVM have against an 
(untrusted) guest interrupt handler locking up?

I might be misunderstanding how this works at the KVM level, but from 
the APIC perspective the situation appears to be pretty clear: CPU1's 
interrupts are turned off, so it cannot receive IPIs, and the CSD wait 
on CPU0 will eventually time out.
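
That timeout is what the patch in the Subject line is about: turning 
the silent infinite spin into a bounded wait that complains. A rough 
sketch of the idea - names and details differ from the actual patch:

	/* Sketch only: bound the CSD spin and warn instead of hanging quietly. */
	static void csd_lock_wait_timeout(struct call_single_data *csd)
	{
		unsigned long deadline = jiffies + 10 * HZ;	/* arbitrary bound */

		while (csd->flags & CSD_FLAG_LOCK) {
			if (time_after(jiffies, deadline)) {
				pr_warn("csd: CSD lock stuck - target CPU not responding to IPI?\n");
				deadline = jiffies + 10 * HZ;	/* warn periodically */
			}
			cpu_relax();
		}
	}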

Now obviously it appears to be anomalous (assuming my analysis is 
correct) that the interrupt handler has locked up, but it's 
immaterial: a nested kernel must not allow its guest to lock it up.

Thanks,

	Ingo
