linux-kernel - Re: [PATCH] smp/call: Detect stuck CSD locks


Message-ID: <20150413061450.GA10857@gmail.com>
Date:	Mon, 13 Apr 2015 08:14:51 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Chris J Arges <chris.j.arges@...onical.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

* Chris J Arges <chris.j.arges@...onical.com> wrote:

> /sys/module/kvm_intel/parameters/enable_apicv on the affected 
> hardware is not enabled, and unfortunately my hardware doesn't have 
> the necessary features to enable it. So we are dealing with KVM's 
> lapic implementation only.

That's actually pretty fortunate, as we don't have to worry about 
hardware state nearly as much!

> FYI, I'm working on getting better data at the moment and here is my approach:
> * For the L0 kernel:
>  - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
>    the addresses of various useful structures)
>  - Set up crash to live dump kvm_lapic structures and associated registers for
>    both vCPUs

It would also be nice to double check the stuck vCPU's normal CPU 
state: is it truly able to receive interrupts? (IRQ flags are on, or 
is it sitting in the idle loop, etc.?)

If the IRQ flag (in EFLAGS) is off then the vCPU is not able to 
receive interrupts, regardless of local APIC state.
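As a minimal sketch of that check, assuming you already have the stuck 
vCPU's saved EFLAGS/RFLAGS value from the live dump: the interrupt flag 
(IF) is bit 9, so the test is a single mask. The helper name below is 
made up for illustration, it is not an existing kernel or KVM function.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFLAGS.IF (interrupt enable flag) is bit 9, i.e. mask 0x200. */
#define X86_EFLAGS_IF (UINT64_C(1) << 9)

/*
 * Hypothetical helper: returns true if the dumped flags value says the
 * vCPU has maskable interrupts enabled. 'rflags' is assumed to be the
 * register value taken from the dump of the stuck vCPU.
 */
static bool vcpu_irqs_enabled(uint64_t rflags)
{
	return (rflags & X86_EFLAGS_IF) != 0;
}
```

If this returns false for the stuck vCPU, the local APIC state is not 
the first thing to look at.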

> * For the L1 kernel:
>  - Dump a stacktrace when we detect a lockup.
>  - Detect a lockup and try to not alter the state.
>  - Have a reliable signal such that the L0 hypervisor can dump the lapic
>    structures and registers when csd_lock_wait detects a softlockup.

I'd also suggest adding a printk() to IPI receipt, to rule out the 
case where the IPI does arrive after the resend attempt but the CSD 
code never gets called. To make sure you only get messages after the 
CPU got stuck, add a 'locked_up' flag that signals this, and only 
print the messages while the lockup scenario is happening.

I'd do it by adding something like this to 
kernel/smp.c::generic_smp_call_function_single_interrupt():

	if (csd_locked_up)
		printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());

With this message in place, its absence would confirm that the IPI 
indeed never arrived on the stuck vCPU: if the IPI had been delivered, 
we would see the message.
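To show how the flag and the gated message fit together, here is a 
rough userspace model, not kernel code. It assumes the waiter sets the 
flag once it decides the lock is stuck; csd_mark_locked_up() and 
ipi_callback() are hypothetical stand-ins for the csd_lock_wait() 
timeout path and generic_smp_call_function_single_interrupt(), and 
printf() stands in for printk().

```c
#include <stdatomic.h>
#include <stdio.h>

/* Assumed flag: set once the waiter decides the CSD lock is stuck. */
static atomic_int csd_locked_up;

/* Stand-in for the timeout path of csd_lock_wait(). */
static void csd_mark_locked_up(void)
{
	atomic_store(&csd_locked_up, 1);
}

/*
 * Stand-in for generic_smp_call_function_single_interrupt().
 * Returns 1 if the debug message was printed, 0 otherwise, so the
 * gating can be checked without parsing the log.
 */
static int ipi_callback(int cpu)
{
	if (!atomic_load(&csd_locked_up))
		return 0;	/* normal operation: stay quiet */

	printf("CSD: Function call IPI callback on CPU#%d\n", cpu);
	return 1;
}
```

Before the flag is set the callback stays silent, so the log only 
contains IPIs that were received after the lockup was detected.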

Thanks,

	Ingo