Message-ID: <20150413061450.GA10857@gmail.com>
Date:	Mon, 13 Apr 2015 08:14:51 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Chris J Arges <chris.j.arges@...onical.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <chris.j.arges@...onical.com> wrote:

> /sys/module/kvm_intel/parameters/enable_apicv on the affected 
> hardware is not enabled, and unfortunately my hardware doesn't have 
> the necessary features to enable it. So we are dealing with KVM's 
> lapic implementation only.

That's actually pretty fortunate, as we don't have to worry about 
hardware state nearly as much!

> FYI, I'm working on getting better data at the moment and here is my approach:
> * For the L0 kernel:
>  - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
>    the addresses of various useful structures)
>  - Setup crash to live dump kvm_lapic structures and associated registers for
>    both vCPUs

It would also be nice to double check the stuck vCPU's normal CPU 
state: is it truly able to receive interrupts? (IRQ flags are on, or 
is it sitting in the idle loop, etc.?)

If the IRQ flag (in EFLAGS) is off then the vCPU is not able to 
receive interrupts, regardless of local APIC state.
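
As a rough illustration of that check: on x86 the interrupt-enable flag (IF) is bit 9 of EFLAGS/RFLAGS, so a saved flags value from a vCPU register dump can be tested directly. The helper name below is hypothetical and this is a plain user-space sketch, not code from the kernel or from crash:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 9 of EFLAGS/RFLAGS is the interrupt-enable flag (IF) on x86. */
#define EFLAGS_IF (UINT64_C(1) << 9)

/*
 * Hypothetical helper: given a saved RFLAGS value (for example one taken
 * from a vCPU register dump), report whether maskable interrupts were
 * enabled at that point.
 */
static bool eflags_irqs_enabled(uint64_t rflags)
{
	return (rflags & EFLAGS_IF) != 0;
}
```

A dumped value such as 0x246 has IF set, while 0x46 does not; in the latter case the vCPU would not take the IPI no matter what the local APIC state looks like.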

> * For the L1 kernel:
>  - Dump a stacktrace when we detect a lockup.
>  - Detect a lockup and try to not alter the state.
>  - Have a reliable signal such that the L0 hypervisor can dump the lapic
>    structures and registers when csd_lock_wait detects a softlockup.

I'd also suggest adding a printk() to IPI receipt, to rule out the 
possibility that the CSD code is simply not being called into after 
the IPI resend attempt. To make sure you only get messages after the 
CPU got stuck, add a 'locked_up' flag that signals this, and only 
print the messages while the lockup scenario is happening.

I'd do it by adding something like this to 
kernel/smp.c::generic_smp_call_function_single_interrupt():

	if (csd_locked_up)
		printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());

With this message in place, its absence after a detected lockup would 
confirm that the IPI indeed never arrived on the stuck vCPU.
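
To show how the flag is meant to gate the output, here is a small user-space model of the idea. It is only a sketch: csd_mark_locked_up() and csd_ipi_received() are made-up stand-ins for the lockup detection path and the IPI receive path, and snprintf() stands in for printk() so the behaviour can be checked outside the kernel:

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed flag: set once the lockup detection path decides a CPU is stuck. */
static volatile bool csd_locked_up;

/* Hypothetical stand-in for the lockup detection path. */
static void csd_mark_locked_up(void)
{
	csd_locked_up = true;
}

/*
 * Hypothetical stand-in for the IPI receive path. It stays silent until a
 * lockup has been flagged, then formats the debug message into buf.
 * Returns the number of characters written, or 0 when nothing was printed.
 */
static int csd_ipi_received(int cpu, char *buf, size_t len)
{
	if (!csd_locked_up)
		return 0;

	return snprintf(buf, len,
			"CSD: Function call IPI callback on CPU#%d\n", cpu);
}
```

Before csd_mark_locked_up() runs, every call to csd_ipi_received() returns 0, so normal IPI traffic does not flood the log; after it runs, each received IPI produces one line.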

Thanks,

	Ingo
