Message-ID: <20150413061450.GA10857@gmail.com>
Date: Mon, 13 Apr 2015 08:14:51 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Chris J Arges <chris.j.arges@...onical.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Rafael David Tinoco <inaddy@...ntu.com>,
Peter Anvin <hpa@...or.com>,
Jiang Liu <jiang.liu@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
Jens Axboe <axboe@...nel.dk>,
Frederic Weisbecker <fweisbec@...il.com>,
Gema Gomez <gema.gomez-solano@...onical.com>,
the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
* Chris J Arges <chris.j.arges@...onical.com> wrote:
> /sys/module/kvm_intel/parameters/enable_apicv on the affected
> hardware is not enabled, and unfortunately my hardware doesn't have
> the necessary features to enable it. So we are dealing with KVM's
> lapic implementation only.
That's actually pretty fortunate, as we don't have to worry about
hardware state nearly as much!
> FYI, I'm working on getting better data at the moment and here is my approach:
> * For the L0 kernel:
> - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
> the addresses of various useful structures)
> - Set up crash to live-dump kvm_lapic structures and associated registers for
> both vCPUs
It would also be nice to double check the stuck vCPU's normal CPU
state: is it truly able to receive interrupts? (IRQ flags are on, or
is it sitting in the idle loop, etc.?)
If the IRQ flag (in EFLAGS) is off then the vCPU is not able to
receive interrupts, regardless of local APIC state.
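For the live dump, the check boils down to testing bit 9 (IF) of the saved EFLAGS/RFLAGS value of the stuck vCPU. A minimal sketch, assuming you have the raw register value from the dump; the helper name and the EFLAGS_IF macro are made up here for illustration (the kernel's own name for the bit is X86_EFLAGS_IF):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 9 of EFLAGS/RFLAGS is IF, the interrupt enable flag. */
#define EFLAGS_IF (UINT64_C(1) << 9)

/*
 * Hypothetical helper: returns true if the dumped flags value has IF set,
 * i.e. the vCPU was not running with maskable interrupts disabled.
 */
static bool vcpu_irqs_enabled(uint64_t rflags)
{
	return (rflags & EFLAGS_IF) != 0;
}
```

A dumped value of 0x246 has IF set, while 0x46 does not. IF being set is only a necessary condition; it says nothing about the local APIC state.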
> * For the L1 kernel:
> - Dump a stacktrace when we detect a lockup.
> - Detect a lockup and try to not alter the state.
> - Have a reliable signal such that the L0 hypervisor can dump the lapic
> structures and registers when csd_lock_wait detects a softlockup.
I'd also suggest adding a printk() to IPI receipt, to rule out the
case where the IPI does arrive but the CSD code is simply not getting
called into after the IPI resend attempt. To make sure you only get
messages after the CPU got stuck, add a 'locked_up' flag that signals
this, and only print the messages while the lockup scenario is
happening.
I'd do it by adding something like this to
kernel/smp.c::generic_smp_call_function_single_interrupt():
	if (csd_locked_up)
		printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());
Having this message in place would let us confirm that the IPI indeed
never arrived on the stuck vCPU. (Because in that case we'd not get
this message.)
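To show how the flag and the message fit together, here is a rough user-space model, not kernel code. Everything in it is an assumption made for illustration: the timeout parameters, the csd_wait_poll() helper, the csd_ipi_reports counter, and the model_* function standing in for generic_smp_call_function_single_interrupt(). In the kernel the flag would be set from the csd_lock_wait() detection path instead.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag: set once the waiter decides the CSD lock is stuck. */
static atomic_bool csd_locked_up;

/* Hypothetical counter, so the effect is visible without parsing the log. */
static atomic_int csd_ipi_reports;

/*
 * Model of the waiting side: called after each poll of the CSD lock with
 * the time waited so far. Once the timeout is reached, the flag is set
 * and stays set.
 */
static void csd_wait_poll(unsigned long waited_ms, unsigned long timeout_ms)
{
	if (waited_ms >= timeout_ms)
		atomic_store(&csd_locked_up, true);
}

/*
 * Model of the entry of the function call IPI handler: it stays silent
 * in normal operation and only reports once a lockup has been flagged.
 */
static void model_call_function_single_interrupt(int cpu)
{
	if (atomic_load(&csd_locked_up)) {
		atomic_fetch_add(&csd_ipi_reports, 1);
		printf("CSD: Function call IPI callback on CPU#%d\n", cpu);
	}
	/* The real handler would flush the call-single queue here. */
}
```

With this in place, an IPI handled before the timeout prints nothing, and any IPI handled after the lockup was flagged prints one line per invocation.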
Thanks,
Ingo