linux-kernel - Re: [debug PATCHes] Re: smp_call_function

Date:	Wed, 1 Apr 2015 10:00:23 +0800
From:	Daniel J Blueman <daniel@...ra.org>
To:	Chris J Arges <chris.j.arges@...onical.com>
Cc:	Linux Kernel <linux-kernel@...r.kernel.org>,
	"x86@...nel.org" <x86@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>
Subject: Re: [debug PATCHes] Re: smp_call_function_single lockups

On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level triggered IRQs, which
> > stay around until actively handled. Edge triggered irqs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
> >
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
> ( Although I've heard of others that have reproduced on other machines. )
>
> 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
> nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue. I've observed that running
> KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
> If this doesn't reproduce then you can do the following:
> 5) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
> L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls to
> trigger IPIs but have been unable to create a more simple reproducer.

A non-intrusive way of generating a lot of IPIs is to call
stop_machine() via something like:

while :; do
    echo "base=0x20000000000 size=0x8000000 type=write-back" >/proc/mtrr
    echo "disable=4" >| /proc/mtrr
done

Of course, ensure base is above DRAM and any 64-bit MMIO so there are
no side effects, and check that the new range lands in entry 4 (the
index passed to disable=). Onlining and offlining cores in parallel
will also generate IPIs.
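
For the hotplug route, here is a rough sketch, assuming the usual sysfs
CPU hotplug files under /sys/devices/system/cpu. The toggle_cpu name and
the CPU_SYSFS variable are only illustrative, and cpu0 often cannot be
offlined:

```shell
#!/bin/sh
# Sketch only: cycle one CPU offline/online to provoke IPIs.
# CPU_SYSFS is a hypothetical override so the path can be changed.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

toggle_cpu() {
    # $1 = CPU number, $2 = number of offline/online cycles
    f="$CPU_SYSFS/cpu$1/online"
    if [ ! -w "$f" ]; then
        echo "cpu$1: no writable online file" >&2
        return 1
    fi
    i=0
    while [ "$i" -lt "$2" ]; do
        echo 0 > "$f" || return 1   # take the CPU offline
        echo 1 > "$f" || return 1   # bring it back online
        i=$((i + 1))
    done
}
```

Run as root, something like "toggle_cpu 1 1000 & toggle_cpu 2 1000 & wait"
would cycle two cores in parallel.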

Dan
-- 
Daniel J Blueman