Date:	Tue, 7 Apr 2015 14:15:11 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Chris J Arges <chris.j.arges@...onical.com>
Cc:	Ingo Molnar <mingo@...nel.org>,
	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	"the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Tue, Apr 7, 2015 at 1:59 PM, Chris J Arges
<chris.j.arges@...onical.com> wrote:
>
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
> [   22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> [   38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
> [   38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
> [   43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
> [   43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
> [   48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
> [   48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
> [   53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
> [   53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
> [   58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
> [   58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
> [   60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Ok, so it's not the IPI just "getting lost". I'm not hugely surprised.

But it does look like CPU1 never actually reacts to the IPI, even when re-sent:

> Looking at the call_single_queue I see the following (I crashed during the soft lockup):
>
> crash> p call_single_queue
> PER-CPU DATA TYPE:
>   struct llist_head call_single_queue;
> PER-CPU ADDRESSES:
>   [0]: ffff88013fc16580
>   [1]: ffff88013fd16580
> crash> list -s call_single_data ffff88013fc16580
> ffff88013fc16580
> struct call_single_data {
>   llist = {
>     next = 0x0
>   },
>   func = 0x0,
>   info = 0x0,
>   flags = 0
> }
> crash> list -s call_single_data ffff88013fd16580
> ffff88013fd16580
> struct call_single_data {
>   llist = {
>     next = 0xffff88013a517c08
>   },
>   func = 0x0,
>   info = 0x0,
>   flags = 0
> }
> ffff88013a517c08
> struct call_single_data {
>   llist = {
>     next = 0x0
>   },
>   func = 0xffffffff81067f30 <flush_tlb_func>,
>   info = 0xffff88013a517d00,
>   flags = 3
> }
>
> This seems consistent with previous crash dumps.

The above seems to show that CPU1 has never picked up the CSD.  Which
is consistent with CPU0 waiting forever for it.
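
For reference, this is roughly the loop CPU0 is spinning in while the CSD
stays locked - a sketch from my reading of kernel/smp.c around this version,
so treat the exact flag handling as approximate rather than verbatim:

	/*
	 * Sketch (not verbatim) of the waiter side in kernel/smp.c:
	 * a synchronous smp_call_function_single() ends up spinning
	 * here until the *target* CPU runs the function and clears
	 * the lock bit.
	 */
	static void csd_lock_wait(struct call_single_data *csd)
	{
		while (csd->flags & CSD_FLAG_LOCK)
			cpu_relax();
	}

And flags = 3 in your dump would be CSD_FLAG_LOCK plus the wait/synchronous
bit (assuming I'm decoding the bits right), i.e. CPU1 has never run the
callback and never unlocked it, which is exactly why CPU0 sits there forever.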

It really looks like CPU1 is simply not reacting to the IPI even when
resent. It's possibly masked at the APIC level, or it's somehow stuck.
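
If somebody wants to poke at the "masked or stuck at the APIC" theory,
something like the hack below, run on CPU1 itself (from NMI context, say),
would at least tell us whether the vector is pending in IRR or stuck
in-service in ISR. Purely hypothetical debug code - the helper name and the
register arithmetic are my assumptions, it's not anything in the tree:

	#include <linux/printk.h>
	#include <asm/apic.h>
	#include <asm/irq_vectors.h>

	/* Hypothetical debug helper: must run on the suspect CPU (CPU1). */
	static void csd_debug_dump_ipi_state(void)
	{
		unsigned int vec = CALL_FUNCTION_SINGLE_VECTOR;
		unsigned int off = (vec / 32) * 0x10;
		u32 irr = apic_read(APIC_IRR + off);	/* delivered, not yet accepted */
		u32 isr = apic_read(APIC_ISR + off);	/* accepted, no EOI yet */

		pr_info("csd debug: vec %u pending=%d in-service=%d\n",
			vec,
			!!(irr & (1U << (vec % 32))),
			!!(isr & (1U << (vec % 32))));
	}

If neither bit is ever set even after the re-send, the IPI isn't reaching
CPU1's local APIC at all, which would point more at the sender or the
virtualization layer than at a wedged handler.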

> As I mentioned here: https://lkml.org/lkml/2015/4/6/186
> I'm able to reproduce this easily on certain hardware w/
> b6b8a1451fc40412c57d10c94b62e22acab28f94 applied and not
> 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 on the L0 kernel. I think it makes
> sense to get as clear a picture as possible with this more trivial reproducer, then
> re-run this on an L0 w/ v4.0-rcX. Most likely the latter case will take many days
> to reproduce.

Well, those commits imply that it's a kvm virtualization problem, but at
least DaveJ's problems were hit while running on raw hardware with no
virtualization.

Ho humm.

                      Linus