linux-kernel - Re: [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <D9668AEC-9650-4840-A03D-553994B414DD@zytor.com>
Date:   Tue, 25 Apr 2023 17:10:42 -0700
From:   "H. Peter Anvin" <hpa@...or.com>
To:     Dave Hansen <dave.hansen@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Tony Battersby <tonyb@...ernetics.com>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org
CC:     Mario Limonciello <mario.limonciello@....com>,
        Tom Lendacky <thomas.lendacky@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Andi Kleen <ak@...ux.intel.com>
Subject: Re: [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff

On April 25, 2023 3:29:49 PM PDT, Dave Hansen <dave.hansen@...el.com> wrote:
>On 4/25/23 14:05, Thomas Gleixner wrote:
>> The only consequence of looking at bit 0 of some random other leaf is
>> that all CPUs which run stop_this_cpu() issue WBINVD in parallel, which
>> is slow but should not be a fatal issue.
>> 
>> Tony observed this is a 50% chance to hang, which means this is a timing
>> issue.
>
>I _think_ the system in question is a dual-socket Westmere.  I don't see
>any obvious errata that we could pin this on:
>
>> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
>
>Andi Kleen had an interesting theory.  WBINVD is a pretty expensive
>operation.  It's possible that it has some degenerative behavior when
>it's called on a *bunch* of CPUs all at once (which this path can do).
>If the instruction takes too long, it could trigger one of the CPU's
>internal lockup detectors and trigger a machine check.  At that point,
>all hell breaks loose.
>
>I don't know the cache coherency protocol well enough to say for sure,
>but I wonder if there's a storm of cache coherency traffic as all those
>lines get written back.  One of the CPUs gets starved from making enough
>forward progress and trips a CPU-internal watchdog.
>
>Andi also says that it _should_ log something in the machine check banks
>when this happens so there should be at least some kind of breadcrumb.
>
>Either way, I'm hoping this hand waving satiates tglx's morbid curiosity
>about hardware that came out from before I even worked at Intel. ;)

"Pretty expensive" doesn't really cover it. It is by far the longest time an x86 CPU can block out all outside events.