Message-ID: <4f3290a5-7fd9-1d40-5183-2fffcf10b2f3@cybernetics.com>
Date: Fri, 16 Jun 2023 12:36:22 -0400
From: Tony Battersby <tonyb@...ernetics.com>
To: Ashok Raj <ashok.raj@...el.com>,
Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
Mario Limonciello <mario.limonciello@....com>,
Tom Lendacky <thomas.lendacky@....com>,
Ashok Raj <ashok.raj@...ux.intel.com>,
Tony Luck <tony.luck@...el.com>,
Arjan van de Veen <arjan@...ux.intel.com>,
Eric Biederman <ebiederm@...ssion.com>
Subject: Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust
On 6/15/23 21:58, Ashok Raj wrote:
> Hi Thomas,
>
> On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
>> Tony reported intermittent lockups on poweroff. His analysis identified the
>> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
>> on SME enabled machines a kexec() does not leave any stale data in the
>> caches when switching from encrypted to non-encrypted mode or vice versa.
>>
>> That wbinvd() is conditional on the SME feature bit which is read directly
>> from CPUID. But that readout does not check whether the CPUID leaf is
>> available or not. If it's not available the CPU will return the value of
>> the highest supported leaf instead. Depending on the content the "SME" bit
>> might be set or not.
>>
>> That's incorrect but harmless. Making the CPUID readout conditional makes
>> the observed hangs go away, but it does not fix the underlying problem:
>>
>> CPU0                                  CPU1
>>
>>  stop_other_cpus()
>>    send_IPIs(REBOOT);                  stop_this_cpu()
>>    while (num_online_cpus() > 1);        set_online(false);
>>    proceed... -> hang
>>                                          wbinvd()
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>>
>> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
>> which causes the system to hang.
>>
>> This issue exists independent of WBINVD, but the delays caused by WBINVD
>> make it more prominent.
>>
>> Make this more robust by adding a cpumask which is initialized to the
>> online CPU mask before sending the IPIs and CPUs clear their bit in
>> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
>> become empty in stop_other_cpus() instead of watching num_online_cpus().
>>
>> The cpumask cannot plug all holes either, but it's better than a raw
>> counter and allows to restrict the NMI fallback IPI to be sent only to
>> the CPUs which have not reported within the timeout window.
>>
>> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
>> Reported-by: Tony Battersby <tonyb@...ernetics.com>
>> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
>> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
>> ---
>> V3: Use a cpumask to make the NMI case slightly safer - Ashok
>> ---
>> arch/x86/include/asm/cpu.h | 2 +
>> arch/x86/kernel/process.c | 23 +++++++++++++-
>> arch/x86/kernel/smp.c | 71 +++++++++++++++++++++++++++++++--------------
>> 3 files changed, 73 insertions(+), 23 deletions(-)
> I tested them and they seem to work fine on my system.
>
> It would be great if Tony could check them on his setup.
>
plain 6.4-rc6: 50% failure rate
  poweroff success: 2
  poweroff fail: 2

6.4-rc6 with tglx v3 patch #1 only: 0% failure rate
  poweroff success: 10
  poweroff fail: 0

6.4-rc6 with all 7 tglx v3 patches: 0% failure rate
  poweroff success: 10
  poweroff fail: 0

Fixes my problem.
Tested-by: Tony Battersby <tonyb@...ernetics.com>
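
For readers following the thread, the race in the quoted description and the cpumask handshake that closes it can be modelled in plain userspace C. This is only a rough sketch, not the kernel code from the patch: the 64-bit atomic words stand in for the kernel's cpumask types, and the names NR_CPUS, online_mask, stop_mask_init(), target_set_offline(), target_report_stopped(), old_wait_done() and new_wait_done() are all hypothetical. Leaving the initiating CPU out of the mask is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8 /* hypothetical CPU count for this sketch */

/* Stand-in for cpu_online_mask: one bit per online CPU. */
static _Atomic uint64_t online_mask;
/* Stand-in for the new mask: one bit per CPU that still has to stop. */
static _Atomic uint64_t cpus_stop_mask;

/* Initiator: snapshot the online CPUs before sending the reboot IPIs.
 * Assumption: the initiator does not wait for itself. */
static void stop_mask_init(unsigned int self)
{
    assert(self < NR_CPUS);
    uint64_t mask = atomic_load(&online_mask) & ~(UINT64_C(1) << self);
    atomic_store(&cpus_stop_mask, mask);
}

/* Target, step 1: mark itself offline (happens before the cache flush). */
static void target_set_offline(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    atomic_fetch_and(&online_mask, ~(UINT64_C(1) << cpu));
}

/* Target, step 2: report completion only after the slow flush is done. */
static void target_report_stopped(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    atomic_fetch_and(&cpus_stop_mask, ~(UINT64_C(1) << cpu));
}

/* Old wait condition: at most one CPU is still marked online. */
static bool old_wait_done(void)
{
    uint64_t mask = atomic_load(&online_mask);
    unsigned int count = 0;

    for (; mask != 0; mask >>= 1)
        count += (unsigned int)(mask & 1);
    return count <= 1;
}

/* New wait condition: every target has cleared its bit. */
static bool new_wait_done(void)
{
    return atomic_load(&cpus_stop_mask) == 0;
}
```

With two CPUs online and CPU0 initiating, calling target_set_offline(1) already makes old_wait_done() return true while CPU1 would still be inside WBINVD. new_wait_done() stays false until target_report_stopped(1) runs, and that is the window the cpumask closes.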