Message-ID: <CALCETrUBzP1YPob3JhOZZepBPbyFXuTG2pb8Sq_o+c_WpioGTw@mail.gmail.com>
Date: Tue, 30 Aug 2022 12:15:23 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Ashok Raj <ashok.raj@...el.com>
Cc: Borislav Petkov <bp@...en8.de>,
Thomas Gleixner <tglx@...utronix.de>,
Tony Luck <tony.luck@...el.com>,
Dave Hansen <dave.hansen@...el.com>,
LKML Mailing List <linux-kernel@...r.kernel.org>,
X86-kernel <x86@...nel.org>,
Tom Lendacky <thomas.lendacky@....com>,
Jacob Jun Pan <jacob.jun.pan@...el.com>
Subject: Re: [PATCH v3 5/5] x86/microcode: Place siblings in NMI loop while
update in progress
On Tue, Aug 16, 2022 at 10:12 PM Ashok Raj <ashok.raj@...el.com> wrote:
>
> Microcode updates need a guarantee that the thread sibling that is waiting
> for the update to finish on the primary core will not execute any
> instructions until the update is complete. This is required to guarantee
> any MSR or instruction that's being patched will be executed before the
> update is complete.
>
> After the stop_machine() rendezvous, an NMI handler is registered. If an
> NMI were to happen while the microcode update is not complete, the
> secondary thread will spin until the ucode update state is cleared.
>
> A couple of choices were discussed:
>
> 1. Rendezvous inside the NMI handler, and also perform the update from
> within the handler. This seemed too risky: it might cause instability
> with the races that we would need to solve, making it a difficult
> choice.
> 1.a Since the primary thread of every core is performing a wrmsr
> for the update, once the wrmsr has started, it can't be
> interrupted. Hence it's not required to NMI the primary thread of
> the core. Only the secondary thread needs to be parked in NMI
> before the update begins.
> Suggested by Andy Cooper
> 2. Thomas (tglx) suggested that we could look into masking all the
> LVT-originated NMIs, such as LINT1 and the perf control LVT entries.
> Since we are in the rendezvous loop, we don't need to worry about any
> NMI IPIs generated by the OS.
>
> The one source we have no control over is the ACPI mechanism of sending
> notifications to the kernel for Firmware First Processing (FFM).
> Apparently there is a PCH register that the BIOS, from SMI, writes to
> generate such an interrupt (ACPI GHES).
> 3. This is a simpler option. The OS registers an NMI handler and doesn't
> do any NMI rendezvous dance. But if an NMI were to happen, we check
> whether any of the CPU's thread siblings have an update in progress;
> only those CPUs would take an NMI. The thread performing the wrmsr()
> will only take an NMI after the wrmsr 0x79 flow completes.
>
> [ Lutomirski thinks this is weak: the window between taking the
> interrupt and reaching the registered callback handler is still
> exposed. ]
>
> Seems like 1.a is the best candidate.
>
> The algorithm is something like this:
>
> After stop_machine() all threads are executing __reload_late()
>
> nmi_callback()
> {
> if (!in_ucode_update)
> return NMI_DONE;
> if (cpu not in sibling_mask)
> return NMI_DONE;
> signal that this sibling reached NMI so the primary can continue
>
> while (cpu in sibling_mask)
> wait;
> return NMI_HANDLED;
> }
>
> __reload_late()
> {
>
> entry_rendezvous(&late_cpus_in);
> set_mcip()
> if (this_cpu is first_cpu in the core)
> wait for siblings to drop in NMI
> apply_microcode()
> else {
> send self_ipi(NMI_VECTOR);
> goto wait_for_siblings;
> }
>
> wait_for_siblings:
> exit_rendezvous(&late_cpus_out);
> clear_mcip
> }
>
> reload_late()
> {
> register_nmi_handler()
> prepare_mask of all sibling cpus()
> update state = ucode in progress;
> stop_machine();
> unregister_nmi_handler();
> }
>
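For concreteness, here is roughly what I understand the nmi_callback
above to mean in real code. This is just a sketch against the names in
this patch (in_ucode_update, all_sibling_mask, nmi_siblings), not the
actual implementation:

static int ucode_nmi_cb(unsigned int val, struct pt_regs *regs)
{
	if (!atomic_read(&in_ucode_update))
		return NMI_DONE;	/* not our NMI, let others handle it */

	if (!cpumask_test_cpu(smp_processor_id(), &all_sibling_mask))
		return NMI_DONE;	/* primary thread: never parked here */

	/* signal the primary that this sibling is parked in NMI context */
	atomic_inc(&nmi_siblings);

	/* spin until the primary clears this CPU after the wrmsr */
	while (cpumask_test_cpu(smp_processor_id(), &all_sibling_mask))
		;

	return NMI_HANDLED;
}
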
> Signed-off-by: Ashok Raj <ashok.raj@...el.com>
> ---
> arch/x86/kernel/cpu/microcode/core.c | 218 ++++++++++++++++++++++++++-
> 1 file changed, 211 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
> index d24e1c754c27..fd3b8ce2c82a 100644
> --- a/arch/x86/kernel/cpu/microcode/core.c
> +++ b/arch/x86/kernel/cpu/microcode/core.c
> @@ -39,7 +39,9 @@
> #include <asm/processor.h>
> #include <asm/cmdline.h>
> #include <asm/setup.h>
> +#include <asm/apic.h>
> #include <asm/mce.h>
> +#include <asm/nmi.h>
>
> #define DRIVER_VERSION "2.2"
>
> @@ -411,6 +413,13 @@ static int check_online_cpus(void)
>
> static atomic_t late_cpus_in;
> static atomic_t late_cpus_out;
> +static atomic_t nmi_cpus; // number of CPUs that enter NMI
> +static atomic_t nmi_timeouts; // number of siblings that timeout
> +static atomic_t nmi_siblings;  // Number of siblings that enter NMI
> +static atomic_t in_ucode_update; // Are we in microcode update?
> +static atomic_t nmi_exit; // Siblings that exit NMI
Some of these variables seem oddly managed and just for debugging.
> +
> +static struct cpumask all_sibling_mask;
>
> static int __wait_for_cpus(atomic_t *t, long long timeout)
> {
> @@ -433,6 +442,104 @@ static int __wait_for_cpus(atomic_t *t, long long timeout)
> return 0;
> }
>
> +struct core_rendez {
> + int num_core_cpus;
> + atomic_t callin;
> + atomic_t core_done;
> +};
> +
> +static DEFINE_PER_CPU(struct core_rendez, core_sync);
> +
> +static int __wait_for_update(atomic_t *t, long long timeout)
> +{
> + while (!atomic_read(t)) {
> + if (timeout < SPINUNIT)
> + return 1;
Since you're using signed arithmetic, timeout < 0 would be a less
error-prone condition.
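Concretely, something like this (a sketch; I'm assuming the loop body
otherwise mirrors __wait_for_cpus(), i.e. ndelay() plus a decrement):

static int __wait_for_update(atomic_t *t, long long timeout)
{
	while (!atomic_read(t)) {
		if (timeout < 0)	/* signed, so this also catches underflow */
			return 1;
		ndelay(SPINUNIT);
		timeout -= SPINUNIT;
	}
	return 0;
}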
Anyway, this patch is full of debugging stuff, so I won't do a
line-for-line review, but I do have a suggestion. Instead of all this
bookkeeping, maybe just track the number of cores to park in NMI, kind
of like this (hand-wavy pseudocode):
static struct cpumask cpus_to_park_in_nmi;
/* fill out the cpumask */
static atomic_t nmi_parked_cpus;
static bool park_enabled;
Then, after __wait_for_cpus (once everything is stopped), one cpu sets
up the nmi handler, sets park_enabled, and sends the NMI IPI to all
the CPUs parked in there. The handler does:
if (this cpu is in cpus_to_park_in_nmi) {
WARN_ON_ONCE(!park_enabled);
atomic_inc(&nmi_parked_cpus);
while (READ_ONCE(park_enabled))
; /* because Intel won't promise that cpu_relax() is okay */
atomic_dec(&nmi_parked_cpus);
}
and the CPUs that aren't supposed to park wait for nmi_parked_cpus to
have the right value. After the update, park_enabled gets cleared and
everything resumes.
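Fleshed out a bit (still hand-wavy, and assuming the handler has been
registered with register_nmi_handler(NMI_LOCAL, ...) before the NMI IPI
goes out; apply_microcode_local() is a stand-in name for the actual
0x79 wrmsr path):

static int park_nmi_cb(unsigned int val, struct pt_regs *regs)
{
	if (!cpumask_test_cpu(smp_processor_id(), &cpus_to_park_in_nmi))
		return NMI_DONE;

	WARN_ON_ONCE(!READ_ONCE(park_enabled));
	atomic_inc(&nmi_parked_cpus);
	while (READ_ONCE(park_enabled))
		;	/* because Intel won't promise that cpu_relax() is okay */
	atomic_dec(&nmi_parked_cpus);

	return NMI_HANDLED;
}

and the CPU driving the update does:

	WRITE_ONCE(park_enabled, true);
	apic->send_IPI_mask(&cpus_to_park_in_nmi, NMI_VECTOR);

	/* wait until every sibling is parked */
	while (atomic_read(&nmi_parked_cpus) !=
	       cpumask_weight(&cpus_to_park_in_nmi))
		;

	apply_microcode_local();		/* hypothetical name */
	WRITE_ONCE(park_enabled, false);	/* releases the parked CPUs */

	/* optionally: wait for nmi_parked_cpus to drain back to zero */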
Does this seem reasonable?
I was thinking it would be straightforward to have __wait_for_cpus
handle this, but that would only really be easy in a language with
closures or continuation passing.
--Andy