linux-kernel - RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.21.1907180846290.1778@nanos.tec.linutronix.de>
Date:   Thu, 18 Jul 2019 09:00:50 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Dexuan Cui <decui@...rosoft.com>
cc:     Haiyang Zhang <haiyangz@...rosoft.com>,
        KY Srinivasan <kys@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Sasha Levin <Alexander.Levin@...rosoft.com>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        Michael Kelley <mikelley@...rosoft.com>,
        Long Li <longli@...rosoft.com>, vkuznets <vkuznets@...hat.com>,
        Ingo Molnar <mingo@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>, Borislav Petkov <bp@...en8.de>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "marcelo.cerri@...onical.com" <marcelo.cerri@...onical.com>,
        "driverdev-devel@...uxdriverproject.org" 
        <driverdev-devel@...uxdriverproject.org>,
        "olaf@...fle.de" <olaf@...fle.de>,
        "apw@...onical.com" <apw@...onical.com>,
        "jasowang@...hat.com" <jasowang@...hat.com>
Subject: RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU
 offlining

On Thu, 18 Jul 2019, Dexuan Cui wrote:
> > On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > This is the allocation when the CPU is brought online for the first
> > time. So what effect has zeroing at allocation time vs. offlining and
> > potentially receiving IPIs? That allocation is never freed.
> > 
> > Neither the comment nor the changelog make any sense to me.
> > 	tglx
> 
> That allocation was introduced by the commit
> a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").
> 
> I think it's ok to not free the page when a CPU is offlined: every
> CPU uses only 1 page and CPU offlining is not really a very usual
> operation except for the scenario of hibernation (and suspend-to-memory), 
> where the CPUs are quickly onlined again, when we resume from hibernation.
> IMO Vitaly intentionally decided to not free the page for simplicity of the
> code.
> 
> When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
> VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
> writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
> *always* uses hvp->apic_assist (which is updated by the hypervisor) to
> decide if it needs to write the EOI MSR:
> 
> static void hv_apic_eoi_write(u32 reg, u32 val)
> {
>         struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
> 
>         if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
>                 return;
> 
>         wrmsr(HV_X64_MSR_EOI, val, 0);
> }
> 
> When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
> 1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
> 2. we finish the remaining work of stopping this CPU;
> 3. this CPU is completed stopped.
> 
> Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
> and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
> every interrupt received, otherwise the hypervisor may not deliver further
> interrupts, which may be needed to stop this CPU completely.
> 
> So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
> "wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
> way is what I do in this patch. Alternatively, we can use the below patch:
> 
> @@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
>         local_irq_restore(flags);
>         free_page((unsigned long)input_pg);
> 
> -       if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> +       if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> +               local_irq_save(flags);
>                 wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
> +               hvp->apic_assist &= ~1;
> +               local_irq_restore(flags);
> +       }
> 
>         if (hv_reenlightenment_cb == NULL)
>                 return 0;
> 
> This second version needs 3+ lines, so I prefer the one-line version. :-)

Those are two different things. The GPF_ZERO allocation makes sense on it's
own but it _cannot_ prevent the following scenario:

    cpu_init()
      if (!hvp)
      	 hvp = vmalloc(...., GFP_ZERO);
    ...

    hvp->apic_assist |= 1;

#1   cpu_die()
      if (....)
           wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

   ---> IPI
   	if (!(hvp->apic_assist & 1))	
	   wrmsr(APIC_EOI);    <- PATH not taken

#3   cpu is dead

    cpu_init()
       if (!hvp)
          hvp = vmalloc(....m, GFP_ZERO);  <- NOT TAKEN because hvp != NULL

So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.

Allocating hvp with GFP_ZERO makes sense on it's own so the allocated
memory has a defined state, but that's a different story.

The 3 liner patch above makes way more sense and you can spare the
local_irq_save/restore by moving the whole condition into the
irq_save/restore region above.

Thanks,

	tglx