linux-kernel - Re: [PATCH v8 20/40] x86/sev: Use SEV-SNP AP creation to start secondary CPUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Yc8jerEP5CrxfFi4@zn.tnic>
Date:   Fri, 31 Dec 2021 16:36:26 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     Brijesh Singh <brijesh.singh@....com>,
        Vlastimil Babka <vbabka@...e.cz>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        linux-efi@...r.kernel.org, platform-driver-x86@...r.kernel.org,
        linux-coco@...ts.linux.dev, linux-mm@...ck.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Joerg Roedel <jroedel@...e.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "H. Peter Anvin" <hpa@...or.com>, Ard Biesheuvel <ardb@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jim Mattson <jmattson@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Sergio Lopez <slp@...hat.com>, Peter Gonda <pgonda@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Dov Murik <dovmurik@...ux.ibm.com>,
        Tobin Feldman-Fitzthum <tobin@....com>,
        Michael Roth <michael.roth@....com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Andi Kleen <ak@...ux.intel.com>,
        "Dr . David Alan Gilbert" <dgilbert@...hat.com>,
        tony.luck@...el.com, marcorr@...gle.com,
        sathyanarayanan.kuppuswamy@...ux.intel.com
Subject: Re: [PATCH v8 20/40] x86/sev: Use SEV-SNP AP creation to start
 secondary CPUs

On Fri, Dec 10, 2021 at 09:43:12AM -0600, Brijesh Singh wrote:
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 123a96f7dff2..38c14601ae4a 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -104,6 +104,7 @@ enum psc_op {
>  	(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
>  
>  #define GHCB_HV_FT_SNP			BIT_ULL(0)
> +#define GHCB_HV_FT_SNP_AP_CREATION	(BIT_ULL(1) | GHCB_HV_FT_SNP)

Why is bit 0 ORed in? Because it "Requires SEV-SNP Feature."?

You can still enforce that requirement in the test though.

Or all those SEV features should not be bits but masks -
GHCB_HV_FT_SNP_AP_CREATION_MASK for example, seeing how the others
require the previous bits to be set too.

...

>  static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
>  DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
>  
> +static DEFINE_PER_CPU(struct sev_es_save_area *, snp_vmsa);

This is what I mean: the struct is called "sev_es... " but the variable
"snp_...". I.e., it is all sev_<something>.

> +
>  static __always_inline bool on_vc_stack(struct pt_regs *regs)
>  {
>  	unsigned long sp = regs->sp;
> @@ -814,6 +818,231 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
>  	pvalidate_pages(vaddr, npages, 1);
>  }
>  
> +static int snp_set_vmsa(void *va, bool vmsa)
> +{
> +	u64 attrs;
> +
> +	/*
> +	 * The RMPADJUST instruction is used to set or clear the VMSA bit for
> +	 * a page. A change to the VMSA bit is only performed when running
> +	 * at VMPL0 and is ignored at other VMPL levels. If too low of a target

What does "too low" mean here exactly?

The kernel is not at VMPL0 but the specified level is lower? Weird...

> +	 * VMPL level is specified, the instruction can succeed without changing
> +	 * the VMSA bit should the kernel not be in VMPL0. Using a target VMPL
> +	 * level of 1 will return a FAIL_PERMISSION error if the kernel is not
> +	 * at VMPL0, thus ensuring that the VMSA bit has been properly set when
> +	 * no error is returned.

We do check whether we run at VMPL0 earlier when starting the guest -
see enforce_vmpl0().

I don't think you need any of that additional verification here - just
assume you are at VMPL0.

> +	 */
> +	attrs = 1;
> +	if (vmsa)
> +		attrs |= RMPADJUST_VMSA_PAGE_BIT;
> +
> +	return rmpadjust((unsigned long)va, RMP_PG_SIZE_4K, attrs);
> +}
> +
> +#define __ATTR_BASE		(SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK)
> +#define INIT_CS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_READ_MASK | SVM_SELECTOR_CODE_MASK)
> +#define INIT_DS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_WRITE_MASK)
> +
> +#define INIT_LDTR_ATTRIBS	(SVM_SELECTOR_P_MASK | 2)
> +#define INIT_TR_ATTRIBS		(SVM_SELECTOR_P_MASK | 3)
> +
> +static void *snp_safe_alloc_page(void)

safe?

And you don't need to say "safe" - snp_alloc_vmsa_page() is perfectly fine.

> +{
> +	unsigned long pfn;
> +	struct page *p;
> +
> +	/*
> +	 * Allocate an SNP safe page to workaround the SNP erratum where
> +	 * the CPU will incorrectly signal an RMP violation  #PF if a
> +	 * hugepage (2mb or 1gb) collides with the RMP entry of VMSA page.

		2MB or 1GB

Collides how? The 4K frame is inside the hugepage?

> +	 * The recommeded workaround is to not use the large page.

Unknown word [recommeded] in comment, suggestions:
        ['recommended', 'recommend', 'recommitted', 'commended', 'commandeered']

> +	 *
> +	 * Allocate one extra page, use a page which is not 2mb aligned

2MB-aligned

> +	 * and free the other.
> +	 */
> +	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> +	if (!p)
> +		return NULL;
> +
> +	split_page(p, 1);
> +
> +	pfn = page_to_pfn(p);
> +	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
> +		pfn++;
> +		__free_page(p);
> +	} else {
> +		__free_page(pfn_to_page(pfn + 1));
> +	}

AFAICT, this is doing all this stuff so that you can make sure you get a
non-2M-aligned page. I wonder if there's a way to simply ask mm to give
you such page directly.

vbabka?

> +
> +	return page_address(pfn_to_page(pfn));
> +}
> +
> +static int wakeup_cpu_via_vmgexit(int apic_id, unsigned long start_ip)
> +{
> +	struct sev_es_save_area *cur_vmsa, *vmsa;
> +	struct ghcb_state state;
> +	unsigned long flags;
> +	struct ghcb *ghcb;
> +	int cpu, err, ret;
> +	u8 sipi_vector;
> +	u64 cr4;
> +
> +	if ((sev_hv_features & GHCB_HV_FT_SNP_AP_CREATION) != GHCB_HV_FT_SNP_AP_CREATION)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Verify the desired start IP against the known trampoline start IP
> +	 * to catch any future new trampolines that may be introduced that
> +	 * would require a new protected guest entry point.
> +	 */
> +	if (WARN_ONCE(start_ip != real_mode_header->trampoline_start,
> +		      "Unsupported SEV-SNP start_ip: %lx\n", start_ip))
> +		return -EINVAL;
> +
> +	/* Override start_ip with known protected guest start IP */
> +	start_ip = real_mode_header->sev_es_trampoline_start;
> +
> +	/* Find the logical CPU for the APIC ID */
> +	for_each_present_cpu(cpu) {
> +		if (arch_match_cpu_phys_id(cpu, apic_id))
> +			break;
> +	}
> +	if (cpu >= nr_cpu_ids)
> +		return -EINVAL;
> +
> +	cur_vmsa = per_cpu(snp_vmsa, cpu);
> +
> +	/*
> +	 * A new VMSA is created each time because there is no guarantee that
> +	 * the current VMSA is the kernels or that the vCPU is not running. If

kernel's.

And if it is not the kernel's, whose it is?

> +	 * an attempt was done to use the current VMSA with a running vCPU, a
> +	 * #VMEXIT of that vCPU would wipe out all of the settings being done
> +	 * here.

I don't understand - this is waking up a CPU, how can it ever be a
running vCPU which is using the current VMSA?!

There is per_cpu(snp_vmsa, cpu), who else can be using that one currently?

> +	 */
> +	vmsa = (struct sev_es_save_area *)snp_safe_alloc_page();
> +	if (!vmsa)
> +		return -ENOMEM;
> +
> +	/* CR4 should maintain the MCE value */
> +	cr4 = native_read_cr4() & X86_CR4_MCE;
> +
> +	/* Set the CS value based on the start_ip converted to a SIPI vector */
> +	sipi_vector		= (start_ip >> 12);
> +	vmsa->cs.base		= sipi_vector << 12;
> +	vmsa->cs.limit		= 0xffff;
> +	vmsa->cs.attrib		= INIT_CS_ATTRIBS;
> +	vmsa->cs.selector	= sipi_vector << 8;
> +
> +	/* Set the RIP value based on start_ip */
> +	vmsa->rip		= start_ip & 0xfff;
> +
> +	/* Set VMSA entries to the INIT values as documented in the APM */
> +	vmsa->ds.limit		= 0xffff;
> +	vmsa->ds.attrib		= INIT_DS_ATTRIBS;
> +	vmsa->es		= vmsa->ds;
> +	vmsa->fs		= vmsa->ds;
> +	vmsa->gs		= vmsa->ds;
> +	vmsa->ss		= vmsa->ds;
> +
> +	vmsa->gdtr.limit	= 0xffff;
> +	vmsa->ldtr.limit	= 0xffff;
> +	vmsa->ldtr.attrib	= INIT_LDTR_ATTRIBS;
> +	vmsa->idtr.limit	= 0xffff;
> +	vmsa->tr.limit		= 0xffff;
> +	vmsa->tr.attrib		= INIT_TR_ATTRIBS;
> +
> +	vmsa->efer		= 0x1000;	/* Must set SVME bit */

verify_comment_style: Warning: No tail comments please:
 arch/x86/kernel/sev.c:954 [+   vmsa->efer              = 0x1000;       /* Must set SVME bit */]

> +	vmsa->cr4		= cr4;
> +	vmsa->cr0		= 0x60000010;
> +	vmsa->dr7		= 0x400;
> +	vmsa->dr6		= 0xffff0ff0;
> +	vmsa->rflags		= 0x2;
> +	vmsa->g_pat		= 0x0007040600070406ULL;
> +	vmsa->xcr0		= 0x1;
> +	vmsa->mxcsr		= 0x1f80;
> +	vmsa->x87_ftw		= 0x5555;
> +	vmsa->x87_fcw		= 0x0040;

Yah, those definitely need macros or at least comments ontop denoting
what those naked values are.

> +
> +	/*
> +	 * Set the SNP-specific fields for this VMSA:
> +	 *   VMPL level
> +	 *   SEV_FEATURES (matches the SEV STATUS MSR right shifted 2 bits)
> +	 */

Like this^^

> +	vmsa->vmpl		= 0;
> +	vmsa->sev_features	= sev_status >> 2;
> +
> +	/* Switch the page over to a VMSA page now that it is initialized */
> +	ret = snp_set_vmsa(vmsa, true);
> +	if (ret) {
> +		pr_err("set VMSA page failed (%u)\n", ret);
> +		free_page((unsigned long)vmsa);
> +
> +		return -EINVAL;
> +	}
> +
> +	/* Issue VMGEXIT AP Creation NAE event */
> +	local_irq_save(flags);
> +
> +	ghcb = __sev_get_ghcb(&state);
> +
> +	vc_ghcb_invalidate(ghcb);
> +	ghcb_set_rax(ghcb, vmsa->sev_features);
> +	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_CREATION);
> +	ghcb_set_sw_exit_info_1(ghcb, ((u64)apic_id << 32) | SVM_VMGEXIT_AP_CREATE);
> +	ghcb_set_sw_exit_info_2(ghcb, __pa(vmsa));
> +
> +	sev_es_wr_ghcb_msr(__pa(ghcb));
> +	VMGEXIT();
> +
> +	if (!ghcb_sw_exit_info_1_is_valid(ghcb) ||
> +	    lower_32_bits(ghcb->save.sw_exit_info_1)) {
> +		pr_alert("SNP AP Creation error\n");

alert?

> +		ret = -EINVAL;
> +	}
> +
> +	__sev_put_ghcb(&state);
> +
> +	local_irq_restore(flags);
> +
> +	/* Perform cleanup if there was an error */
> +	if (ret) {
> +		err = snp_set_vmsa(vmsa, false);
> +		if (err)
> +			pr_err("clear VMSA page failed (%u), leaking page\n", err);
> +		else
> +			free_page((unsigned long)vmsa);

That...

> +
> +		vmsa = NULL;
> +	}
> +
> +	/* Free up any previous VMSA page */
> +	if (cur_vmsa) {
> +		err = snp_set_vmsa(cur_vmsa, false);
> +		if (err)
> +			pr_err("clear VMSA page failed (%u), leaking page\n", err);
> +		else
> +			free_page((unsigned long)cur_vmsa);

.. and that wants to be in a common helper.

> +	}
> +
> +	/* Record the current VMSA page */
> +	per_cpu(snp_vmsa, cpu) = vmsa;
> +
> +	return ret;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette