linux-kernel - Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

Message-ID: <20220225193059.6zn6owzpbggxfqqv@black.fi.intel.com>
Date:   Fri, 25 Feb 2022 22:30:59 +0300
From:   "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        luto@...nel.org, peterz@...radead.org,
        sathyanarayanan.kuppuswamy@...ux.intel.com, aarcange@...hat.com,
        ak@...ux.intel.com, dan.j.williams@...el.com, david@...hat.com,
        hpa@...or.com, jgross@...e.com, jmattson@...gle.com,
        joro@...tes.org, jpoimboe@...hat.com, knsathya@...nel.org,
        pbonzini@...hat.com, sdeep@...are.com, seanjc@...gle.com,
        tony.luck@...el.com, vkuznets@...hat.com, wanpengli@...cent.com,
        thomas.lendacky@....com, brijesh.singh@....com, x86@...nel.org,
        linux-kernel@...r.kernel.org,
        Sean Christopherson <sean.j.christopherson@...el.com>
Subject: Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Feb 24, 2022 at 10:36:02AM -0800, Dave Hansen wrote:
> On 2/24/22 07:56, Kirill A. Shutemov wrote:
> > Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > specific guest actions which may happen in either user space or the
> > kernel:
> > 
> >  * Specific instructions (WBINVD, for example)
> >  * Specific MSR accesses
> >  * Specific CPUID leaf accesses
> >  * Access to unmapped pages (EPT violation)
> 
> Considering that you're talking partly about userspace, it would be nice
> to talk about what "unmapped" really means here.

I'm not sure what you want to see here. Doesn't EPT violation describe it?

It can happen in userspace too, but we don't expect it to be used there,
and we SIGSEGV the process if it happens.
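
The fallback this relies on is ve_raise_fault() in the quoted patch. Its
decision order can be modeled in user space roughly as below. This is a
minimal sketch, not kernel code: every mock_* name is hypothetical, and the
two boolean inputs stand in for user_mode(regs) and for the result of
gp_try_fixup_and_notify().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of an unhandled #VE, mirroring ve_raise_fault(). */
enum mock_outcome {
	MOCK_SIGSEGV,	/* user mode: the process gets SIGSEGV */
	MOCK_FIXED_UP,	/* kernel mode: fixup or notifier handled it */
	MOCK_DIE,	/* kernel mode: nothing handled it, oops */
};

/*
 * user_mode:   assumption, stands in for user_mode(regs)
 * fixup_found: assumption, stands in for gp_try_fixup_and_notify()
 */
enum mock_outcome mock_ve_raise_fault(bool user_mode, bool fixup_found)
{
	/* User space never reaches the kernel fixup path. */
	if (user_mode)
		return MOCK_SIGSEGV;

	if (fixup_found)
		return MOCK_FIXED_UP;

	return MOCK_DIE;
}
```

In this model, an unhandled #VE from user space yields MOCK_SIGSEGV
regardless of the second argument.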

> > In the settings that Linux will run in, virtualization exceptions are
> > never generated on accesses to normal, TD-private memory that has been
> > accepted.
> 
> This is getting into nit territory.  But, at this point a normal reader
> has no idea what "accepted" memory is.

I will add: "(prepared to be used in the TD)". Okay?

> > @@ -58,6 +59,65 @@ static void get_info(void)
> >  	td_info.attributes = out.rdx;
> >  }
> >  
> > +void tdx_get_ve_info(struct ve_info *ve)
> > +{
> > +	struct tdx_module_output out;
> > +
> > +	/*
> > +	 * Retrieve the #VE info from the TDX module, which also clears the "#VE
> > +	 * valid" flag.  This must be done before anything else as any #VE that
> > +	 * occurs while the valid flag is set, i.e. before the previous #VE info
> > +	 * was consumed, is morphed to a #DF by the TDX module. 
> 
> 
> That's a really weird sentence.  It doesn't really parse for me.  It
> might be the misplaced comma after "consumed,".
> 
> For what it's worth, I think "i.e." and "e.g." have been over used in
> the TDX text (sorry Sean).  They lead to really weird sentence structure.
> 
> 								 Note, the TDX
> > +	 * module also treats virtual NMIs as inhibited if the #VE valid flag is
> > +	 * set, e.g. so that NMI=>#VE will not result in a #DF.
> > +	 */
> 
> Are we missing anything valuable if we just trim the comment down to
> something like:
> 
> 	/*
> 	 * Called during #VE handling to retrieve the #VE info from the
> 	 * TDX module.
>  	 *
> 	 * This should be called early in #VE handling.  A "nested"
> 	 * #VE which occurs before this will raise a #DF and is not
> 	 * recoverable.
> 	 */

This variant of the comment loses the information about the #VE-valid flag
and doesn't describe how NMIs are inhibited.

Sean proposed this wording as a reply to Thomas' questions:

http://lore.kernel.org/r/YfmlnJ6LS935AMS4@google.com

Do we need to keep the info?
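
The information in question is the behavior described by the original
comment: a #VE raised while the "#VE valid" flag is set becomes a #DF, NMIs
are treated as inhibited while the flag is set, and retrieving the #VE info
clears the flag. A minimal user-space model of just those three rules is
sketched below. All mock_* names are hypothetical, and the model assumes
nothing about the TDX module beyond what the quoted comment states.

```c
#include <assert.h>
#include <stdbool.h>

/* What the modeled TDX module does with an incoming event. */
enum mock_event {
	MOCK_DELIVER_VE,	/* #VE delivered, valid flag now set */
	MOCK_DELIVER_DF,	/* nested #VE morphed to #DF */
	MOCK_NMI_INHIBITED,	/* NMI held back while flag is set */
	MOCK_DELIVER_NMI,	/* NMI delivered normally */
};

/* Hypothetical per-vCPU state: only the "#VE valid" flag is modeled. */
struct mock_tdx_module {
	bool ve_valid;
};

/* A #VE condition occurs in the guest. */
enum mock_event mock_raise_ve(struct mock_tdx_module *m)
{
	/* Previous #VE info not consumed yet: this one becomes #DF. */
	if (m->ve_valid)
		return MOCK_DELIVER_DF;

	m->ve_valid = true;
	return MOCK_DELIVER_VE;
}

/* An NMI becomes pending for the guest. */
enum mock_event mock_raise_nmi(const struct mock_tdx_module *m)
{
	return m->ve_valid ? MOCK_NMI_INHIBITED : MOCK_DELIVER_NMI;
}

/* Models retrieving the #VE info, which clears the valid flag. */
void mock_get_veinfo(struct mock_tdx_module *m)
{
	m->ve_valid = false;
}
```

The model shows why the NMI rule matters: without it, an NMI handler that
triggers a #VE inside the window would hit the #DF case.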

> For what it's worth, I don't think we care who "morphs" things.  We just
> care about the fallout.
> 
> > +	tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
> 
> How about a one-liner below here:
> 
> 	/* Interrupts and NMIs can be delivered again. */
> 
> > +	ve->exit_reason = out.rcx;
> > +	ve->exit_qual   = out.rdx;
> > +	ve->gla         = out.r8;
> > +	ve->gpa         = out.r9;
> > +	ve->instr_len   = lower_32_bits(out.r10);
> > +	ve->instr_info  = upper_32_bits(out.r10);
> > +}
> > +
> > +/*
> > + * Handle the user initiated #VE.
> > + *
> > + * For example, executing the CPUID instruction from user space
> > + * is a valid case and hence the resulting #VE has to be handled.
> > + *
> > + * For dis-allowed or invalid #VE just return failure.
> > + */
> 
> This is just insane to have in the series at this point.  It says that
> the "#VE has to be handled" and then doesn't handle it!

I can't see why it's a big deal, but okay.

> > +static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > +	return false;
> > +}
> > +
> > +/* Handle the kernel #VE */
> > +static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> > +	return false;
> > +}
> > +
> > +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > +	bool ret;
> > +
> > +	if (user_mode(regs))
> > +		ret = virt_exception_user(regs, ve);
> > +	else
> > +		ret = virt_exception_kernel(regs, ve);
> > +
> > +	/* After successful #VE handling, move the IP */
> > +	if (ret)
> > +		regs->ip += ve->instr_len;
> > +
> > +	return ret;
> > +}
> 
> At this point in the series, these three functions can be distilled down to:
> 
> bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> {
> 	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> 
> 	return false;
> }

I will do as you want, but I don't feel it is right.

The patch adds a little more infrastructure that makes the following
patches cleaner.
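
To illustrate what the three-function layout buys, here is a user-space
sketch of the same dispatch shape once the handlers do real work. It is not
the code from later patches: the mock_* types are hypothetical stand-ins
for struct pt_regs and struct ve_info, and the choice of which exit reasons
each handler accepts is an assumption made only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VMX basic exit reasons used for illustration. */
#define MOCK_EXIT_REASON_CPUID	10
#define MOCK_EXIT_REASON_HLT	12

/* Hypothetical stand-in for struct pt_regs. */
struct mock_regs {
	uint64_t ip;
	bool user;	/* stands in for user_mode(regs) */
};

/* Hypothetical stand-in for struct ve_info. */
struct mock_ve_info {
	uint64_t exit_reason;
	uint32_t instr_len;
};

/* Assumption: user space may only trigger CPUID. */
static bool mock_virt_exception_user(struct mock_ve_info *ve)
{
	return ve->exit_reason == MOCK_EXIT_REASON_CPUID;
}

/* Assumption: the kernel may trigger CPUID and HLT. */
static bool mock_virt_exception_kernel(struct mock_ve_info *ve)
{
	return ve->exit_reason == MOCK_EXIT_REASON_CPUID ||
	       ve->exit_reason == MOCK_EXIT_REASON_HLT;
}

bool mock_handle_virt_exception(struct mock_regs *regs,
				struct mock_ve_info *ve)
{
	bool ret;

	if (regs->user)
		ret = mock_virt_exception_user(ve);
	else
		ret = mock_virt_exception_kernel(ve);

	/* After successful #VE handling, skip the emulated instruction. */
	if (ret)
		regs->ip += ve->instr_len;

	return ret;
}
```

With this shape, a later patch only has to add a case to one of the two
handlers; the user/kernel split and the IP advance stay in one place.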


> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +
> > +#define VE_FAULT_STR "VE fault"
> > +
> > +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> > +{
> > +	if (user_mode(regs)) {
> > +		gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
> > +		return;
> > +	}
> > +
> > +	if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
> > +		return;
> > +
> > +	die_addr(VE_FAULT_STR, regs, error_code, 0);
> > +}
> > +
> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + *  * Specific instructions (WBINVD, for example)
> > + *  * Specific MSR accesses
> > + *  * Specific CPUID leaf accesses
> > + *  * Access to unmapped pages (EPT violation)
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> 
> This actually makes a lot more sense as a code comment than changelog.
> It would be really nice to circle back here and actually refer to the
> functions that accept memory.

We don't have such functions at this point in the patchset. Do you want
the comment to be updated once they are introduced?
> 
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE. VMM can remove memory from TD at any point,
> > + * but access to unaccepted (or missing) private memory leads to VM
> > + * termination, not to #VE.
> > + *
> > + * Similarly to page faults and breakpoints, #VEs are allowed in NMI
> > + * handlers once the kernel is ready to deal with nested NMIs.
> > + *
> > + * During #VE delivery, all interrupts, including NMIs, are blocked until
> > + * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> > + * the VE info.
> > + *
> > + * If a guest kernel action which would normally cause a #VE occurs in
> > + * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> > + * exception) is delivered to the guest which will result in an oops.
> > + */
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > +	struct ve_info ve;
> > +
> > +	/*
> > +	 * NMIs/Machine-checks/Interrupts will be in a disabled state
> > +	 * till TDGETVEINFO TDCALL is executed. This ensures that VE
> > +	 * info cannot be overwritten by a nested #VE.
> > +	 */
> > +	tdx_get_ve_info(&ve);
> > +
> > +	cond_local_irq_enable(regs);
> > +
> > +	/*
> > +	 * If tdx_handle_virt_exception() could not process
> > +	 * it successfully, treat it as #GP(0) and handle it.
> > +	 */
> > +	if (!tdx_handle_virt_exception(regs, &ve))
> > +		ve_raise_fault(regs, 0);
> > +
> > +	cond_local_irq_disable(regs);
> > +}
> > +
> > +#endif
> > +
> >  #ifdef CONFIG_X86_32
> >  DEFINE_IDTENTRY_SW(iret_error)
> >  {
> 

-- 
 Kirill A. Shutemov