Message-ID: <c6ad42a0-ab19-befd-5760-2bcc992df732@intel.com>
Date: Thu, 24 Feb 2022 10:36:02 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
luto@...nel.org, peterz@...radead.org
Cc: sathyanarayanan.kuppuswamy@...ux.intel.com, aarcange@...hat.com,
ak@...ux.intel.com, dan.j.williams@...el.com, david@...hat.com,
hpa@...or.com, jgross@...e.com, jmattson@...gle.com,
joro@...tes.org, jpoimboe@...hat.com, knsathya@...nel.org,
pbonzini@...hat.com, sdeep@...are.com, seanjc@...gle.com,
tony.luck@...el.com, vkuznets@...hat.com, wanpengli@...cent.com,
thomas.lendacky@....com, brijesh.singh@....com, x86@...nel.org,
linux-kernel@...r.kernel.org,
Sean Christopherson <sean.j.christopherson@...el.com>
Subject: Re: [PATCHv4 07/30] x86/traps: Add #VE support for TDX guest
On 2/24/22 07:56, Kirill A. Shutemov wrote:
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the
> kernel:
>
> * Specific instructions (WBINVD, for example)
> * Specific MSR accesses
> * Specific CPUID leaf accesses
> * Access to unmapped pages (EPT violation)
Considering that you're talking partly about userspace, it would be nice
to talk about what "unmapped" really means here.
> In the settings that Linux will run in, virtualization exceptions are
> never generated on accesses to normal, TD-private memory that has been
> accepted.
This is getting into nit territory. But, at this point a normal reader
has no idea what "accepted" memory is.
> Syscall entry code has a critical window where the kernel stack is not
> yet set up. Any exception in this window leads to hard to debug issues
> and can be exploited for privilege escalation. Exceptions in the NMI
> entry code also cause issues. Returning from the exception handler with
> IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
>
> For these reasons, the kernel avoids #VEs during the syscall gap and
> the NMI entry code. Entry code paths do not access TD-shared memory,
> MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> that might generate #VE. VMM can remove memory from TD at any point,
> but access to unaccepted (or missing) private memory leads to VM
> termination, not to #VE.
>
> Similarly to page faults and breakpoints, #VEs are allowed in NMI
> handlers once the kernel is ready to deal with nested NMIs.
>
> During #VE delivery, all interrupts, including NMIs, are blocked until
> TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> the VE info.
>
> If a guest kernel action which would normally cause a #VE occurs in
> the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> exception) is delivered to the guest which will result in an oops.
>
> Add basic infrastructure to handle any #VE which occurs in the kernel
> or userspace. Later patches will add handling for specific #VE
> scenarios.
>
> For now, convert unhandled #VE's (everything, until later in this
> series) so that they appear just like a #GP by calling the
> ve_raise_fault() directly. The ve_raise_fault() function is similar
> to #GP handler and is responsible for sending SIGSEGV to userspace
> and CPU die and notifying debuggers and other die chain users.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
> Reviewed-by: Andi Kleen <ak@...ux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@...el.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> ---
> arch/x86/coco/tdx.c | 60 ++++++++++++++
> arch/x86/include/asm/idtentry.h | 4 +
> arch/x86/include/asm/tdx.h | 21 +++++
> arch/x86/kernel/idt.c | 3 +
> arch/x86/kernel/traps.c | 138 ++++++++++++++++++++++++++------
> 5 files changed, 203 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 14c085930b5f..86a2f35e7308 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -10,6 +10,7 @@
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> +#define TDX_GET_VEINFO 3
>
> static struct {
> unsigned int gpa_width;
> @@ -58,6 +59,65 @@ static void get_info(void)
> td_info.attributes = out.rdx;
> }
>
> +void tdx_get_ve_info(struct ve_info *ve)
> +{
> + struct tdx_module_output out;
> +
> + /*
> + * Retrieve the #VE info from the TDX module, which also clears the "#VE
> + * valid" flag. This must be done before anything else as any #VE that
> + * occurs while the valid flag is set, i.e. before the previous #VE info
> + * was consumed, is morphed to a #DF by the TDX module.
That's a really weird sentence. It doesn't really parse for me. It
might be the misplaced comma after "consumed,".
For what it's worth, I think "i.e." and "e.g." have been overused in
the TDX text (sorry Sean). They lead to really weird sentence structure.
> + * Note, the TDX
> + * module also treats virtual NMIs as inhibited if the #VE valid flag is
> + * set, e.g. so that NMI=>#VE will not result in a #DF.
> + */
Are we missing anything valuable if we just trim the comment down to
something like:
	/*
	 * Called during #VE handling to retrieve the #VE info from the
	 * TDX module.
	 *
	 * This should be called early in #VE handling. A "nested"
	 * #VE which occurs before this will raise a #DF and is not
	 * recoverable.
	 */
For what it's worth, I don't think we care who "morphs" things. We just
care about the fallout.
> + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
How about a one-liner below here:
/* Interrupts and NMIs can be delivered again. */
> + ve->exit_reason = out.rcx;
> + ve->exit_qual = out.rdx;
> + ve->gla = out.r8;
> + ve->gpa = out.r9;
> + ve->instr_len = lower_32_bits(out.r10);
> + ve->instr_info = upper_32_bits(out.r10);
> +}
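For illustration, here is roughly what the function looks like with the trimmed comment and the one-liner folded in. This is only a userspace sketch: the typedefs, the 32-bit helper macros, the layout of struct tdx_module_output and the stubbed tdx_module_call() with its canned register values are stand-ins made up so that it builds outside the kernel. They are not the real TDX module interface.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Hypothetical stand-ins so the sketch builds outside the kernel. */
#define TDX_GET_VEINFO		3
#define lower_32_bits(n)	((u32)((n) & 0xffffffff))
#define upper_32_bits(n)	((u32)(((n) >> 16) >> 16))

/* Assumed layout; only the fields used below matter here. */
struct tdx_module_output {
	u64 rcx, rdx, r8, r9, r10, r11;
};

struct ve_info {
	u64 exit_reason;
	u64 exit_qual;
	u64 gla;	/* Guest Linear (virtual) Address */
	u64 gpa;	/* Guest Physical Address */
	u32 instr_len;
	u32 instr_info;
};

/* Stub: fills in canned values instead of issuing a real TDCALL. */
static u64 tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
			   struct tdx_module_output *out)
{
	(void)fn; (void)rcx; (void)rdx; (void)r8; (void)r9;

	out->rcx = 10;				/* made-up exit reason */
	out->rdx = 0;
	out->r8  = 0x7f0000001000ULL;		/* made-up GLA */
	out->r9  = 0x100000ULL;			/* made-up GPA */
	out->r10 = ((u64)0xabcd << 32) | 3;	/* info in high half, length in low */
	out->r11 = 0;
	return 0;
}

/*
 * Called during #VE handling to retrieve the #VE info from the
 * TDX module.
 *
 * This should be called early in #VE handling. A "nested" #VE
 * which occurs before this will raise a #DF and is not recoverable.
 */
void tdx_get_ve_info(struct ve_info *ve)
{
	struct tdx_module_output out;

	tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

	/* Interrupts and NMIs can be delivered again. */
	ve->exit_reason = out.rcx;
	ve->exit_qual   = out.rdx;
	ve->gla         = out.r8;
	ve->gpa         = out.r9;
	ve->instr_len   = lower_32_bits(out.r10);
	ve->instr_info  = upper_32_bits(out.r10);
}
```

The only behavior the sketch actually exercises is the r10 split: the low 32 bits become instr_len and the high 32 bits become instr_info.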
> +
> +/*
> + * Handle the user initiated #VE.
> + *
> + * For example, executing the CPUID instruction from user space
> + * is a valid case and hence the resulting #VE has to be handled.
> + *
> + * For dis-allowed or invalid #VE just return failure.
> + */
This is just insane to have in the series at this point. It says that
the "#VE has to be handled" and then doesn't handle it!
> +static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +/* Handle the kernel #VE */
> +static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +{
> + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> + return false;
> +}
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> +{
> + bool ret;
> +
> + if (user_mode(regs))
> + ret = virt_exception_user(regs, ve);
> + else
> + ret = virt_exception_kernel(regs, ve);
> +
> + /* After successful #VE handling, move the IP */
> + if (ret)
> + regs->ip += ve->instr_len;
> +
> + return ret;
> +}
At this point in the series, these three functions can be distilled down to:
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
	return false;
}
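As a quick sanity check of that claim, the userspace sketch below puts the posted three-function version next to the distilled one. Everything kernel-side is mocked and hypothetical: struct pt_regs is reduced to an ip plus a user flag, user_mode() just reads that flag, and pr_warn() is mapped to fprintf(). The function names handle_ve_as_posted() and handle_ve_distilled() exist only in this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mocked kernel types; layouts are illustrative only. */
struct pt_regs {
	uint64_t ip;
	bool user;	/* stand-in for what user_mode() derives from CS */
};

struct ve_info {
	uint64_t exit_reason;
	uint32_t instr_len;
};

#define pr_warn(...)	fprintf(stderr, __VA_ARGS__)

static bool user_mode(const struct pt_regs *regs)
{
	return regs->user;
}

/* As posted: two identical stubs plus a dispatcher. */
static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
	(void)regs;
	pr_warn("Unexpected #VE: %lld\n", (long long)ve->exit_reason);
	return false;
}

static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
	(void)regs;
	pr_warn("Unexpected #VE: %lld\n", (long long)ve->exit_reason);
	return false;
}

static bool handle_ve_as_posted(struct pt_regs *regs, struct ve_info *ve)
{
	bool ret;

	if (user_mode(regs))
		ret = virt_exception_user(regs, ve);
	else
		ret = virt_exception_kernel(regs, ve);

	/* Never taken at this point in the series: ret is always false. */
	if (ret)
		regs->ip += ve->instr_len;

	return ret;
}

/* Distilled: same message, same return value, IP untouched. */
static bool handle_ve_distilled(struct pt_regs *regs, struct ve_info *ve)
{
	(void)regs;
	pr_warn("Unexpected #VE: %lld\n", (long long)ve->exit_reason);
	return false;
}
```

Both variants return false and leave regs->ip alone for user and kernel mode alike, so the dispatcher and the IP adjustment can wait for the patch that adds the first real handler.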
> void __init tdx_early_init(void)
> {
> u32 eax, sig[3];
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 1345088e9902..8ccc81d653b3 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
> DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap);
> #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);
> +#endif
> +
> /* Device interrupts common/spurious */
> DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 557227e40da9..34cf998ad534 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,7 @@
>
> #include <linux/bits.h>
> #include <linux/init.h>
> +#include <asm/ptrace.h>
>
> #define TDX_CPUID_LEAF_ID 0x21
> #define TDX_IDENT "IntelTDX "
> @@ -47,6 +48,22 @@ struct tdx_hypercall_args {
> u64 r15;
> };
>
> +/*
> + * Used by the #VE exception handler to gather the #VE exception
> + * info from the TDX module. This is a software only structure
> + * and not part of the TDX module/VMM ABI.
> + */
> +struct ve_info {
> + u64 exit_reason;
> + u64 exit_qual;
> + /* Guest Linear (virtual) Address */
> + u64 gla;
> + /* Guest Physical (virtual) Address */
> + u64 gpa;
"Physical (virtual) Address"?
> + u32 instr_len;
> + u32 instr_info;
> +};
> +
> #ifdef CONFIG_INTEL_TDX_GUEST
>
> void __init tdx_early_init(void);
> @@ -58,6 +75,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> /* Used to request services from the VMM */
> u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
>
> +void tdx_get_ve_info(struct ve_info *ve);
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
> +
> #else
>
> static inline void tdx_early_init(void) { };
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index df0fa695bb09..1da074123c16 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> */
> INTG(X86_TRAP_PF, asm_exc_page_fault),
> #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
> +#endif
> };
>
> /*
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7ef00dee35be..b2510af38158 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -62,6 +62,7 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/vdso.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/x86_init.h>
> @@ -611,13 +612,43 @@ static bool try_fixup_enqcmd_gp(void)
> #endif
> }
>
> +static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
> + unsigned long error_code, const char *str)
> +{
> + int ret;
> +
> + if (fixup_exception(regs, trapnr, error_code, 0))
> + return true;
> +
> + current->thread.error_code = error_code;
> + current->thread.trap_nr = trapnr;
> +
> + /*
> + * To be potentially processing a kprobe fault and to trust the result
> + * from kprobe_running(), we have to be non-preemptible.
> + */
> + if (!preemptible() && kprobe_running() &&
> + kprobe_fault_handler(regs, trapnr))
> + return true;
> +
> + ret = notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV);
> + return ret == NOTIFY_STOP;
> +}
> +
> +static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
> + unsigned long error_code, const char *str)
> +{
> + current->thread.error_code = error_code;
> + current->thread.trap_nr = trapnr;
> + show_signal(current, SIGSEGV, "", str, regs, error_code);
> + force_sig(SIGSEGV);
> +}
> +
> DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> {
> char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> enum kernel_gp_hint hint = GP_NO_HINT;
> - struct task_struct *tsk;
> unsigned long gp_addr;
> - int ret;
>
> if (user_mode(regs) && try_fixup_enqcmd_gp())
> return;
> @@ -636,40 +667,21 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> return;
> }
>
> - tsk = current;
> -
> if (user_mode(regs)) {
> if (fixup_iopl_exception(regs))
> goto exit;
>
> - tsk->thread.error_code = error_code;
> - tsk->thread.trap_nr = X86_TRAP_GP;
> -
> if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
> goto exit;
>
> - show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
> - force_sig(SIGSEGV);
> + gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
> goto exit;
> }
>
> if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
> goto exit;
>
> - tsk->thread.error_code = error_code;
> - tsk->thread.trap_nr = X86_TRAP_GP;
> -
> - /*
> - * To be potentially processing a kprobe fault and to trust the result
> - * from kprobe_running(), we have to be non-preemptible.
> - */
> - if (!preemptible() &&
> - kprobe_running() &&
> - kprobe_fault_handler(regs, X86_TRAP_GP))
> - goto exit;
> -
> - ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
> - if (ret == NOTIFY_STOP)
> + if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
> goto exit;
>
> if (error_code)
> @@ -1267,6 +1279,86 @@ DEFINE_IDTENTRY(exc_device_not_available)
> }
> }
I'm glad the exc_general_protection() code is getting refactored and not
copied. That's nice. The refactoring really needs to be in a separate
patch, though.
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +
> +#define VE_FAULT_STR "VE fault"
> +
> +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> +{
> + if (user_mode(regs)) {
> + gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
> + return;
> + }
> +
> + if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
> + return;
> +
> + die_addr(VE_FAULT_STR, regs, error_code, 0);
> +}
> +
> +/*
> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> + * specific guest actions which may happen in either user space or the
> + * kernel:
> + *
> + * * Specific instructions (WBINVD, for example)
> + * * Specific MSR accesses
> + * * Specific CPUID leaf accesses
> + * * Access to unmapped pages (EPT violation)
> + *
> + * In the settings that Linux will run in, virtualization exceptions are
> + * never generated on accesses to normal, TD-private memory that has been
> + * accepted.
This actually makes a lot more sense as a code comment than changelog.
It would be really nice to circle back here and actually refer to the
functions that accept memory.
> + * Syscall entry code has a critical window where the kernel stack is not
> + * yet set up. Any exception in this window leads to hard to debug issues
> + * and can be exploited for privilege escalation. Exceptions in the NMI
> + * entry code also cause issues. Returning from the exception handler with
> + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> + *
> + * For these reasons, the kernel avoids #VEs during the syscall gap and
> + * the NMI entry code. Entry code paths do not access TD-shared memory,
> + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> + * that might generate #VE. VMM can remove memory from TD at any point,
> + * but access to unaccepted (or missing) private memory leads to VM
> + * termination, not to #VE.
> + *
> + * Similarly to page faults and breakpoints, #VEs are allowed in NMI
> + * handlers once the kernel is ready to deal with nested NMIs.
> + *
> + * During #VE delivery, all interrupts, including NMIs, are blocked until
> + * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
> + * the VE info.
> + *
> + * If a guest kernel action which would normally cause a #VE occurs in
> + * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
> + * exception) is delivered to the guest which will result in an oops.
> + */
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> + struct ve_info ve;
> +
> + /*
> + * NMIs/Machine-checks/Interrupts will be in a disabled state
> + * till TDGETVEINFO TDCALL is executed. This ensures that VE
> + * info cannot be overwritten by a nested #VE.
> + */
> + tdx_get_ve_info(&ve);
> +
> + cond_local_irq_enable(regs);
> +
> + /*
> + * If tdx_handle_virt_exception() could not process
> + * it successfully, treat it as #GP(0) and handle it.
> + */
> + if (!tdx_handle_virt_exception(regs, &ve))
> + ve_raise_fault(regs, 0);
> +
> + cond_local_irq_disable(regs);
> +}
> +
> +#endif
> +
> #ifdef CONFIG_X86_32
> DEFINE_IDTENTRY_SW(iret_error)
> {