lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-Id: <6A6FFE5E-16C3-4054-837D-77D28A490C85@sjtu.edu.cn> Date: Tue, 22 Feb 2022 15:19:47 +0800 From: Dingji Li <dj_lee@...u.edu.cn> To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com> Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, dave.hansen@...el.com, luto@...nel.org, peterz@...radead.org, sathyanarayanan.kuppuswamy@...ux.intel.com, aarcange@...hat.com, ak@...ux.intel.com, dan.j.williams@...el.com, david@...hat.com, hpa@...or.com, jgross@...e.com, jmattson@...gle.com, joro@...tes.org, jpoimboe@...hat.com, knsathya@...nel.org, pbonzini@...hat.com, sdeep@...are.com, seanjc@...gle.com, tony.luck@...el.com, vkuznets@...hat.com, wanpengli@...cent.com, x86@...nel.org, linux-kernel@...r.kernel.org, Sean Christopherson <sean.j.christopherson@...el.com> Subject: Re: [PATCHv3 08/32] x86/traps: Add #VE support for TDX guest Hi all, I hope it is appropriate to ask these questions here: I'm wondering if there are any performance comparisons available between TDX guests and VMX guests. The #VE processing adds non-trivial overhead to various VM exits, but how does it affect the performance of real-world applications? Existing patches have listed alternative methods to avoid the #VE in the first place, but there are trade-offs (e.g., bloated code, reduced generality). Besides, how much does the time spent in the TDX module affect VM exits / applications? (I guess the TDX module has a low overhead when compared to the #VE processing, but there is no public data.) Maybe some performance data can help make better trade-offs? Thanks! -- Best Regards, Dingji Li > On Feb 19, 2022, at 12:16 AM, Kirill A. Shutemov <kirill.shutemov@...ux.intel.com> wrote: > > Virtualization Exceptions (#VE) are delivered to TDX guests due to > specific guest actions which may happen in either user space or the > kernel: > > * Specific instructions (WBINVD, for example) > * Specific MSR accesses > * Specific CPUID leaf accesses > * Access to unmapped pages (EPT violation) > > In the settings that Linux will run in, virtualization exceptions are > never generated on accesses to normal, TD-private memory that has been > accepted. > > Syscall entry code has a critical window where the kernel stack is not > yet set up. Any exception in this window leads to hard to debug issues > and can be exploited for privilege escalation. Exceptions in the NMI > entry code also cause issues. Returning from the exception handler with > IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. > > For these reasons, the kernel avoids #VEs during the syscall gap and > the NMI entry code. Entry code paths do not access TD-shared memory, > MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves > that might generate #VE. VMM can remove memory from TD at any point, > but access to unaccepted (or missing) private memory leads to VM > termination, not to #VE. > > Similarly to page faults and breakpoints, #VEs are allowed in NMI > handlers once the kernel is ready to deal with nested NMIs. > > During #VE delivery, all interrupts, including NMIs, are blocked until > TDGETVEINFO is called. It prevents #VE nesting until the kernel reads > the VE info. > > If a guest kernel action which would normally cause a #VE occurs in > the interrupt-disabled region before TDGETVEINFO, a #DF (fault > exception) is delivered to the guest which will result in an oops. > > Add basic infrastructure to handle any #VE which occurs in the kernel > or userspace. Later patches will add handling for specific #VE > scenarios. > > For now, convert unhandled #VE's (everything, until later in this > series) so that they appear just like a #GP by calling the > ve_raise_fault() directly. The ve_raise_fault() function is similar > to #GP handler and is responsible for sending SIGSEGV to userspace > and CPU die and notifying debuggers and other die chain users. > > Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com> > Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com> > Reviewed-by: Andi Kleen <ak@...ux.intel.com> > Reviewed-by: Tony Luck <tony.luck@...el.com> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com> > --- > arch/x86/coco/tdx.c | 60 ++++++++++++++ > arch/x86/include/asm/idtentry.h | 4 + > arch/x86/include/asm/tdx.h | 21 +++++ > arch/x86/kernel/idt.c | 3 + > arch/x86/kernel/traps.c | 138 ++++++++++++++++++++++++++------ > 5 files changed, 203 insertions(+), 23 deletions(-) > > diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c > index ea5f02a5d738..de7a02e634c2 100644 > --- a/arch/x86/coco/tdx.c > +++ b/arch/x86/coco/tdx.c > @@ -10,6 +10,7 @@ > > /* TDX module Call Leaf IDs */ > #define TDX_GET_INFO 1 > +#define TDX_GET_VEINFO 3 > > static struct { > unsigned int gpa_width; > @@ -58,6 +59,65 @@ static void get_info(void) > td_info.attributes = out.rdx; > } > > +void tdx_get_ve_info(struct ve_info *ve) > +{ > + struct tdx_module_output out; > + > + /* > + * Retrieve the #VE info from the TDX module, which also clears the "#VE > + * valid" flag. This must be done before anything else as any #VE that > + * occurs while the valid flag is set, i.e. before the previous #VE info > + * was consumed, is morphed to a #DF by the TDX module. Note, the TDX > + * module also treats virtual NMIs as inhibited if the #VE valid flag is > + * set, e.g. so that NMI=>#VE will not result in a #DF. > + */ > + tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out); > + > + ve->exit_reason = out.rcx; > + ve->exit_qual = out.rdx; > + ve->gla = out.r8; > + ve->gpa = out.r9; > + ve->instr_len = lower_32_bits(out.r10); > + ve->instr_info = upper_32_bits(out.r10); > +} > + > +/* > + * Handle the user initiated #VE. > + * > + * For example, executing the CPUID instruction from user space > + * is a valid case and hence the resulting #VE has to be handled. > + * > + * For dis-allowed or invalid #VE just return failure. > + */ > +static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve) > +{ > + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); > + return false; > +} > + > +/* Handle the kernel #VE */ > +static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve) > +{ > + pr_warn("Unexpected #VE: %lld\n", ve->exit_reason); > + return false; > +} > + > +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve) > +{ > + bool ret; > + > + if (user_mode(regs)) > + ret = virt_exception_user(regs, ve); > + else > + ret = virt_exception_kernel(regs, ve); > + > + /* After successful #VE handling, move the IP */ > + if (ret) > + regs->ip += ve->instr_len; > + > + return ret; > +} > + > void __init tdx_early_init(void) > { > u32 eax, sig[3]; > diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h > index 1345088e9902..8ccc81d653b3 100644 > --- a/arch/x86/include/asm/idtentry.h > +++ b/arch/x86/include/asm/idtentry.h > @@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback); > DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER, exc_xen_unknown_trap); > #endif > > +#ifdef CONFIG_INTEL_TDX_GUEST > +DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception); > +#endif > + > /* Device interrupts common/spurious */ > DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt); > #ifdef CONFIG_X86_LOCAL_APIC > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h > index 557227e40da9..34cf998ad534 100644 > --- a/arch/x86/include/asm/tdx.h > +++ b/arch/x86/include/asm/tdx.h > @@ -5,6 +5,7 @@ > > #include <linux/bits.h> > #include <linux/init.h> > +#include <asm/ptrace.h> > > #define TDX_CPUID_LEAF_ID 0x21 > #define TDX_IDENT "IntelTDX " > @@ -47,6 +48,22 @@ struct tdx_hypercall_args { > u64 r15; > }; > > +/* > + * Used by the #VE exception handler to gather the #VE exception > + * info from the TDX module. This is a software only structure > + * and not part of the TDX module/VMM ABI. > + */ > +struct ve_info { > + u64 exit_reason; > + u64 exit_qual; > + /* Guest Linear (virtual) Address */ > + u64 gla; > + /* Guest Physical (virtual) Address */ > + u64 gpa; > + u32 instr_len; > + u32 instr_info; > +}; > + > #ifdef CONFIG_INTEL_TDX_GUEST > > void __init tdx_early_init(void); > @@ -58,6 +75,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, > /* Used to request services from the VMM */ > u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags); > > +void tdx_get_ve_info(struct ve_info *ve); > + > +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve); > + > #else > > static inline void tdx_early_init(void) { }; > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c > index df0fa695bb09..1da074123c16 100644 > --- a/arch/x86/kernel/idt.c > +++ b/arch/x86/kernel/idt.c > @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = { > */ > INTG(X86_TRAP_PF, asm_exc_page_fault), > #endif > +#ifdef CONFIG_INTEL_TDX_GUEST > + INTG(X86_TRAP_VE, asm_exc_virtualization_exception), > +#endif > }; > > /* > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index 7ef00dee35be..b2510af38158 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -62,6 +62,7 @@ > #include <asm/insn.h> > #include <asm/insn-eval.h> > #include <asm/vdso.h> > +#include <asm/tdx.h> > > #ifdef CONFIG_X86_64 > #include <asm/x86_init.h> > @@ -611,13 +612,43 @@ static bool try_fixup_enqcmd_gp(void) > #endif > } > > +static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr, > + unsigned long error_code, const char *str) > +{ > + int ret; > + > + if (fixup_exception(regs, trapnr, error_code, 0)) > + return true; > + > + current->thread.error_code = error_code; > + current->thread.trap_nr = trapnr; > + > + /* > + * To be potentially processing a kprobe fault and to trust the result > + * from kprobe_running(), we have to be non-preemptible. > + */ > + if (!preemptible() && kprobe_running() && > + kprobe_fault_handler(regs, trapnr)) > + return true; > + > + ret = notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV); > + return ret == NOTIFY_STOP; > +} > + > +static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr, > + unsigned long error_code, const char *str) > +{ > + current->thread.error_code = error_code; > + current->thread.trap_nr = trapnr; > + show_signal(current, SIGSEGV, "", str, regs, error_code); > + force_sig(SIGSEGV); > +} > + > DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > { > char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR; > enum kernel_gp_hint hint = GP_NO_HINT; > - struct task_struct *tsk; > unsigned long gp_addr; > - int ret; > > if (user_mode(regs) && try_fixup_enqcmd_gp()) > return; > @@ -636,40 +667,21 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > return; > } > > - tsk = current; > - > if (user_mode(regs)) { > if (fixup_iopl_exception(regs)) > goto exit; > > - tsk->thread.error_code = error_code; > - tsk->thread.trap_nr = X86_TRAP_GP; > - > if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0)) > goto exit; > > - show_signal(tsk, SIGSEGV, "", desc, regs, error_code); > - force_sig(SIGSEGV); > + gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc); > goto exit; > } > > if (fixup_exception(regs, X86_TRAP_GP, error_code, 0)) > goto exit; > > - tsk->thread.error_code = error_code; > - tsk->thread.trap_nr = X86_TRAP_GP; > - > - /* > - * To be potentially processing a kprobe fault and to trust the result > - * from kprobe_running(), we have to be non-preemptible. > - */ > - if (!preemptible() && > - kprobe_running() && > - kprobe_fault_handler(regs, X86_TRAP_GP)) > - goto exit; > - > - ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV); > - if (ret == NOTIFY_STOP) > + if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc)) > goto exit; > > if (error_code) > @@ -1267,6 +1279,86 @@ DEFINE_IDTENTRY(exc_device_not_available) > } > } > > +#ifdef CONFIG_INTEL_TDX_GUEST > + > +#define VE_FAULT_STR "VE fault" > + > +static void ve_raise_fault(struct pt_regs *regs, long error_code) > +{ > + if (user_mode(regs)) { > + gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR); > + return; > + } > + > + if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR)) > + return; > + > + die_addr(VE_FAULT_STR, regs, error_code, 0); > +} > + > +/* > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to > + * specific guest actions which may happen in either user space or the > + * kernel: > + * > + * * Specific instructions (WBINVD, for example) > + * * Specific MSR accesses > + * * Specific CPUID leaf accesses > + * * Access to unmapped pages (EPT violation) > + * > + * In the settings that Linux will run in, virtualization exceptions are > + * never generated on accesses to normal, TD-private memory that has been > + * accepted. > + * > + * Syscall entry code has a critical window where the kernel stack is not > + * yet set up. Any exception in this window leads to hard to debug issues > + * and can be exploited for privilege escalation. Exceptions in the NMI > + * entry code also cause issues. Returning from the exception handler with > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. > + * > + * For these reasons, the kernel avoids #VEs during the syscall gap and > + * the NMI entry code. Entry code paths do not access TD-shared memory, > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves > + * that might generate #VE. VMM can remove memory from TD at any point, > + * but access to unaccepted (or missing) private memory leads to VM > + * termination, not to #VE. > + * > + * Similarly to page faults and breakpoints, #VEs are allowed in NMI > + * handlers once the kernel is ready to deal with nested NMIs. > + * > + * During #VE delivery, all interrupts, including NMIs, are blocked until > + * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads > + * the VE info. > + * > + * If a guest kernel action which would normally cause a #VE occurs in > + * the interrupt-disabled region before TDGETVEINFO, a #DF (fault > + * exception) is delivered to the guest which will result in an oops. > + */ > +DEFINE_IDTENTRY(exc_virtualization_exception) > +{ > + struct ve_info ve; > + > + /* > + * NMIs/Machine-checks/Interrupts will be in a disabled state > + * till TDGETVEINFO TDCALL is executed. This ensures that VE > + * info cannot be overwritten by a nested #VE. > + */ > + tdx_get_ve_info(&ve); > + > + cond_local_irq_enable(regs); > + > + /* > + * If tdx_handle_virt_exception() could not process > + * it successfully, treat it as #GP(0) and handle it. > + */ > + if (!tdx_handle_virt_exception(regs, &ve)) > + ve_raise_fault(regs, 0); > + > + cond_local_irq_disable(regs); > +} > + > +#endif > + > #ifdef CONFIG_X86_32 > DEFINE_IDTENTRY_SW(iret_error) > { > -- > 2.34.1 > >
Powered by blists - more mailing lists