Message-ID: <Z5l6L3Hen9_Y3SGC@google.com>
Date: Tue, 28 Jan 2025 16:45:35 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Vishal Annapurve <vannapurve@...gle.com>
Cc: Dave Hansen <dave.hansen@...el.com>, x86@...nel.org, linux-kernel@...r.kernel.org, 
	pbonzini@...hat.com, erdemaktas@...gle.com, ackerleytng@...gle.com, 
	jxgao@...gle.com, sagis@...gle.com, oupton@...gle.com, pgonda@...gle.com, 
	kirill@...temov.name, dave.hansen@...ux.intel.com, linux-coco@...ts.linux.dev, 
	chao.p.peng@...ux.intel.com, isaku.yamahata@...il.com
Subject: Re: [PATCH 1/1] x86/tdx: Route safe halt execution via tdx_safe_halt

On Tue, Jan 28, 2025, Vishal Annapurve wrote:
> On Tue, Jan 28, 2025 at 2:08 PM Dave Hansen <dave.hansen@...el.com> wrote:
> > Do you have any thoughts on why nobody has hit this up to now? Are TDX
> > users not enabling PARAVIRT_XXL? Not using ACPI?
> 
> This has been a long-standing issue that shows up visibly with certain
> workloads where vCPUs hit the idle loop quite often at runtime. I was
> only recently able to spend some time properly understanding the cause.

...

> > > @@ -1083,6 +1089,15 @@ void __init tdx_early_init(void)
> > >       x86_platform.guest.enc_kexec_begin           = tdx_kexec_begin;
> > >       x86_platform.guest.enc_kexec_finish          = tdx_kexec_finish;
> > >
> > > +#ifdef CONFIG_PARAVIRT_XXL
> > > +     /*
> > > +      * HLT execution is not atomic for TDX VMs, as it generates a #VE.
> > > +      * Consequently, otherwise "safe" halt invocations, which enable
> > > +      * interrupts immediately after the HLT instruction, don't work
> > > +      * for TDX VMs.
> > > +      */
> > > +     pv_ops.irq.safe_halt = tdx_safe_halt;
> > > +#endif
> > The basic bug here was that there was a path to a hlt instruction that
> > folks didn't realize. This patch fixes the basic bug and gives us a nice
> > warning if there are additional paths that weren't imagined.
> >
> > But it doesn't really help us audit the code to make it clear that TDX
> > guest kernels _can't_ screw up hlt again the same way.  This, for
> > instance, would make it pretty clear:
> >
> > static __always_inline void native_safe_halt(void)
> > {
> >         if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> >                 tdx_safe_halt();

This incorrectly assumes the hypervisor is intercepting HLT.  If the VM is given
a slice of hardware, HLT-exiting may be disabled, in which case it's desirable
for the guest to natively execute HLT, as the latencies to get in and out of "HLT"
are lower, especially for TDX guests.  Such a VM would hopefully have MONITOR/MWAIT
available as well, but even if that were the case, the admin could select HLT for
idling.

Ugh, and I see that bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
overrides default_idle().  The kernel really shouldn't do that, because odds are
decent that any TDX guest will have direct access to HLT.  The best approach I
can think of would be to patch x86_idle() to tdx_safe_halt() if and only if a HLT
#VE is taken.  The tricky part would be delaying the update until it's safe to do
so.

As for taking a #VE, the exception itself is fine (assuming the kernel isn't off
the rails and using a trap gate :-D).  The issue is likely that RFLAGS.IF=1 on
the stack, and so the call to cond_local_irq_enable() enables IRQs before making
the hypercall.  E.g. no one has complained about #VC, because exc_vmm_communication()
doesn't enable IRQs.

Off the top of my head, I can't think of any flows that would do HLT with IRQs
fully enabled.  Even PV spinlocks use safe_halt(), e.g. in kvm_wait(), so I don't
think there's any value in trying to precisely identify that it's a safe HLT?

E.g. this should fix the immediate problem, and then ideally someone would make
TDX guests play nice with native HLT.

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2dbadf347b5f..c60659468894 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -78,6 +78,8 @@
 
 #include <asm/proto.h>
 
+#include <uapi/asm/vmx.h>
+
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
 __always_inline int is_valid_bugaddr(unsigned long addr)
@@ -1424,7 +1426,14 @@ DEFINE_IDTENTRY(exc_virtualization_exception)
         */
        tdx_get_ve_info(&ve);
 
-       cond_local_irq_enable(regs);
+       /*
+        * Don't enable IRQs on #VE due to HLT, as the HLT was likely executed
+        * in an STI-shadow, e.g. by safe_halt().  For safe HLT, IRQs need to
+        * remain disabled until the TDCALL to request HLT emulation, so that
+        * pending IRQs are correctly treated as wake events.
+        */
+       if (ve.exit_reason != EXIT_REASON_HLT)
+               cond_local_irq_enable(regs);
 
        /*
         * If tdx_handle_virt_exception() could not process
@@ -1433,7 +1442,8 @@ DEFINE_IDTENTRY(exc_virtualization_exception)
        if (!tdx_handle_virt_exception(regs, &ve))
                ve_raise_fault(regs, 0, ve.gla);
 
-       cond_local_irq_disable(regs);
+       if (ve.exit_reason != EXIT_REASON_HLT)
+               cond_local_irq_disable(regs);
 }
 
 #endif
