linux-kernel - Re: [PATCH 1/1] x86/tdx: Route safe halt execution via tdx_safe

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <slfw63tenzxur6fvfjf6ni3lp2exqcpzodgetx6p5eqchw66y3@oqjvn6ezgmod>
Date: Wed, 29 Jan 2025 14:10:26 +0200
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, 
	Dave Hansen <dave.hansen@...el.com>, x86@...nel.org, linux-kernel@...r.kernel.org, 
	pbonzini@...hat.com, erdemaktas@...gle.com, ackerleytng@...gle.com, jxgao@...gle.com, 
	sagis@...gle.com, oupton@...gle.com, pgonda@...gle.com, 
	dave.hansen@...ux.intel.com, linux-coco@...ts.linux.dev, chao.p.peng@...ux.intel.com, 
	isaku.yamahata@...il.com
Subject: Re: [PATCH 1/1] x86/tdx: Route safe halt execution via tdx_safe_halt

On Tue, Jan 28, 2025 at 04:45:35PM -0800, Sean Christopherson wrote:
> On Tue, Jan 28, 2025, Vishal Annapurve wrote:
> > On Tue, Jan 28, 2025 at 2:08 PM Dave Hansen <dave.hansen@...el.com> wrote:
> > > Do you have any thoughts on why nobody has hit this up to now? Are TDX
> > > users not enabling PARAVIRT_XXL? Not using ACPI?
> > 
> > This has been a long-standing issue which would show up visibly with
> > certain workloads where vcpus hit idle loop quite often during the
> > runtime. I was only able to recently spend some time towards
> > understanding the cause properly.
> 
> ...
> 
> > > > @@ -1083,6 +1089,15 @@ void __init tdx_early_init(void)
> > > >       x86_platform.guest.enc_kexec_begin           = tdx_kexec_begin;
> > > >       x86_platform.guest.enc_kexec_finish          = tdx_kexec_finish;
> > > >
> > > > +#ifdef CONFIG_PARAVIRT_XXL
> > > > +     /*
> > > > +      * halt instruction execution is not atomic for TDX VMs as it generates
> > > > +      * #VEs, so otherwise "safe" halt invocations which cause interrupts to
> > > > +      * get enabled right after halt instruction don't work for TDX VMs.
> > > > +      */
> > > > +     pv_ops.irq.safe_halt = tdx_safe_halt;
> > > > +#endif
> > > The basic bug here was that there was a path to a hlt instruction that
> > > folks didn't realize. This patch fixes the basic bug and gives us a nice
> > > warning if there are additional paths that weren't imagined.
> > >
> > > But it doesn't really help us audit the code to make it clear that TDX
> > > guest kernel's _can't_ screw up hlt again the same way.  This, for
> > > instance would make it pretty clear:
> > >
> > > static __always_inline void native_safe_halt(void)
> > > {
> > >         if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> > >                 tdx_safe_halt();
> 
> This incorrectly assumes the hypervisor is intercepting HLT.  If the VM is given
> a slice of hardware, HLT-exiting may be disabled, in which case it's desirable
> for the guest to natively execute HLT, as the latencies to get in and out of "HLT"
> are lower, especially for TDX guests.  Such a VM would hopefully have MONITOR/MWAIT
> available as well, but even if that were the case, the admin could select HLT for
> idling.
> 
> Ugh, and I see that bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> overrides default_idle().  The kernel really shouldn't do that, because odds are
> decent that any TDX guest will have direct access to HLT.  The best approach I
> can think of would be to patch x86_idle() to tdx_safe_halt() if and only if a HLT
> #VE is taken.  The tricky part would be delaying the update until it's safe to do
> so.

I am confused. HLT triggers #VE unconditionally in TDX guests. How would
TDX guest have direct access to HLT?

Even if it would in the future, it is going to explicit opt-in from the
guest and we can avoid setting x86_idle() for such cases.

> As for taking a #VE, the exception itself is fine (assuming the kernel isn't off
> the rails and using a trap gate :-D).  The issue is likely that RFLAGS.IF=1 on
> the stack, and so the call to cond_local_irq_enable() enables IRQs before making
> the hypercall.  E.g. no one has complained about #VC, because exc_vmm_communication()
> doesn't enable IRQs.
> 
> Off the top of my head, I can't think of any flows that would do HLT with IRQs
> fully enabled.  Even PV spinlocks use safe_halt(), e.g. in kvm_wait(), so I don't
> think there's any value in trying to precisely identify that it's a safe HLT?

I can only think of "CPU is dead" use-case of HLT where interrupts are
enabled. But I hate special-casing HLT in exc_virtualization_exception() :/

> E.g. this should fix the immediate problem, and then ideally someone would make
> TDX guests play nice with native HLT.

I've asked (some time ago) TDX module folks to provide interruptibility
state as part of the guest so we can handle STI shadow properly, not as a
hack around HLT.

The immediate problem can be addressed by fixing the BIOS to not advertise
C-states (if I read the situation right).

-- 
  Kiryl Shutsemau / Kirill A. Shutemov