Message-ID: <Z32ZjGH72WPKBMam@google.com>
Date: Tue, 7 Jan 2025 13:15:56 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Binbin Wu <binbin.wu@...ux.intel.com>
Cc: Chao Gao <chao.gao@...el.com>, Vishal Annapurve <vannapurve@...gle.com>, pbonzini@...hat.com,
kvm@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com,
adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...ux.intel.com, isaku.yamahata@...el.com, yan.y.zhao@...el.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest

On Tue, Jan 07, 2025, Binbin Wu wrote:
> On 1/7/2025 11:24 AM, Chao Gao wrote:
> > > Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
> > > being disabled to detect that it's worth trying to post a timer interrupt, but off
> > > the top of my head I'm pretty sure that's unnecessary and pointless. The
> > Commit 1714a4eb6fb0 gives an explanation:
> >
> > if only some guests isolated and others not, they would not
> > have any benefit from posted timer interrupts, and at the same time lose
> > VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
> > returns true and therefore forces kvm_can_use_hv_timer() to false.

But that's only relevant for setup. Upon expiration, if the target vCPU is in
guest mode and APICv is active, then from_timer_fn must be true, which in turn
means that hrtimers are in use.

> Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest or/and
> hlt_in_guest for non TDX guest. The code doesn't work for TDX guests.
> - Whether mwait in guest is allowed for TDX depends on the cpuid
> configuration in TD_PARAMS, not the capability of physical cpu checked by
> kvm_can_mwait_in_guest().
> - hlt for TDX is via TDVMCALL, i.e. hlt_in_guest should always be false
> for TDX guests.
>
> For TDX guests, not sure if it's worth fixing kvm_{mwait,hlt}_in_guest()
> or adding special handling (i.e., skipping the mwait/hlt-in-guest checks),
> because the VMX preemption timer can't be used anyway, in order to allow a
> housekeeping CPU to post timer interrupts for TDX vCPUs when nohz_full is
> configured, after changing APICv active to true.

Yes, but I don't think we need any TDX-specific logic beyond noting that the
VMX preemption timer can't be used. As above, consulting kvm_can_post_timer_interrupt()
in the expiration path is silly.

And after staring at all of this for a few hours, I think we should ditch the
entire lapic_timer.pending approach, because at this point it's doing more harm
than good. Tracking pending timer IRQs was primarily done to support delivery
of *all* timer interrupts when multiple interrupts were coalesced in the host,
e.g. because a vCPU was scheduled out. But that got removed by commit
f1ed0450a5fa ("KVM: x86: Remove support for reporting coalesced APIC IRQs"),
partly because posted interrupts threw a wrench in things, but also because
delivering multiple periodic interrupts in quick succession is problematic in
its own right.

With the interrupt coalescing logic gone, I can't think of any reason KVM needs
to kick the vCPU out to the main vcpu_run() loop just to deliver the interrupt,
i.e. pending interrupts before delivering them is unnecessary. E.g. conditioning
direct delivery on apicv_active in the !from_timer_fn case makes no sense, because
KVM is going to manually move the interrupt from the PIR to the IRR anyways.
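
As a sanity check on that claim, here's a minimal toy model (illustrative C
only, not KVM code; the struct and helpers below are invented for the
example) showing that the two delivery paths converge on the same vIRR state
when the vCPU is delivering to itself:

```c
#include <assert.h>

/*
 * Toy model (NOT KVM code; all names are made up): when the vCPU itself is
 * doing the delivery, posting to the PIR and then syncing into the IRR ends
 * up in exactly the same state as setting the IRR directly, so gating the
 * direct path on apicv_active buys nothing.
 */
struct toy_vcpu {
	int pir;	/* models the posted-interrupt request bit */
	int irr;	/* models the vIRR bit for the timer vector */
};

/* The "posting" path: set the PIR, then sync it into the IRR. */
static void deliver_via_pir(struct toy_vcpu *v)
{
	v->pir = 1;
	v->irr |= v->pir;	/* models the manual PIR => IRR move */
	v->pir = 0;
}

/* The "direct" path: set the IRR outright. */
static void deliver_direct(struct toy_vcpu *v)
{
	v->irr = 1;
}
```

Either way the timer vector lands in the (modeled) IRR with nothing left in
the PIR, which is the point: the end state is identical.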

I also don't like that the behavior for the posting path is subtly different than
!APICv. E.g. if an hrtimer fires while KVM is handling a write to TMICT, KVM will
deliver the interrupt if configured to post timer interrupts, but not if APICv is
disabled, because the latter will increment "pending", and "pending" will be
cleared before handling the new TMICT. Ditto for switching APIC timer modes.
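
The divergence is small enough to capture in a standalone toy model (this is
illustrative C, not KVM's code; every name below is invented for the
example):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (NOT KVM code; all names are made up) of the race described
 * above: an hrtimer expiration that coincides with a guest write to TMICT.
 */
struct toy_apic {
	int pending;	/* models lapic_timer.pending */
	int irr;	/* models the vIRR bit for the timer vector */
};

static void timer_expired(struct toy_apic *apic, bool can_post)
{
	if (can_post)
		apic->irr = 1;		/* posted straight into the vIRR */
	else
		apic->pending++;	/* deferred to the vcpu_run() loop */
}

static void write_tmict(struct toy_apic *apic)
{
	/* Starting a new count discards any not-yet-injected expiration. */
	apic->pending = 0;
}

/* Returns the number of timer interrupts the guest will still observe. */
static int race_outcome(bool can_post)
{
	struct toy_apic apic = { 0, 0 };

	timer_expired(&apic, can_post);	/* hrtimer fires... */
	write_tmict(&apic);		/* ...during the TMICT write */
	return apic.irr + apic.pending;
}
```

With posting, the expiration survives the TMICT write (the vIRR bit is
already set); with the "pending" scheme, clearing the counter silently drops
it.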

Unfortunately, there is a lot to disentangle before KVM can directly deliver all
APIC timer interrupts, as KVM has built up a fair bit of ugliness on top of "pending".
E.g. when switching back to the VMX preemption timer (after a blocking vCPU is
scheduled back in), KVM leaves the hrtimer active. start_hv_timer() accounts for
that by checking lapic_timer.pending, but leaving the hrtimer running in this case
is absurd and unnecessarily relies on "pending" being incremented.

	if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
		return false;

	ktimer->hv_timer_in_use = true;
	hrtimer_cancel(&ktimer->timer);

	/*
	 * To simplify handling the periodic timer, leave the hv timer running
	 * even if the deadline timer has expired, i.e. rely on the resulting
	 * VM-Exit to recompute the periodic timer's target expiration.
	 */
	if (!apic_lvtt_period(apic)) {
		/*
		 * Cancel the hv timer if the sw timer fired while the hv timer
		 * was being programmed, or if the hv timer itself expired.
		 */
		if (atomic_read(&ktimer->pending)) {
			cancel_hv_timer(apic);
		} else if (expired) {
			apic_timer_expired(apic, false);
			cancel_hv_timer(apic);
		}
	}

I'm also not convinced that letting the hrtimer run on a different CPU in the
"normal" case is a good idea. KVM explicitly migrates the timer, *except* for
the posting case, and so not pinning the hrtimer is likely counter-productive,
not to mention confusing.

void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
{
	struct hrtimer *timer;

	if (!lapic_in_kernel(vcpu) ||
	    kvm_can_post_timer_interrupt(vcpu))
		return;

	timer = &vcpu->arch.apic->lapic_timer.timer;
	if (hrtimer_cancel(timer))
		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD);
}

And if the hrtimer is pinned to the pCPU, then in practice it should be impossible
for KVM to post a timer IRQ into a vCPU that's in guest mode.

Speaking of which, posting a timer IRQ into a vCPU that's in guest mode is a bit
dodgy. I _think_ it's safe, because all of the flows that can be coincident are
actually mutually exclusive. E.g. apic_timer_expired()'s call to
__kvm_wait_lapic_expire() can't collide with the call from the entry path, as the
entry path's invocation happens with IRQs disabled.

But all of this makes me wary, so I'd much prefer to clean up, harden, and document
the existing code before we try to optimize timer IRQ delivery for more cases.
That's especially true for TDX, as we're already adding significant complexity
and multiple novel paths for TDX; we don't need more at this time :-)