[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYIXFmT-676oN6j0@google.com>
Date: Tue, 3 Feb 2026 07:41:10 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: Dave Hansen <dave.hansen@...el.com>, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
reinette.chatre@...el.com, ira.weiny@...el.com, kai.huang@...el.com,
dan.j.williams@...el.com, yilun.xu@...ux.intel.com, sagis@...gle.com,
vannapurve@...gle.com, paulmck@...nel.org, nik.borisov@...e.com,
zhenzhong.duan@...el.com, rick.p.edgecombe@...el.com, kas@...nel.org,
dave.hansen@...ux.intel.com, vishal.l.verma@...el.com,
Farrah Chen <farrah.chen@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v3 07/26] x86/virt/seamldr: Introduce a wrapper for
P-SEAMLDR SEAMCALLs
On Tue, Feb 03, 2026, Chao Gao wrote:
> >>> I'd be shocked if this is the one and only place in the whole kernel
> >>> that can unceremoniously zap VMX state.
> >>>
> >>> I'd *bet* that you don't really need to do the vmptrld and that KVM can
> >>> figure it out because it can vmptrld on demand anyway. Something along
> >>> the lines of:
> >>>
> >>> local_irq_disable();
> >>> list_for_each(handwaving...)
> >>> vmcs_clear();
> >>> ret = seamldr_prerr(fn, args);
> >>> local_irq_enable();
> >>>
> >>> Basically, zap this CPU's vmcs state and then make KVM reload it at some
> >>> later time.
> >>
> >> The idea is feasible. But just calling vmcs_clear() won't work. We need to
> >> reset all the tracking state associated with each VMCS. We should call
> >> vmclear_local_loaded_vmcss() instead, similar to what's done before VMXOFF.
> >>
> >>>
> >>> I'm sure Sean and Paolo will tell me if I'm crazy.
> >>
> >> To me, this approach needs more work since we need to either move
> >> vmclear_local_loaded_vmcss() to the kernel or allow KVM to register a callback.
> >>
> >> I don't think it's as straightforward as just doing the save/restore.
> >
> >Could you please just do me a favor and spend 20 minutes to see what
> >this looks like in practice and if the KVM folks hate it?
I hate it :-)
> Sure. KVM tracks the current VMCS and only executes vmptrld for a new VMCS if
> it differs from the current one. See arch/x86/kvm/vmx/vmx.c::vmx_vcpu_load_vmcs()
>
> prev = per_cpu(current_vmcs, cpu);
> if (prev != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> }
>
> By resetting current_vmcs to NULL during P-SEAMLDR calls, KVM is forced to do a
> vmptrld on the next VMCS load. So, we can implement seamldr_call() as:
>
> static int seamldr_call(u64 fn, struct tdx_module_args *args)
> {
> int ret;
>
> WARN_ON_ONCE(!is_seamldr_call(fn));
>
> /*
> * Serialize P-SEAMLDR calls since only a single CPU is allowed to
> * interact with P-SEAMLDR at a time.
> *
> * P-SEAMLDR calls invalidate the current VMCS. Exclude KVM access to
> * the VMCS by disabling interrupts. This is not safe against VMCS use
> * in NMIs, but there are none of those today.
> *
> * Set the per-CPU current_vmcs cache to NULL to force KVM to reload
> * the VMCS.
> */
> guard(raw_spinlock_irqsave)(&seamldr_lock);
> ret = seamcall_prerr(fn, args);
> this_cpu_write(current_vmcs, NULL);
>
> return ret;
> }
>
> This requires moving the per-CPU current_vmcs from KVM to the kernel, which
> should be trivial with Sean's VMXON series.
Trivial in code, but I am very strongly opposed to moving current_vmcs out of KVM.
As stated in the cover letter of the initial VMXON RFC[*]:
: Emphasis on "only", because leaving VMCS tracking and clearing in KVM is
: another key difference from Xin's series. The "light bulb" moment on that
: front is that TDX isn't a hypervisor, and isn't trying to be a hypervisor.
: Specifically, TDX should _never_ have it's own VMCSes (that are visible to the
: host; the TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there
: is simply no reason to move that functionality out of KVM.
TDX's "use" of a VMCS should be completely transparent to KVM, because otherwise
we are stepping over that line that says the TDX subsystem isn't a hypervisor.
I also really, really don't want to add a super special case rule to KVM's VMCS
tracking logic.
After reading through the rest of this discussion, I'm doubling down on that
stance, because I agree that this is decidely odd behavior.
Pulling in two other threads from this discussion:
On Wed, Jan 28, 2026 at 3:05 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 1/23/26 06:55, Chao Gao wrote:
> > SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> > to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR
> > using SEAMCALL must reload the current-VMCS, if required, using the
> > VMPTRLD instruction.
>
> That seems pretty mean.
>
> This is going to need a lot more justification for why this is an
> absolutely necessary requirement.
>
> KVM folks, are you OK with this?
As above, I'm definitely not ok with the current VMCS being zapped out from
underneath KVM. As to whether or not I'm ok with the P-SEAMLDR behavior, I would
say that's more of a question for you, as it will fall on the TDX subsytem to
workaround the bug/quirk.
On Fri, Jan 30, 2026 at 8:23 AM Dave Hansen <dave.hansen@...el.com> wrote:
> On 1/30/26 00:08, Chao Gao wrote:
> > AFAIK, this is a CPU implementation issue. The actual requirement is to
> > evict (flush and invalidate) all VMCSs __cached in SEAM mode__, but big
> > cores implement this by evicting the __entire__ VMCS cache. So, the
> > current VMCS is invalidated and cleared.
>
> But why is this a P-SEAMLDR thing and not a TDX module thing?
My guess is that it's because the P-SEAMLDR code loads and prepares the new TDX-
Module by constructing the VMCS used for SEAMCALL using direct writes to memory
(unless that TDX behavior has changed in the last few years). And so it needs
to ensure that in-memory representation is synchronized with the VMCS cache.
Hmm, but that doesn't make sense _if_ it really truly is SEAMRET that does the VMCS
cache invalidation, because flushing the VMCS cache would ovewrite the in-memory
state.
> It seems like a bug, or at least a P-SEAMLDR implementation issue the
> needs to get fixed.
Yeah, 'tis odd behavior. IMO, that's all the more reason the TDX subsystem should
hide the quirk from the rest of the kernel.
[*] https://lore.kernel.org/all/20251010220403.987927-1-seanjc@google.com
Powered by blists - more mailing lists