linux-kernel - Re: [PATCH] KVM: SVM: Don't allow L1 intercepts for instructions not advertised

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <aUGCP_n74pzTKLY6@google.com>
Date: Tue, 16 Dec 2025 08:01:03 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Kevin Cheng <chengkev@...gle.com>
Cc: pbonzini@...hat.com, jmattson@...gle.com, yosry.ahmed@...ux.dev, 
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: SVM: Don't allow L1 intercepts for instructions not advertised

On Sat, Dec 13, 2025, Kevin Cheng wrote:
> On Tue, Dec 9, 2025 at 11:52 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > On Fri, Dec 05, 2025, Kevin Cheng wrote:
> > static
> > void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> >                                          struct vmcb_ctrl_area_cached *to,
> >                                          struct vmcb_control_area *from)
> > {
> >         unsigned int i;
> >
> >         for (i = 0; i < MAX_INTERCEPT; i++)
> >                 to->intercepts[i] = from->intercepts[i];
> >
> >         nested_svm_sanitize_intercept(vcpu, to, RDTSCP);
> >         nested_svm_sanitize_intercept(vcpu, to, SKINIT);
> >         __nested_svm_sanitize_intercept(vcpu, to, XSAVE, XSETBV);
> >         nested_svm_sanitize_intercept(vcpu, to, RDPRU);
> >         nested_svm_sanitize_intercept(vcpu, to, INVPCID);
> >
> > Side topic, do we care about handling the case where userspace sets CPUID after
> > stuffing guest state?  I'm very tempted to send a patch disallowing KVM_SET_CPUID
> > if is_guest_mode() is true, and hoping no one cares.
> 
> Hmm good point I haven't thought about that. Would it be better to
> instead update the nested state in svm_vcpu_after_set_cpuid() if
> KVM_SET_CPUID is executed when is_guest_mode() is true?

Allowing CPUID to change (or VMX feature MSRs) when the vCPU is in L2 creates a
massive set of potential complications that would be all but impossible to handle.
E.g. if userspace clears features that are being used to run L2, then KVM could
end up with conflicting/nonsensical state for the vCPU.

Note, the same holds true for non-nested features, which is largely what led to
63f5a1909f9e ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is
broken") and feb627e8d6f6 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN").

> Also sorry if this is a dumb question, but in general if KVM_SET_CPUID
> is disallowed, then how does userspace handle a failed IOCTL call?

The VMM logs an error and exits.

> Do they just try again later or accept that the call has failed? Or is there
> an error code that signals that the vcpu is executing in guest mode and
> should wait until not in guest mode to call the IOCTL?

In practice, CPUID (and feature MSRs) are set very early on, when userspace is
creating vCPUs.  With one exception, any other approach simply can't work, because
as above CPUID can't be configured after the vCPU has run.  The only way for the
vCPU to be "running" L2 is if userspace stuffed guest state via KVM_SET_NESTED_STATE,
and actually changing CPUID after that point simply can't work (it'll "succeed",
but KVM provides no guarantees as to the viability of the vCPU).

The exception is that KVM allows setting _identical_ CPUID after KVM_RUN to
support QEMU's vCPU hotplug implementation, which is what commit c6617c61e8fe
("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN") is all about.

Trying to _completely_ protect userspace from itself is a fool's errand, as KVM's
granular set of ioctls for loading vCPU state and the subjectivity of what is a
"sane" vCPU model make it all but impossible to fully prevent userspace from
hurting itself without creating a maintenance nightmare.  But I can't think of
any scenario where it would be useful for userspace to set nested state, then go
back and change the vCPU model.

I'll send a patch and see what happens.  Worst case scenario we break userspace
and get to learn about crazy VMM behavior :-)