Message-ID: <CAE6NW_ZwdpGN0F_8NVe77tgGPw7nO5Mi-t1455gGoLcUVpVbpw@mail.gmail.com>
Date: Sat, 13 Dec 2025 11:11:13 -0500
From: Kevin Cheng <chengkev@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: pbonzini@...hat.com, jmattson@...gle.com, yosry.ahmed@...ux.dev,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: SVM: Don't allow L1 intercepts for instructions not advertised
On Tue, Dec 9, 2025 at 11:52 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Fri, Dec 05, 2025, Kevin Cheng wrote:
> > If a feature is not advertised in the guest's CPUID, prevent L1 from
> > intercepting the unsupported instructions by clearing the corresponding
> > intercept in KVM's cached vmcb12.
> >
> > When an L2 guest executes an instruction that is not advertised to L1,
> > we expect a #UD exception to be injected by L0. However, the nested svm
> > exit handler first checks if the instruction intercept is set in vmcb12,
> > and if so, synthesizes an exit from L2 to L1 instead of a #UD exception.
> > If a feature is not advertised, the L1 intercept should be ignored.
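> >
> > For reference, the catch-all at the end of nested_svm_intercept() looks
> > roughly like this (paraphrased from memory and trimmed, so the exact code
> > may differ):
> >
> > 	default:
> > 		if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
> > 			vmexit = NESTED_EXIT_DONE;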
> >
> > Calculate the nested intercept mask by checking all instructions that
> > can be intercepted and are controlled by a CPUID bit. Use this mask when
> > copying from the vmcb12 to KVM's cached vmcb12 to effectively ignore the
> > intercept on nested vm exit handling.
> >
> > Another option is to handle ignoring the L1 intercepts in the nested vm
> > exit code path, but I've gone with modifying the cached vmcb12 to keep
> > it simpler.
> >
> > Signed-off-by: Kevin Cheng <chengkev@...gle.com>
> > ---
> >  arch/x86/kvm/svm/nested.c | 30 +++++++++++++++++++++++++++++-
> >  arch/x86/kvm/svm/svm.c    |  2 ++
> >  arch/x86/kvm/svm/svm.h    | 14 ++++++++++++++
> >  3 files changed, 45 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index c81005b245222..f2ade24908b39 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -184,6 +184,33 @@ void recalc_intercepts(struct vcpu_svm *svm)
> >  	}
> >  }
> >
> > +/*
> > + * If a feature is not advertised to L1, set the mask bit for the corresponding
> > + * vmcb12 intercept.
> > + */
> > +void svm_recalc_nested_intercepts_mask(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +
> > +	memset(svm->nested.nested_intercept_mask, 0,
> > +	       sizeof(svm->nested.nested_intercept_mask));
> > +
> > +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_RDTSCP))
> > +		set_nested_intercept_mask(&svm->nested, INTERCEPT_RDTSCP);
> > +
> > +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SKINIT))
> > +		set_nested_intercept_mask(&svm->nested, INTERCEPT_SKINIT);
> > +
> > +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVE))
> > +		set_nested_intercept_mask(&svm->nested, INTERCEPT_XSETBV);
> > +
> > +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_RDPRU))
> > +		set_nested_intercept_mask(&svm->nested, INTERCEPT_RDPRU);
> > +
> > +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_INVPCID))
> > +		set_nested_intercept_mask(&svm->nested, INTERCEPT_INVPCID);
>
> Ugh. I don't see any reason for svm->nested.nested_intercept_mask to exist.
> guest_cpu_cap_has() is cheap (which is largely why it even exists), just sanitize
> the vmcb02 intercepts on-demand. The name is also wonky: it "sets" bits only to
> effect a "clear" of those bits.
>
> Gah, and the helpers to access/mutate intercepts can be cleaned up. E.g. if we
> do something like this:
>
> static inline void __vmcb_set_intercept(unsigned long *intercepts, u32 bit)
> {
> 	WARN_ON_ONCE(bit >= 32 * MAX_INTERCEPT);
> 	__set_bit(bit, intercepts);
> }
>
> static inline void __vmcb_clr_intercept(unsigned long *intercepts, u32 bit)
> {
> 	WARN_ON_ONCE(bit >= 32 * MAX_INTERCEPT);
> 	__clear_bit(bit, intercepts);
> }
>
> static inline bool __vmcb_is_intercept(unsigned long *intercepts, u32 bit)
> {
> 	WARN_ON_ONCE(bit >= 32 * MAX_INTERCEPT);
> 	return test_bit(bit, intercepts);
> }
>
> static inline void vmcb_set_intercept(struct vmcb_control_area *control, u32 bit)
> {
> 	__vmcb_set_intercept((unsigned long *)&control->intercepts, bit);
> }
>
> static inline void vmcb_clr_intercept(struct vmcb_control_area *control, u32 bit)
> {
> 	__vmcb_clr_intercept((unsigned long *)&control->intercepts, bit);
> }
>
> static inline bool vmcb_is_intercept(struct vmcb_control_area *control, u32 bit)
> {
> 	return __vmcb_is_intercept((unsigned long *)&control->intercepts, bit);
> }
>
> static inline void vmcb12_clr_intercept(struct vmcb_ctrl_area_cached *control, u32 bit)
> {
> 	__vmcb_clr_intercept((unsigned long *)&control->intercepts, bit);
> }
>
> static inline bool vmcb12_is_intercept(struct vmcb_ctrl_area_cached *control, u32 bit)
> {
> 	return __vmcb_is_intercept((unsigned long *)&control->intercepts, bit);
> }
>
> > +}
> > +
> >  /*
> >   * This array (and its actual size) holds the set of offsets (indexing by chunk
> >   * size) to process when merging vmcb12's MSRPM with vmcb01's MSRPM. Note, the
> > @@ -408,10 +435,11 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> >  					 struct vmcb_ctrl_area_cached *to,
> >  					 struct vmcb_control_area *from)
> >  {
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> >  	unsigned int i;
> >
> >  	for (i = 0; i < MAX_INTERCEPT; i++)
> > -		to->intercepts[i] = from->intercepts[i];
> > +		to->intercepts[i] = from->intercepts[i] & ~(svm->nested.nested_intercept_mask[i]);
>
> Then here we can use vmcb_clr_intercept(). And if we go with macro
> shenanigans, we can cut down on the boilerplate like so:
>
> #define __nested_svm_sanitize_intercept(__vcpu, __control, fname, iname) \
> do { \
> 	if (!guest_cpu_cap_has(__vcpu, X86_FEATURE_##fname)) \
> 		vmcb12_clr_intercept(__control, INTERCEPT_##iname); \
> } while (0)
>
> #define nested_svm_sanitize_intercept(__vcpu, __control, name) \
> 	__nested_svm_sanitize_intercept(__vcpu, __control, name, name)
>
> static
> void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
> 					 struct vmcb_ctrl_area_cached *to,
> 					 struct vmcb_control_area *from)
> {
> 	unsigned int i;
>
> 	for (i = 0; i < MAX_INTERCEPT; i++)
> 		to->intercepts[i] = from->intercepts[i];
>
> 	nested_svm_sanitize_intercept(vcpu, to, RDTSCP);
> 	nested_svm_sanitize_intercept(vcpu, to, SKINIT);
> 	__nested_svm_sanitize_intercept(vcpu, to, XSAVE, XSETBV);
> 	nested_svm_sanitize_intercept(vcpu, to, RDPRU);
> 	nested_svm_sanitize_intercept(vcpu, to, INVPCID);
>
> Side topic, do we care about handling the case where userspace sets CPUID after
> stuffing guest state? I'm very tempted to send a patch disallowing KVM_SET_CPUID
> if is_guest_mode() is true, and hoping no one cares.
Hmm, good point, I hadn't thought about that. Would it be better to
instead update the nested state in svm_vcpu_after_set_cpuid() if
KVM_SET_CPUID is called while is_guest_mode() is true?
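Roughly what I have in mind, just to illustrate the idea (untested
sketch; the nested_svm_sanitize_intercepts() helper name is made up):

static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	...

	/*
	 * Hypothetical: re-sanitize the cached vmcb12 intercepts in case
	 * CPUID changed while L2 is active.
	 */
	if (is_guest_mode(vcpu))
		nested_svm_sanitize_intercepts(vcpu, &svm->nested.ctl);
}
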
Also, sorry if this is a dumb question, but in general, if KVM_SET_CPUID
is disallowed, how does userspace handle the failed ioctl? Does it just
retry later, or accept that the call has failed? Or is there an error
code that signals that the vCPU is executing in guest mode, so userspace
should wait until the vCPU has left guest mode before retrying the ioctl?
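For instance, if the check ended up looking something like the below
(just my guess at the shape of it, not actual code), I'm not sure which
errno userspace would expect:

	/* Hypothetical sketch, e.g. early in the KVM_SET_CPUID{,2} handler. */
	if (is_guest_mode(vcpu))
		return -EBUSY;	/* or -EINVAL? */
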
>
> >
> >  	to->iopm_base_pa = from->iopm_base_pa;
> >  	to->msrpm_base_pa = from->msrpm_base_pa;