linux-kernel - Re: [PATCH 3/3] KVM: x86: Generalize IBRS virtualization on emulated VM-exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALMp9eR+sr-qMj8-XfN_E9DgX00YhcaX8=aZ2+efGzQ9gXtRRQ@mail.gmail.com>
Date: Fri, 21 Feb 2025 11:52:21 -0800
From: Jim Mattson <jmattson@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: x86@...nel.org, Sean Christopherson <seanjc@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, 
	Paolo Bonzini <pbonzini@...hat.com>, "Kaplan, David" <David.Kaplan@....com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] KVM: x86: Generalize IBRS virtualization on emulated VM-exit

On Fri, Feb 21, 2025 at 11:26 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Fri, Feb 21, 2025 at 11:02:16AM -0800, Jim Mattson wrote:
> > On Fri, Feb 21, 2025 at 10:39 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
> > >
> > > On Fri, Feb 21, 2025 at 09:59:04AM -0800, Jim Mattson wrote:
> > > > On Fri, Feb 21, 2025 at 8:34 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
> > > > >
> > > > > Commit 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when
> > > > > guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to
> > > > > properly virtualize IBRS by providing separate predictor modes for L1
> > > > > and L2.
> > > > >
> > > > > AMD requires similar handling, except when IbrsSameMode is enumerated by
> > > > > the host CPU (which is the case on most/all AMD CPUs). With
> > > > > IbrsSameMode, hardware IBRS is sufficient and no extra handling is
> > > > > needed from KVM.
> > > > >
> > > > > Generalize the handling in nested_vmx_vmexit() by moving it into a
> > > > > generic function, add the AMD handling, and use it in
> > > > > nested_svm_vmexit() too. The main reason for using a generic function is
> > > > > to have a single place to park the huge comment about virtualizing IBRS.
> > > > >
> > > > > Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > > > ---
> > > > >  arch/x86/kvm/svm/nested.c |  2 ++
> > > > >  arch/x86/kvm/vmx/nested.c | 11 +----------
> > > > >  arch/x86/kvm/x86.h        | 18 ++++++++++++++++++
> > > > >  3 files changed, 21 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > > index d77b094d9a4d6..61b73ff30807e 100644
> > > > > --- a/arch/x86/kvm/svm/nested.c
> > > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > > @@ -1041,6 +1041,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> > > > >
> > > > >         nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
> > > > >
> > > > > +       kvm_nested_vmexit_handle_spec_ctrl(vcpu);
> > > > > +
> > > > >         svm_switch_vmcb(svm, &svm->vmcb01);
> > > > >
> > > > >         /*
> > > > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > > > index 8a7af02d466e9..453d52a6e836a 100644
> > > > > --- a/arch/x86/kvm/vmx/nested.c
> > > > > +++ b/arch/x86/kvm/vmx/nested.c
> > > > > @@ -5018,16 +5018,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > > > >
> > > > >         vmx_switch_vmcs(vcpu, &vmx->vmcs01);
> > > > >
> > > > > -       /*
> > > > > -        * If IBRS is advertised to the vCPU, KVM must flush the indirect
> > > > > -        * branch predictors when transitioning from L2 to L1, as L1 expects
> > > > > -        * hardware (KVM in this case) to provide separate predictor modes.
> > > > > -        * Bare metal isolates VMX root (host) from VMX non-root (guest), but
> > > > > -        * doesn't isolate different VMCSs, i.e. in this case, doesn't provide
> > > > > -        * separate modes for L2 vs L1.
> > > > > -        */
> > > > > -       if (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL))
> > > > > -               indirect_branch_prediction_barrier();
> > > > > +       kvm_nested_vmexit_handle_spec_ctrl(vcpu);
> > > > >
> > > > >         /* Update any VMCS fields that might have changed while L2 ran */
> > > > >         vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
> > > > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > > > index 7a87c5fc57f1b..008c8d381c253 100644
> > > > > --- a/arch/x86/kvm/x86.h
> > > > > +++ b/arch/x86/kvm/x86.h
> > > > > @@ -116,6 +116,24 @@ static inline void kvm_leave_nested(struct kvm_vcpu *vcpu)
> > > > >         kvm_x86_ops.nested_ops->leave_nested(vcpu);
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * If IBRS is advertised to the vCPU, KVM must flush the indirect branch
> > > > > + * predictors when transitioning from L2 to L1, as L1 expects hardware (KVM in
> > > > > + * this case) to provide separate predictor modes.  Bare metal isolates the host
> > > > > + * from the guest, but doesn't isolate different guests from one another (in
> > > > > + * this case L1 and L2). The exception is if bare metal supports same mode IBRS,
> > > > > + * which offers protection within the same mode, and hence protects L1 from L2.
> > > > > + */
> > > > > +static inline void kvm_nested_vmexit_handle_spec_ctrl(struct kvm_vcpu *vcpu)
> > > >
> > > > Maybe just kvm_nested_vmexit_handle_ibrs?
> > >
> > > I was trying to use a generic name to accomodate any future handling
> > > needed for non-IBRS speculation control virtualization. But I could just
> > > be overthinking. Happy to take whatever name is agreed upon in during
> > > reviews.
> > >
> > > >
> > > > > +{
> > > > > +       if (cpu_feature_enabled(X86_FEATURE_AMD_IBRS_SAME_MODE))
> > > > > +               return;
> > > > > +
> > > > > +       if (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
> > > > > +           guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_IBRS))
> > > >
> > > > This is a bit conservative, but I don't think there's any ROI in being
> > > > more pedantic.
> > >
> > > Could you elaborate on this?
> > >
> > > Is this about doing the IBPB even if L1 does not actually execute an
> > > IBRS? I thought about this for a bit, but otherwise we'd have to
> > > intercept the MSR write IIUC, and I am not sure if that's better. Also,
> > > that's what we are already doing so I just kept it as-is.
> > >
> > > Or maybe about whether we need this on AMD only with AUTOIBRS? The APM
> > > is a bit unclear to me in this regard, but I believe may be needed even
> > > for 'normal' IBRS.
> >
> > If IA32_SPEC_CTRL.IBRS is clear at emulated VM-exit, then this IBPB is
> > unnecessary.
> >
> > However, since the host (L1) is running in a de-privileged prediction
> > domain, simply setting IA32_SPEC_CTRL.IBRS in the future won't protect
> > it from the guest (L2) that just exited. If we don't eagerly perform
> > an IBPB now, then L0 would have to intercept WRMSR(IA32_SPEC_CTRL)
> > from L1 so that we can issue an IBPB in the future, if L1 ever sets
> > IA32_SPEC_CTRL.IBRS.
>
> Right, that's what I meant by "we'd have to intercept the MSR write"
> above, but I didn't put it as eloquently as you just did :)
>
> We'd also need to have different handling for eIBRS/AUTOIBRS. It would
> basically be:
>
> if eIBRS/AUTOIBRS is enabled by L1:
>   - Do not intercept IBRS MSR writes
>   - Always IBPB on emulated VM-exits (unless IbrsSameMode).
> else if IBRS is advertised to L1:
>   - Intercept IBRS MSR writes and do an IBPB.
>   - Do not IBPB on emulated VM-exits.
>
> We'd basically have two modes of IBRS virtualization and we'd need to
> switch between them at runtime according to L1's setting of
> eIBRS/AUTOIBRS.
>
> We can simplify it if we always intercept IBRS MSR writes assuming L1
> won't do them with eIBRS/AUTOIBRS anyway, so this becomes:
>
> - On emulated VM-exits, IBPB if eIBRS/AUTOIBRS is enabled (unless
>   IbrsSameMode).
> - On IBRS MSR writes, do an IBPB.
>
> Simpler, but not sure if it buys us much.
>
> >
> > Eagerly performing an IBPB now seems like the better option.
>
> So yeah I definitely agree, unless we get regression reports caused by
> the IBPB on emulated VM-exits, and the MSR write interception turns out
> to be an improvement.

For "normal" configurations, I would expect L1 to set eIBRS or
autoIBRS on modern CPUs. In that case, the IBPB is required. However,
if L1 is a Linux OS booted with mitigations=off, the IBPB is probably
gratuitous.

On the Intel side, KVM tries to do some optimization of the case where
the guest doesn't use IA32_SPEC_CTRL. Since writes to IA32_SPEC_CTRL
are intercepted until the first non-zero value is written, I suppose
you could skip the IBPB when writes to IA32_SPEC_CTRL are intercepted,
and then just blindly perform an IBPB when disabling the intercept in
vmx_set_msr() if nested is enabled. Again, though, I don't think this
is worth it.