Message-ID: <29CDF294-5394-47C7-8B50-5F1FC101891C@nutanix.com>
Date: Wed, 18 May 2022 14:23:03 +0000
From: Jon Kohler <jon@...anix.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Jon Kohler <jon@...anix.com>, Paolo Bonzini <pbonzini@...hat.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Josh Poimboeuf <jpoimboe@...hat.com>,
Kees Cook <keescook@...omium.org>,
Waiman Long <longman@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: VMX: do not disable interception for
MSR_IA32_SPEC_CTRL on eIBRS
> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Thu, May 12, 2022, Jon Kohler wrote:
>> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
>> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
>> the MSR bitmap.
>>
>> eIBRS enabled guests using just IBRS will only write the SPEC_CTRL
>> MSR once or twice per vCPU on boot, so it is far better to take those
>> VM exits on boot than to read and save this MSR on every single
>> VM exit forever. This outcome was suggested in Andrea's commit
>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
>> however, since interception is still unilaterally disabled, the rdmsr
>> tax is still there even after that commit.
>>
>> This is a significant win for eIBRS enabled systems as this rdmsr
>> accounts for roughly ~50% of time for vmx_vcpu_run() as observed
>> by perf top disassembly, and is in the critical path for all
>> VM-Exits, including fastpath exits.
>>
>> Update relevant comments in vmx_vcpu_run() with appropriate SDM
>> references for future onlookers.
>>
>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>> Signed-off-by: Jon Kohler <jon@...anix.com>
>> Cc: Andrea Arcangeli <aarcange@...hat.com>
>> Cc: Kees Cook <keescook@...omium.org>
>> Cc: Josh Poimboeuf <jpoimboe@...hat.com>
>> Cc: Waiman Long <longman@...hat.com>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>> 1 file changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d58b763df855..d9da6fcecd8c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> if (kvm_spec_ctrl_test_value(data))
>> return 1;
>>
>> + /*
>> + * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
>> + * only needs to be set once and can be left on forever without
>> + * needing to be constantly toggled. If the guest attempts to
>> + * write that value, let's not disable interception. Guests
>> + * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
>> + * once per vCPU per boot.
>> + *
>> + * The guest can still use other SPEC_CTRL features on top of
>> + * eIBRS such as SSBD, and we should disable interception for
>
> Please don't add comments that say "should" or "need", just state what is being
> done. Just because a comment says XYZ should do something doesn't necessarily
> mean that XYZ actually does that thing.
>
>> + * those situations to avoid a multitude of VM-Exits;
>> + * however, we will need to check SPEC_CTRL on each exit to
>> + * make sure we restore the host value properly.
>> + */
>
> This whole comment can be more succint. Better yet, merge it with the comment
> below (out of sight in this diff) and opportunistically update that comment to
> reflect what actually happens if L2 is the first to write a non-zero value (arguably
> a bug that should be fixed, but meh). The IBPB MSR has the same flaw. :-/
>
>> + if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
>
> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> goes away.
>
>> + vmx->spec_ctrl = data;
>> + break;
>> + }
>
> There's no need for a separate if statement. And the boot_cpu_has() check can
> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> (unless you're worried about bit 0 being used for something else?)
>
>> +
>> vmx->spec_ctrl = data;
>> if (!data)
>> break;
>> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> vmx_vcpu_enter_exit(vcpu, vmx);
>>
>> /*
>> - * We do not use IBRS in the kernel. If this vCPU has used the
>> - * SPEC_CTRL MSR it may have left it on; save the value and
>> - * turn it off. This is much more efficient than blindly adding
>> - * it to the atomic save/restore list. Especially as the former
>> - * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> - *
>> - * For non-nested case:
>> - * If the L01 MSR bitmap does not intercept the MSR, then we need to
>> - * save it.
>> + * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
>> + * To prevent constant VM exits for SPEC_CTRL, the kernel may
>
> Please wrap at 80 chars (ignore the bad examples in KVM).
>
>> + * disable interception in the MSR bitmap for SPEC_CTRL MSR,
>> + * such that the guest can read and write to that MSR without
>> + * trapping to KVM; however, the guest may set a different
>> + * value than the host. For exit handling, do rdmsr below if
>> + * interception is disabled, such that we can save the guest
>> + * value for restore on VM entry, as it does not get saved
>> + * automatically per SDM 27.3.1.
>> *
>> - * For nested case:
>> - * If the L02 MSR bitmap does not intercept the MSR, then we need to
>> - * save it.
>> + * This behavior is optimized on eIBRS enabled systems, such
>> + * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
>> + * when guests choose to use additional SPEC_CTRL features
>> + * above and beyond IBRS, such as STIBP or SSBD. This
>> + * optimization allows the kernel to avoid doing the expensive
>> + * rdmsr below.
>
> I don't see any reason to restate why the MSR may or may not be intercepted, just
> explain when the value needs to be read.
>
> E.g. something like this for the whole patch?
Sean - Thanks for the feedback as always, good stuff. I’ll pull this together,
double-check the testing side, and send out a v2 with all of this incorporated.
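For anyone skimming the thread, the resulting write-side policy can be
summarized as below. This is a standalone sketch of the decision logic only,
not actual KVM code, and keep_intercepting_spec_ctrl() is a made-up helper
name for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_SPEC_CTRL bit layout, per the Intel SDM. */
#define SPEC_CTRL_IBRS  (1ULL << 0)
#define SPEC_CTRL_STIBP (1ULL << 1)
#define SPEC_CTRL_SSBD  (1ULL << 2)

/*
 * Keep intercepting SPEC_CTRL writes if the guest writes zero or sets
 * only IBRS, since an eIBRS-aware guest typically writes SPEC_CTRL_IBRS
 * once at boot and never touches it again.  Any other non-zero value
 * (STIBP, SSBD, etc.) is often toggled per-task, so interception is
 * disabled for those and the guest value is read back on VM-Exit.
 */
static bool keep_intercepting_spec_ctrl(uint64_t data)
{
	return data == 0 || data == SPEC_CTRL_IBRS;
}
```

The trade-off is exactly the one stated in Sean's suggested comment: one or
two extra VM-Exits at guest boot versus an rdmsr on every exit forever.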
>
> ---
> arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 610355b9ccce..70d863a7882d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 1;
>
> vmx->spec_ctrl = data;
> - if (!data)
> +
> + /*
> + * Disable interception on the first non-zero write, unless the
> + * guest is setting only SPEC_CTRL_IBRS, which is typically set
> + * once at boot and never touched again. All other bits are
> + * often set on a per-task basis, i.e. may change frequently,
> + * so the benefit of avoiding VM-exits during guest context
> + * switches outweighs the cost of RDMSR on every VM-Exit to
> + * save the guest's value.
> + */
> + if (!data || data == SPEC_CTRL_IBRS)
> break;
>
> /*
> - * For non-nested:
> - * When it's written (to non-zero) for the first time, pass
> - * it through.
> - *
> - * For nested:
> - * The handling of the MSR bitmap for L2 guests is done in
> - * nested_vmx_prepare_msr_bitmap. We should not touch the
> - * vmcs02.msr_bitmap here since it gets completely overwritten
> - * in the merging. We update the vmcs01 here for L1 as well
> - * since it will end up touching the MSR anyway now.
> + * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
> + * interception for the vCPU on the first write regardless of
> + * whether the WRMSR came from L1 or L2. vmcs02's bitmap is a
> + * combination of vmcs01 and vmcs12 bitmaps, and will be
> + * recomputed by nested_vmx_prepare_msr_bitmap() on the next
> + * nested VM-Enter. Note, this does mean that future WRMSRs
> + * from L2 will be intercepted until the next nested VM-Exit if
> + * L2 was the first to write, but L1 exposing the MSR to L2
> + * without first writing it is unlikely and not worth the
> + * extra bit of complexity.
> */
> vmx_disable_intercept_for_msr(vcpu,
> MSR_IA32_SPEC_CTRL,
> @@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>
> /*
> - * For non-nested:
> - * When it's written (to non-zero) for the first time, pass
> - * it through.
> - *
> - * For nested:
> - * The handling of the MSR bitmap for L2 guests is done in
> - * nested_vmx_prepare_msr_bitmap. We should not touch the
> - * vmcs02.msr_bitmap here since it gets completely overwritten
> - * in the merging.
> + * Disable interception on the first IBPB, odds are good IBPB
> + * will be a frequent guest action. See the comment for
> + * MSR_IA32_SPEC_CTRL for details on the nested interaction.
> */
> vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> break;
> @@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> vmx_vcpu_enter_exit(vcpu, vmx);
>
> /*
> - * We do not use IBRS in the kernel. If this vCPU has used the
> - * SPEC_CTRL MSR it may have left it on; save the value and
> - * turn it off. This is much more efficient than blindly adding
> - * it to the atomic save/restore list. Especially as the former
> - * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> - *
> - * For non-nested case:
> - * If the L01 MSR bitmap does not intercept the MSR, then we need to
> - * save it.
> - *
> - * For nested case:
> - * If the L02 MSR bitmap does not intercept the MSR, then we need to
> - * save it.
> + * Save SPEC_CTRL if it may have been written by the guest, the current
> + * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
> + * WRMSR if the current value matches the host's desired value.
> */
> if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
> vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
>
> base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
> --