Message-ID: <de6acc7e-8e7f-2c54-11cc-822df4084719@gmail.com>
Date: Tue, 23 May 2023 20:40:53 +0800
From: Like Xu <like.xu.linux@...il.com>
To: Roman Kagan <rkagan@...zon.de>
Cc: Jim Mattson <jmattson@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>, x86@...nel.org,
Eric Hankland <ehankland@...gle.com>,
linux-kernel@...r.kernel.org,
Sean Christopherson <seanjc@...gle.com>,
kvm list <kvm@...r.kernel.org>
Subject: Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
On 4/5/2023 8:00 pm, Roman Kagan wrote:
> Performance counters are defined to have width less than 64 bits. The
> vPMU code maintains the counters in u64 variables but assumes the value
> to fit within the defined width. However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value received from the guest is
> truncated to 32 bits and then sign-extended to full 64 bits. If a
> negative value is set, it's sign-extended to 64 bits, but then in
> kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> previous value for overflow detection.
Thanks for reporting this issue. An easier-to-understand fix could be:
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e17be25de6ca..51e75f121234 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	pmc->prev_counter = pmc->counter;
+	pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
 	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
 	kvm_pmu_request_counter_reprogram(pmc);
 }
Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
around, I would prefer to apply the fix above first and then do a more
thorough cleanup based on your diff below. What do you think?
> That previous value is not truncated, so it always evaluates bigger than
> the truncated new one, and a PMI is injected. If the PMI handler writes
> a negative counter value itself, the vCPU never quits the PMI loop.
>
> Turns out that Linux PMI handler actually does write the counter with
> the value just read with RDPMC, so when no full-width support is exposed
> via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> a negative value, it locks up.
I'm not really sure what the behavioral difference is between "it locks up"
and "the vCPU never quits the PMI loop".
>
> We observed this in the field, for example, when the guest configures
> atop to use perfevents and runs two instances of it simultaneously.
A more fundamental case I found is this:
kvm_msr: msr_write CTR1 = 0xffffffffffffffea
rdpmc on host: CTR1, value 0xffffffffffe3
kvm_exit: vcpu 0 reason EXCEPTION_NMI
kvm_msr: msr_read CTR1 = 0x83 // nmi_handler
There are two typical issues here:
- the emulated counter value changed from 0xffffffffffffffea to 0xffffffffffe3,
  triggering __kvm_perf_overflow(pmc, false);
- the PMI handler should not reset the counter to a value that overflows easily,
  to avoid another overflow before the iret;
Please confirm whether your usage scenarios hit these two issues, or others as well.
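For reference, plugging the traced values into that comparison (a
back-of-the-envelope standalone sketch; the variable names are made up and
this is not derived from the kernel sources):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t prev = 0xffffffffffffffeaULL;	/* from the msr_write trace line */
	uint64_t cur  = 0xffffffffffe3ULL;	/* 48-bit value observed afterwards */

	/* once prev keeps the sign-extended bits, any wrapped 48-bit value
	 * compares lower, so __kvm_perf_overflow(pmc, false) is triggered */
	printf("overflow reported: %d\n", cur < prev);	/* prints 1 */
	return 0;
}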
>
> To address the problem, maintain the invariant that the counter value
> always fits in the defined bit width, by truncating the received value
> in the respective set_msr methods. For better readability, factor this
> out into a helper function, pmc_set_counter, shared by vmx and svm
> parts.
>
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Roman Kagan <rkagan@...zon.de>
Tested-by: Like Xu <likexu@...cent.com>
I would prefer to wrap pmc->prev_counter with pmc_bitmask(pmc) as the first step.
> ---
> arch/x86/kvm/pmu.h | 6 ++++++
> arch/x86/kvm/svm/pmu.c | 2 +-
> arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5c7bbf03b599..6a91e1afef5a 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  	return counter & pmc_bitmask(pmc);
>  }
>  
> +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc->counter += val - pmc_read_counter(pmc);
> +	pmc->counter &= pmc_bitmask(pmc);
> +}
> +
>  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>  {
>  	if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 5fa939e411d8..f93543d84cfe 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -151,7 +151,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	/* MSR_PERFCTRn */
>  	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>  	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> +		pmc_set_counter(pmc, data);
>  		pmc_update_sample_period(pmc);
>  		return 0;
>  	}
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 741efe2c497b..51354e3935d4 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -467,11 +467,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!msr_info->host_initiated &&
>  			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>  				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>  			pmc_update_sample_period(pmc);
>  			break;
>  		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>  			pmc_update_sample_period(pmc);
>  			break;
>  		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {