Message-ID: <de6acc7e-8e7f-2c54-11cc-822df4084719@gmail.com>
Date: Tue, 23 May 2023 20:40:53 +0800
From: Like Xu <like.xu.linux@...il.com>
To: Roman Kagan <rkagan@...zon.de>
Cc: Jim Mattson <jmattson@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>, x86@...nel.org,
Eric Hankland <ehankland@...gle.com>,
linux-kernel@...r.kernel.org,
Sean Christopherson <seanjc@...gle.com>,
kvm list <kvm@...r.kernel.org>
Subject: Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width
On 4/5/2023 8:00 pm, Roman Kagan wrote:
> Performance counters are defined to have width less than 64 bits. The
> vPMU code maintains the counters in u64 variables but assumes the value
> to fit within the defined width. However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value received from the guest is
> truncated to 32 bits and then sign-extended to full 64 bits. If a
> negative value is set, it's sign-extended to 64 bits, but then in
> kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> previous value for overflow detection.
Thanks for reporting this issue. An easier-to-understand fix could be:
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e17be25de6ca..51e75f121234 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	pmc->prev_counter = pmc->counter;
+	pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
 	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
 	kvm_pmu_request_counter_reprogram(pmc);
 }
Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
around, I would prefer to apply the fix above first and then do a more
thorough cleanup based on your diff below. What do you think?
> That previous value is not truncated, so it always evaluates bigger than
> the truncated new one, and a PMI is injected. If the PMI handler writes
> a negative counter value itself, the vCPU never quits the PMI loop.
>
> Turns out that Linux PMI handler actually does write the counter with
> the value just read with RDPMC, so when no full-width support is exposed
> via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> a negative value, it locks up.
I'm not really sure what the behavioral difference is between "it locks up"
and "the vCPU never quits the PMI loop".
>
> We observed this in the field, for example, when the guest configures
> atop to use perfevents and runs two instances of it simultaneously.
A more fundamental case I found is this:
kvm_msr: msr_write CTR1 = 0xffffffffffffffea
rdpmc on host: CTR1, value 0xffffffffffe3
kvm_exit: vcpu 0 reason EXCEPTION_NMI
kvm_msr: msr_read CTR1 = 0x83 // nmi_handler
There are two typical issues here:
- the emulated counter value changed from 0xffffffffffffffea to 0xffffffffffe3,
  triggering __kvm_perf_overflow(pmc, false);
- the PMI handler should not reset the counter to a value that overflows easily,
  to avoid another overflow before the iret;
Please confirm whether your usage scenarios hit these two issues, or others as well.
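For reference, plugging the traced values into that comparison (a
back-of-the-envelope standalone sketch; the variable names are made up and
this is not derived from the kernel sources):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t prev = 0xffffffffffffffeaULL;	/* from the msr_write trace line */
	uint64_t cur  = 0xffffffffffe3ULL;	/* 48-bit value observed afterwards */

	/* once prev keeps the sign-extended bits, any wrapped 48-bit value
	 * compares lower, so __kvm_perf_overflow(pmc, false) is triggered */
	printf("overflow reported: %d\n", cur < prev);	/* prints 1 */
	return 0;
}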
>
> To address the problem, maintain the invariant that the counter value
> always fits in the defined bit width, by truncating the received value
> in the respective set_msr methods. For better readability, factor this
> out into a helper function, pmc_set_counter, shared by vmx and svm
> parts.
>
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Roman Kagan <rkagan@...zon.de>
Tested-by: Like Xu <likexu@...cent.com>
I would prefer to wrap pmc->prev_counter with pmc_bitmask(pmc) as the first step.
> ---
> arch/x86/kvm/pmu.h | 6 ++++++
> arch/x86/kvm/svm/pmu.c | 2 +-
> arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5c7bbf03b599..6a91e1afef5a 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  	return counter & pmc_bitmask(pmc);
>  }
>  
> +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc->counter += val - pmc_read_counter(pmc);
> +	pmc->counter &= pmc_bitmask(pmc);
> +}
> +
>  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>  {
>  	if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 5fa939e411d8..f93543d84cfe 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -151,7 +151,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	/* MSR_PERFCTRn */
>  	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>  	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> +		pmc_set_counter(pmc, data);
>  		pmc_update_sample_period(pmc);
>  		return 0;
>  	}
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 741efe2c497b..51354e3935d4 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -467,11 +467,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!msr_info->host_initiated &&
>  			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>  				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>  			pmc_update_sample_period(pmc);
>  			break;
>  		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>  			pmc_update_sample_period(pmc);
>  			break;
>  		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {