Message-Id: <20231109180646.2963718-1-khorenko@virtuozzo.com>
Date: Thu, 9 Nov 2023 21:06:45 +0300
From: Konstantin Khorenko <khorenko@...tuozzo.com>
To: Sean Christopherson <seanjc@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org,
Konstantin Khorenko <khorenko@...tuozzo.com>,
"Denis V. Lunev" <den@...tuozzo.com>
Subject: [PATCH 0/1] KVM: x86/vPMU: Speed up vmexit for AMD Zen 4 CPUs
We have detected a significant performance drop in our atomic test, which
measures the rate of CPUID instructions inside an L1 VM on an AMD node.

The investigation led to 2 mainstream patches which introduced extra
event accounting:

  018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
  9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")

On an AMD Zen 3 CPU they resulted in an immediate 43% drop in the CPUID
rate.

With the latest mainstream kernel the performance difference is much
smaller but still quite noticeable: 13.4%, and it shows up on AMD CPUs only.

It looks like iterating over all PMCs in kvm_pmu_trigger_event() is cheap
on Intel and expensive on AMD CPUs.

So the idea behind this patch is to skip the iteration over PMCs entirely
when the PMU is disabled for a VM, or when the PMU is enabled for a VM but
there are no active PMCs at all.
Unfortunately:
 * the current kernel code does not differentiate whether the PMU is
   globally enabled for a VM or not (pmu->version is always 1)
 * AMD CPUs older than Zen 4 do not support PMU v2, and thus an efficient
   check for enabled PMCs is not possible

=> the patch speeds up vmexit for AMD Zen 4 CPUs only, which is sad,
   but the patch does not hurt other CPUs, and this is fortunate!
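
To make the early-exit idea concrete, here is a minimal user-space model
of it. This is only a sketch, not the patch itself: struct pmu_model,
its nr_counters field and trigger_event_visits() are hypothetical names
invented for illustration, and the condition is the one quoted in
measurement (3) of this letter. It assumes that bits set in
global_ctrl_mask are the ones that do not count as enable bits, as that
condition implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few PMU fields discussed in this letter. */
struct pmu_model {
	int version;               /* 1 without PMU v2, >= 2 with it */
	uint64_t global_ctrl;      /* one enable bit per counter */
	uint64_t global_ctrl_mask; /* assumption: set bits are not enable bits */
	int nr_counters;           /* hypothetical: counters a full walk visits */
};

/*
 * Returns false only when it can be shown that no counter is enabled.
 * Without a global control register (version < 2) there is nothing cheap
 * to look at, so the answer has to be "true".
 */
static bool guest_pmu_is_enabled(const struct pmu_model *pmu)
{
	assert(pmu);
	if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
		return false;
	return true;
}

/* Number of counters that would be visited for one emulated event. */
static int trigger_event_visits(const struct pmu_model *pmu)
{
	if (!guest_pmu_is_enabled(pmu))
		return 0;               /* fast path: skip the walk entirely */
	return pmu->nr_counters;        /* slow path: walk every counter */
}
```

In this model a version 1 PMU always takes the slow path, which matches
the expectation that only Zen 4 (PMU v2) can benefit.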

I have no access to a node with an AMD Zen 4 CPU, so I had to test on an
AMD Zen 3 CPU, and I hope my expectations are right for AMD Zen 4.

I would appreciate it if anyone could run the test on a real AMD Zen 4 node.
AMD performance results:
CPU: AMD Zen 3 (three!): AMD EPYC 7443P 24-Core Processor
* The test binary is run inside an AlmaLinux 9 VM with their stock kernel
5.14.0-284.11.1.el9_2.x86_64.
* Test binary checks the CPUID instruction rate (instructions per sec).
* Default VM config (PMU is off, pmu->version is reported as 1).
* The Host runs the kernel under test.
# for i in 1 2 3 4 5 ; do ./at_cpu_cpuid.pub ; done | \
awk -e '{print $4;}' | \
cut -f1 --delimiter='.' | \
./avg.sh
Measurements:
1. Host runs stock latest mainstream kernel commit 305230142ae0.
2. Host runs same mainstream kernel + current patch.
3. Host runs same mainstream kernel + current patch + force
guest_pmu_is_enabled() to always return "false" using the following change:
- if (pmu->version >= 2 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
+ if (pmu->version == 1 && !(pmu->global_ctrl & ~pmu->global_ctrl_mask))
-----------------------------------------
| Kernels | CPUID rate |
-----------------------------------------
| 1. | 1360250 |
| 2. | 1365536 (+ 0.4%) |
| 3. | 1541850 (+13.4%) |
-----------------------------------------
Measurement (2) shows only fluctuation: the performance is not increased
because the test was done on a Zen 3 CPU, so we are unable to use the fast
check for active PMCs.
Measurement (3) shows the performance boost expected on a Zen 4 CPU under
the same test.
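
For reference, the percentages in the table are relative changes against
measurement (1). A small helper that reproduces the arithmetic (the name
pct_change is hypothetical, it is not part of the test or the patch):

```c
#include <assert.h>

/* Relative change of `value` against `base`, in percent. */
static double pct_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}
```

pct_change(1360250, 1365536) is about 0.39 and
pct_change(1360250, 1541850) is about 13.35, which round to the +0.4% and
+13.4% shown above.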
The test used:
# cat at_cpu_cpuid.pub.cpp
/*
 * The test executes CPUID instruction in a loop and reports the calls rate.
 */

#include <stdio.h>
#include <time.h>

/* #define CPUID_EAX		0x80000002 */
#define CPUID_EAX		0x29a
#define CPUID_ECX		0

#define TEST_EXEC_SECS		30	// in seconds
#define LOOPS_APPROX_RATE	1000000

static inline void cpuid(unsigned int _eax, unsigned int _ecx)
{
	unsigned int regs[4] = {_eax, 0, _ecx, 0};

	asm __volatile__(
		"cpuid"
		: "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
		:  "0" (regs[0]),  "1" (regs[1]),  "2" (regs[2]),  "3" (regs[3])
		: "memory");
}

double cpuid_rate_loops(int loops_num)
{
	int i;
	clock_t start_time, end_time;
	double spent_time, rate;

	start_time = clock();

	for (i = 0; i < loops_num; i++)
		cpuid((unsigned int)CPUID_EAX, (unsigned int)CPUID_ECX);

	end_time = clock();
	spent_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;

	rate = (double)loops_num / spent_time;

	return rate;
}

int main(int argc, char* argv[])
{
	double approx_rate, rate;
	int loops;

	/* First we detect approximate CPUIDs rate. */
	approx_rate = cpuid_rate_loops(LOOPS_APPROX_RATE);

	/*
	 * How many loops there should be in order to run the test for
	 * TEST_EXEC_SECS seconds?
	 */
	loops = (int)(approx_rate * TEST_EXEC_SECS);

	/* Get the precise instructions rate. */
	rate = cpuid_rate_loops(loops);

	printf( "CPUID instructions rate: %f instructions/second\n", rate);

	return 0;
}

Konstantin Khorenko (1):
KVM: x86/vPMU: Check PMU is enabled for vCPU before searching for PMC
arch/x86/kvm/pmu.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
--
2.39.3