Message-ID: <2082c244-2e8e-48b3-8b8e-59b25f5ff1b4@gmail.com>
Date: Tue, 2 Dec 2025 10:19:11 +0800
From: Like Xu <like.xu.linux@...il.com>
To: Fernand Sieber <sieberf@...zon.com>
Cc: Jan H. Schönherr <jschoenh@...zon.de>, x86@...nel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, dwmw@...zon.co.uk,
hborghor@...zon.de, nh-open-source@...zon.com, abusse@...zon.de,
nsaenz@...zon.com, seanjc@...gle.com, pbonzini@...hat.com
Subject: Re: [PATCH] KVM: x86/pmu: Do not accidentally create BTS events
On 12/1/25 10:23 PM, Fernand Sieber wrote:
> From: Jan H. Schönherr <jschoenh@...zon.de>
>
> It is possible to degrade host performance by manipulating performance
> counters from a VM and tricking the host hypervisor to enable branch
> tracing. When the guest programs a CPU to track branch instructions and
> deliver an interrupt after exactly one branch instruction, the value one
> is handled by the host KVM/perf subsystems and treated incorrectly as a
> special value to enable the branch trace store (BTS) subsystem. It
Based on my observations of PMU users, this is treated as a feature (using
the PMC path to trigger BTS: generating a sample for each branch already
makes it functionally and implementation-wise identical to BTS), and it
undoubtedly harms performance (just like other trace-based PMU facilities).
[*] perf record -e branches:u -c 1 -d ls
> should not be possible to enable BTS from a guest. When BTS is enabled,
> it leads to general performance degradation for both the host and its VMs.
>
> Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with
> a sample_period of 1 a special case and handles this as a BTS event (see
> intel_pmu_has_bts_period()) -- a deviation from the usual semantic,
> where the sample_period represents the amount of branch instructions to
> encounter before the overflow handler is invoked.
>
> Nothing prevents a guest from programming its vPMU with the above
> settings (count branch, interrupt after one branch), which causes KVM to
> erroneously instruct perf to create a BTS event within
> pmc_reprogram_counter(), which does not have the desired semantics.
>
> The guest could also do more benign actions and request an interrupt
> after a more reasonable number of branch instructions via its vPMU. In
> that case counting works initially. However, KVM occasionally pauses and
> resumes the created performance counters. If the remaining number of
> branch instructions until the interrupt is exactly 1,
> pmc_resume_counter() fails to resume the counter, and a BTS event is
> created instead, with its incorrect semantics.
>
> Fix this behavior by not passing the special value "1" as sample_period
> to perf. Instead, perform the same quirk that happens later in
> x86_perf_event_set_period() anyway, when the performance counter is
> transferred to the actual PMU: bump the sample_period to 2.
>
> Testing:
> From guest:
> `./wrmsr -p 12 0x186 0x1100c4`
> `./wrmsr -p 12 0xc1 0xffffffffffff`
> `./wrmsr -p 12 0x186 0x5100c4`
>
> This sequence sets up branch instruction counting with the PMI armed,
> initializes the counter to overflow after one event (0xffffffffffff), and
> then sets the enable bit (bit 22) to start counting.
>
> ./wrmsr -p 12 0x186 0x1100c4
> Writes to IA32_PERFEVTSEL0 (0x186)
> Value 0x1100c4 breaks down as:
> Event = 0xC4 (Branch instructions)
> Bit 16: 1 (User mode only)
> Bit 20: 1 (PMI on overflow)
> The enable bit (bit 22) is still clear, so the counter is not yet running
>
> ./wrmsr -p 12 0xc1 0xffffffffffff
> Writes to IA32_PMC0 (0xC1)
> Sets counter to maximum value (0xffffffffffff)
> This effectively sets up the counter to overflow on the next branch
>
> ./wrmsr -p 12 0x186 0x5100c4
> Updates IA32_PERFEVTSEL0 again
> Same as the first command but adds bit 22 (top nibble 0x1 -> 0x5)
> Enables the counter (bit 22)
>
> These MSR writes are trapped by the hypervisor in KVM and forwarded to
> the perf subsystem to create corresponding monitoring events.
>
> It is possible to repro this problem in a more realistic guest scenario:
>
> `perf record -e branches:u -c 2 -a &`
> `perf record -e branches:u -c 2 -a &`
In this reproduction case, is there any unexpected memory corruption
(related to an unallocated BTS buffer)?
>
> This presumably triggers the issue by KVM pausing and resuming the
> performance counter at the wrong moment, when its value is about to
> overflow.
>
> Signed-off-by: Jan H. Schönherr <jschoenh@...zon.de>
> Signed-off-by: Fernand Sieber <sieberf@...zon.com>
> Reviewed-by: David Woodhouse <dwmw@...zon.co.uk>
> Reviewed-by: Hendrik Borghorst <hborghor@...zon.de>
> Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
> ---
> arch/x86/kvm/pmu.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 487ad19a236e..547512028e24 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> {
> u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
>
> + /*
> + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> + * via pmc_resume_counter(), followed by the accidental creation of an
> + * actual BTS event, which we do not want.
> + *
> + * Avoid this by bumping the sampling period. Note, that we do not lose
> + * any precision, because the same quirk happens later anyway (for
> + * different reasons) in x86_perf_event_set_period().
> + */
> + if (sample_period == 1)
> + sample_period = 2;
Even without a PERF_COUNT_HW_BRANCH_INSTRUCTIONS event check?
> +
> if (!sample_period)
> sample_period = pmc_bitmask(pmc) + 1;
> return sample_period;