Message-ID: <2082c244-2e8e-48b3-8b8e-59b25f5ff1b4@gmail.com>
Date: Tue, 2 Dec 2025 10:19:11 +0800
From: Like Xu <like.xu.linux@...il.com>
To: Fernand Sieber <sieberf@...zon.com>
Cc: Jan H. Schönherr <jschoenh@...zon.de>, x86@...nel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, dwmw@...zon.co.uk,
hborghor@...zon.de, nh-open-source@...zon.com, abusse@...zon.de,
nsaenz@...zon.com, seanjc@...gle.com, pbonzini@...hat.com
Subject: Re: [PATCH] KVM: x86/pmu: Do not accidentally create BTS events
On 12/1/25 10:23 PM, Fernand Sieber wrote:
> From: Jan H. Schönherr <jschoenh@...zon.de>
>
> It is possible to degrade host performance by manipulating performance
> counters from a VM and tricking the host hypervisor to enable branch
> tracing. When the guest programs a CPU to track branch instructions and
> deliver an interrupt after exactly one branch instruction, the value one
> is handled by the host KVM/perf subsystems and treated incorrectly as a
> special value to enable the branch trace store (BTS) subsystem. It
Based on my observations of PMU users, this is treated as a feature (using
the PMC path to trigger BTS: generating a sample for each branch already
makes it functionally and implementation-wise identical to BTS), and it
undoubtedly harms performance (just like other trace-based PMU facilities).
[*] perf record -e branches:u -c 1 -d ls
> should not be possible to enable BTS from a guest. When BTS is enabled,
> it leads to general performance degradation for both the host and its VMs.
>
> Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with
> a sample_period of 1 a special case and handles this as a BTS event (see
> intel_pmu_has_bts_period()) -- a deviation from the usual semantic,
> where the sample_period represents the amount of branch instructions to
> encounter before the overflow handler is invoked.
>
> Nothing prevents a guest from programming its vPMU with the above
> settings (count branch, interrupt after one branch), which causes KVM to
> erroneously instruct perf to create a BTS event within
> pmc_reprogram_counter(), which does not have the desired semantics.
>
> The guest could also do more benign actions and request an interrupt
> after a more reasonable number of branch instructions via its vPMU. In
> that case counting works initially. However, KVM occasionally pauses and
> resumes the created performance counters. If the remaining number of
> branch instructions until the interrupt is exactly 1,
> pmc_resume_counter() fails to resume the counter, and a BTS event is
> created instead, with its incorrect semantics.
>
> Fix this behavior by not passing the special value "1" as sample_period
> to perf. Instead, perform the same quirk that happens later in
> x86_perf_event_set_period() anyway, when the performance counter is
> transferred to the actual PMU: bump the sample_period to 2.
>
> Testing:
> From guest:
> `./wrmsr -p 12 0x186 0x1100c4`
> `./wrmsr -p 12 0xc1 0xffffffffffff`
> `./wrmsr -p 12 0x186 0x5100c4`
>
> This sequence sets up branch instruction counting with the PMI armed,
> initializes the counter to overflow after one event (0xffffffffffff), and
> then sets the enable bit (bit 22) to start counting.
>
> ./wrmsr -p 12 0x186 0x1100c4
> Writes to IA32_PERFEVTSEL0 (0x186)
> Value 0x1100c4 breaks down as:
> Event = 0xC4 (Branch instructions)
> Bit 16: 1 (User mode only)
> Bit 20: 1 (PMI on overflow)
> The enable bit (bit 22) is still clear, so the counter is not yet running
>
> ./wrmsr -p 12 0xc1 0xffffffffffff
> Writes to IA32_PMC0 (0xC1)
> Sets counter to maximum value (0xffffffffffff)
> This effectively sets up the counter to overflow on the next branch
>
> ./wrmsr -p 12 0x186 0x5100c4
> Updates IA32_PERFEVTSEL0 again
> Same as the first command but adds bit 22 (top nibble 0x1 -> 0x5)
> Enables the counter (bit 22)
>
> These MSR writes are trapped by the hypervisor in KVM and forwarded to
> the perf subsystem to create corresponding monitoring events.
>
> It is possible to repro this problem in a more realistic guest scenario:
>
> `perf record -e branches:u -c 2 -a &`
> `perf record -e branches:u -c 2 -a &`
In this reproduction case, is there any unexpected memory corruption
(related to an unallocated BTS buffer)?
>
> This presumably triggers the issue by KVM pausing and resuming the
> performance counter at the wrong moment, when its value is about to
> overflow.
>
> Signed-off-by: Jan H. Schönherr <jschoenh@...zon.de>
> Signed-off-by: Fernand Sieber <sieberf@...zon.com>
> Reviewed-by: David Woodhouse <dwmw@...zon.co.uk>
> Reviewed-by: Hendrik Borghorst <hborghor@...zon.de>
> Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
> ---
> arch/x86/kvm/pmu.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 487ad19a236e..547512028e24 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
> {
> u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
>
> + /*
> + * A sample_period of 1 might get mistaken by perf for a BTS event, see
> + * intel_pmu_has_bts_period(). This would prevent re-arming the counter
> + * via pmc_resume_counter(), followed by the accidental creation of an
> + * actual BTS event, which we do not want.
> + *
> + * Avoid this by bumping the sampling period. Note, that we do not lose
> + * any precision, because the same quirk happens later anyway (for
> + * different reasons) in x86_perf_event_set_period().
> + */
> + if (sample_period == 1)
> + sample_period = 2;
Even without a PERF_COUNT_HW_BRANCH_INSTRUCTIONS event check?
> +
> if (!sample_period)
> sample_period = pmc_bitmask(pmc) + 1;
> return sample_period;