[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ba1153d1860cd3c25f4d12340d25be3d33bfeae7.camel@redhat.com>
Date: Mon, 27 Jan 2025 12:58:40 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only
LBRs
On Fri, 2025-01-24 at 16:12 -0800, Sean Christopherson wrote:
> On Fri, Jan 24, 2025, Maxim Levitsky wrote:
> > On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote:
> > > > > > Note that I read all msrs using 'rdmsr' userspace tool.
> > > > >
> > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I
> > > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that
> > > > > down to confirm (the code I see on inspecition is gated on at least one perf
> > > > > event using LBRs). My guess is that there's a software bug somewhere in the
> > > > > perf/KVM exchange.
> > > > >
> > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM
> > > > > to read/write all LBRs during initialization works with LBRs disabled.
> >
> > Hi!
> >
> > I finally got to the very bottom of this:
> >
> > First of all, your assumption that the kernel resets LBR related msrs on
> > context switch after 'wrmsr' program finishes execution is wrong, because the
> > kernel will only do this if it *itself* enables the LBR feature (that is when
> > something like 'perf', uses a perf counter with a lbr call stack).
> >
> > Writes that 'wrmsr' tool does are not something that kernel expects so it
> > doesn't do anything in this case.
> >
> > What is happening instead, is something completely different: Turns out that
> > to shave off something like 50 nanoseconds, off the deep C-state entry/exit
> > latency, some Intel CPU don't preserve LBR stack values over these C-state
> > entries.
>
> Ugh.
>
> > > Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual"
> > > LBR perf event. KVM then relies on perf to context switch LBR MSRs, i.e. relies
> > > on perf to load the guest's values into hardware. At least, I think that's what
> > > is supposed to happen. AFAIK, the perf-based LBR support has never been properly
> > > document[*].
> > >
> > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the
> > > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is
> > > effectively lost. Again, I don't know the "rules" for the LBR perf event, but
> > > it wouldn't suprise me if your CI fails because something in the host conflicts
> > > with KVM's LBR perf event.
> >
> > Actually you are partially wrong here too (although BIOS can be considered
> > 'something on the host').
> >
> > I was able to prove that the reason why the unit test fails *is* because BIOS
> > left LBRs enabled:
> >
> > First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug
> > (I use a different machine now, which doesn't have the bios bug):
>
> ...
>
> > ==== Test Assertion Failure ====
> > x86_64/vmx_pmu_caps_test.c:202: r == v
> > pid=8415 tid=8415 errno=0 - Success
> > 1 0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202
> > 2 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
> > 3 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
> > 4 0x000000000040511a: __run_test at kselftest_harness.h:1240
> > 5 0x0000000000402b95: test_harness_run at kselftest_harness.h:1310
> > 6 (inlined by) main at vmx_pmu_caps_test.c:246
> > 7 0x00007f56ba2295cf: ?? ??:0
> > 8 0x00007f56ba22967f: ?? ??:0
> > 9 0x0000000000402e44: _start at ??:?
> > Set MSR_LBR_TOS to '0x7', got back '0xc'
> > # lbr_perf_capabilities: Test failed
> > # FAIL vmx_pmu_caps.lbr_perf_capabilities
> > not ok 5 vmx_pmu_caps.lbr_perf_capabilities
> > # RUN vmx_pmu_caps.perf_capabilities_unsupported ...
> > # OK vmx_pmu_caps.perf_capabilities_unsupported
> > ok 6 vmx_pmu_caps.perf_capabilities_unsupported
> > # FAILED: 5 / 6 tests passed.
> > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
> >
> > Secondary I went over all places in the kernel and all of them take care to
> > preserve DEBUG_CTL and only set/clear specific bits.
> >
> > __intel_pmu_lbr_enable() and __intel_pmu_lbr_enable() are practically the
> > only two places where DEBUGCTLMSR_LBR bit is touched, and the test doesn't
> > trigger them. Most likely because the test uses special
> > 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event)
> > which is not enabled while in host mode.
> >
> > To double check this I traced all writes to DEBUG_CTL msr during this test
> > and the only write is done during 'guest_wrmsr_perf_capabilities' subtest, by
> > vmx_vcpu_run() which just restores the value that the msr had prior to VM
> > entry.
> >
> > So, why the value that BIOS sets survives? Because as I said all code that
> > touches DEBUG_CTL takes care to preserve all bits but the bit which is
> > changed, LBRs are never enabled on the host, and even the guest entry
> > preserves host DEBUG_CTL. Therefore the value written by BIOS survives.
>
> Well that's rather insane.
>
> > So we end up with the test writing to LBR_TOS while LBRs are unexpectedly
> > enabled, so it's not a surprise that when the test reads back the value
> > written, it will differ, and the test will rightfully fail.
> >
> > Since we have seen this in CI, and you saw it too in your CI,
>
> Gah, that was bad reporting on my end. The failure we saw was something else
> entirely.
>
> > I think this BIOS bug is not that rare, and so I suggest to stick
> > 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in a kernel boot code or at
> > least clear the DEBUGCTLMSR_LBR bit.
> >
> > I haven't found a very good place to put this, in a way that I can be sure
> > that x86 maintainers won't reject it, so I am open to your suggestions.
>
> Compile tested only, but perf's CPU online path seems appropriate, especially
> since that path also explicitly clears LBRs. Ensuring LBRs are stopped before
> clearing them seems logical.
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 99c590da0ae2..6e898b832d75 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -5030,8 +5030,12 @@ static void intel_pmu_cpu_starting(int cpu)
>
> init_debug_store_on_cpu(cpu);
> /*
> - * Deal with CPUs that don't clear their LBRs on power-up.
> + * Deal with CPUs that don't clear their LBRs on power-up, and with
> + * BIOSes that leave LBRs enabled.
> */
> + if (!static_cpu_has(X86_FEATURE_ARCH_LBR) && x86_pmu.lbr_nr)
> + msr_clear_bit(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR_BIT);
> +
> intel_pmu_lbr_reset();
>
> cpuc->lbr_sel = NULL;
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3ae84c3b8e6d..bb7dd85aa6f2 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -395,7 +395,8 @@
> #define MSR_IA32_PASID_VALID BIT_ULL(31)
>
> /* DEBUGCTLMSR bits (others vary by model): */
> -#define DEBUGCTLMSR_LBR (1UL << 0) /* last branch recording */
> +#define DEBUGCTLMSR_LBR_BIT 0
> +#define DEBUGCTLMSR_LBR (1UL << DEBUGCTLMSR_LBR_BIT) /* last branch recording */
> #define DEBUGCTLMSR_BTF_SHIFT 1
> #define DEBUGCTLMSR_BTF (1UL << 1) /* single-step on branches */
> #define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL << 2)
>
I did some simulated test which sets the DEBUGCTLMSR_LBR early in the boot and this patch, and it worked just fine.
I agree that intel_pmu_cpu_starting is the best place to put this workaround.
You might consider refactoring the code that deals with LBR setup into a function,
like init_debug_store_on_cpu, maybe something like init_lbrs_on_cpu, but I don't mind
that, this patch as-is, is fine as well.
If I get my hands on the machine where this originally failed, I'll test there, although
most likely this just a formality.
Reviewed-by: Maxim Levitsky <mlevitsk@...hat.com>
Best regards,
Maxim Levitsky
Powered by blists - more mailing lists