lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ba1153d1860cd3c25f4d12340d25be3d33bfeae7.camel@redhat.com>
Date: Mon, 27 Jan 2025 12:58:40 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only
 LBRs

On Fri, 2025-01-24 at 16:12 -0800, Sean Christopherson wrote:
> On Fri, Jan 24, 2025, Maxim Levitsky wrote:
> > On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote:
> > > > > > Note that I read all msrs using 'rdmsr' userspace tool.
> > > > > 
> > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work.  I
> > > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that
> > > > > down to confirm (the code I see on inspecition is gated on at least one perf
> > > > > event using LBRs).  My guess is that there's a software bug somewhere in the
> > > > > perf/KVM exchange.
> > > > > 
> > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM
> > > > > to read/write all LBRs during initialization works with LBRs disabled.
> > 
> > Hi!
> > 
> > I finally got to the very bottom of this:
> > 
> > First of all, your assumption that the kernel resets LBR related msrs on
> > context switch after 'wrmsr' program finishes execution is wrong, because the
> > kernel will only do this if it *itself* enables the LBR feature (that is when
> > something like 'perf', uses a perf counter with a lbr call stack).
> > 
> > Writes that 'wrmsr' tool does are not something that kernel expects so it
> > doesn't do anything in this case.
> > 
> > What is happening instead, is something completely different: Turns out that
> > to shave off something like 50 nanoseconds, off the deep C-state entry/exit
> > latency, some Intel CPU don't preserve LBR stack values over these C-state
> > entries.
> 
> Ugh.
> 
> > > Ugh, but it does.  On writes to any LBR, including LBR_TOS, KVM creates a "virtual"
> > > LBR perf event.  KVM then relies on perf to context switch LBR MSRs, i.e. relies
> > > on perf to load the guest's values into hardware.  At least, I think that's what
> > > is supposed to happen.  AFAIK, the perf-based LBR support has never been properly
> > > document[*].
> > > 
> > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the
> > > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is
> > > effectively lost.  Again, I don't know the "rules" for the LBR perf event, but
> > > it wouldn't suprise me if your CI fails because something in the host conflicts
> > > with KVM's LBR perf event.
> > 
> > Actually you are partially wrong here too (although BIOS can be considered
> > 'something on the host').
> > 
> > I was able to prove that the reason why the unit test fails *is* because BIOS
> > left LBRs enabled:
> > 
> > First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug 
> > (I use a different machine now, which doesn't have the bios bug):
> 
> ...
> 
> > ==== Test Assertion Failure ====
> >   x86_64/vmx_pmu_caps_test.c:202: r == v
> >   pid=8415 tid=8415 errno=0 - Success
> >      1	0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202
> >      2	 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
> >      3	 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
> >      4	0x000000000040511a: __run_test at kselftest_harness.h:1240
> >      5	0x0000000000402b95: test_harness_run at kselftest_harness.h:1310
> >      6	 (inlined by) main at vmx_pmu_caps_test.c:246
> >      7	0x00007f56ba2295cf: ?? ??:0
> >      8	0x00007f56ba22967f: ?? ??:0
> >      9	0x0000000000402e44: _start at ??:?
> >   Set MSR_LBR_TOS to '0x7', got back '0xc'
> > # lbr_perf_capabilities: Test failed
> > #          FAIL  vmx_pmu_caps.lbr_perf_capabilities
> > not ok 5 vmx_pmu_caps.lbr_perf_capabilities
> > #  RUN           vmx_pmu_caps.perf_capabilities_unsupported ...
> > #            OK  vmx_pmu_caps.perf_capabilities_unsupported
> > ok 6 vmx_pmu_caps.perf_capabilities_unsupported
> > # FAILED: 5 / 6 tests passed.
> > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
> > 
> > Secondary I went over all places in the kernel and all of them take care to
> > preserve DEBUG_CTL and only set/clear specific bits.
> > 
> > __intel_pmu_lbr_enable() and __intel_pmu_lbr_enable() are practically the
> > only two places where DEBUGCTLMSR_LBR bit is touched, and the test doesn't
> > trigger them. Most likely because the test uses special
> > 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event)
> > which is not enabled while in host mode.
> > 
> > To double check this I traced all writes to DEBUG_CTL msr during this test
> > and the only write is done during 'guest_wrmsr_perf_capabilities' subtest, by
> > vmx_vcpu_run() which just restores the value that the msr had prior to VM
> > entry.
> > 
> > So, why the value that BIOS sets survives? Because as I said all code that
> > touches DEBUG_CTL takes care to preserve all bits but the bit which is
> > changed, LBRs are never enabled on the host, and even the guest entry
> > preserves host DEBUG_CTL.  Therefore the value written by BIOS survives.
> 
> Well that's rather insane.
> 
> > So we end up with the test writing to LBR_TOS while LBRs are unexpectedly
> > enabled, so it's not a surprise that when the test reads back the value
> > written, it will differ, and the test will rightfully fail.
> > 
> > Since we have seen this in CI, and you saw it too in your CI,
> 
> Gah, that was bad reporting on my end.  The failure we saw was something else
> entirely.
> 
> > I think this BIOS bug is not that rare, and so I suggest to stick
> > 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in a kernel boot code or at
> > least clear the DEBUGCTLMSR_LBR bit.
> > 
> > I haven't found a very good place to put this, in a way that I can be sure
> > that x86 maintainers won't reject it, so I am open to your suggestions.
> 
> Compile tested only, but perf's CPU online path seems appropriate, especially
> since that path also explicitly clears LBRs.  Ensuring LBRs are stopped before
> clearing them seems logical.
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 99c590da0ae2..6e898b832d75 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -5030,8 +5030,12 @@ static void intel_pmu_cpu_starting(int cpu)
>  
>         init_debug_store_on_cpu(cpu);
>         /*
> -        * Deal with CPUs that don't clear their LBRs on power-up.
> +        * Deal with CPUs that don't clear their LBRs on power-up, and with
> +        * BIOSes that leave LBRs enabled.
>          */
> +       if (!static_cpu_has(X86_FEATURE_ARCH_LBR) && x86_pmu.lbr_nr)
> +               msr_clear_bit(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR_BIT);
> +
>         intel_pmu_lbr_reset();
>  
>         cpuc->lbr_sel = NULL;
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3ae84c3b8e6d..bb7dd85aa6f2 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -395,7 +395,8 @@
>  #define MSR_IA32_PASID_VALID           BIT_ULL(31)
>  
>  /* DEBUGCTLMSR bits (others vary by model): */
> -#define DEBUGCTLMSR_LBR                        (1UL <<  0) /* last branch recording */
> +#define DEBUGCTLMSR_LBR_BIT            0
> +#define DEBUGCTLMSR_LBR                        (1UL <<  DEBUGCTLMSR_LBR_BIT) /* last branch recording */
>  #define DEBUGCTLMSR_BTF_SHIFT          1
>  #define DEBUGCTLMSR_BTF                        (1UL <<  1) /* single-step on branches */
>  #define DEBUGCTLMSR_BUS_LOCK_DETECT    (1UL <<  2)
> 

I did some simulated test which sets the DEBUGCTLMSR_LBR early in the boot and this patch, and it worked just fine.
I agree that intel_pmu_cpu_starting is the best place to put this workaround.

You might consider refactoring the code that deals with LBR setup into a function,
like init_debug_store_on_cpu, maybe something like init_lbrs_on_cpu, but I don't mind
that, this patch as-is, is fine as well.

If I get my hands on the machine where this originally failed, I'll test there, although
most likely this just a formality.


Reviewed-by: Maxim Levitsky <mlevitsk@...hat.com>

Best regards,
	Maxim Levitsky





Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ