Message-ID: <f7b73f3b65377b7fd28f1f4764ea18f98056c51a.camel@redhat.com>
Date: Fri, 24 Jan 2025 18:36:03 -0500
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only
 LBRs

On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote:
> On Wed, Jan 22, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-01-21 at 17:02 -0800, Sean Christopherson wrote:
> > > On Sun, Nov 03, 2024, Maxim Levitsky wrote:
> > > > On Mon, 2024-10-28 at 08:55 -0700, Sean Christopherson wrote:
> > > > > On Fri, Oct 18, 2024, Maxim Levitsky wrote:
> > > > > > Our CI found another issue, this time with vmx_pmu_caps_test.
> > > > > > 
> > > > > > On 'Intel(R) Xeon(R) Gold 6328HL CPU' I see that all LBR msrs (from/to and
> > > > > > TOS), are always read only - even when LBR is disabled - once I disable the
> > > > > > feature in DEBUG_CTL, all LBR msrs reset to 0, and you can't change their
> > > > > > value manually.  Freeze LBRS on PMI seems not to affect this behavior.
> > > 
> > > ...
> > > 
> > > > When DEBUG_CTL.LBR=1, the LBRs do work, I see all the registers update,
> > > > although TOS does seem to be stuck at one value, but it does change
> > > > sometimes, and it's non zero.
> > > > 
> > > > The FROM/TO do show healthy amount of updates 
> > > > 
> > > > Note that I read all msrs using 'rdmsr' userspace tool.
> > > 
> > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work.  I
> > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that
> > > down to confirm (the code I see on inspection is gated on at least one perf
> > > event using LBRs).  My guess is that there's a software bug somewhere in the
> > > perf/KVM exchange.
> > > 
> > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM
> > > to read/write all LBRs during initialization works with LBRs disabled.

Hi!

I finally got to the very bottom of this:

First of all, your assumption that the kernel resets the LBR-related MSRs on context switch after the 'wrmsr'
program finishes execution is wrong, because the kernel only does this if it *itself*
enabled the LBR feature (that is, when something like 'perf' uses a perf counter with an LBR call stack).

The writes that the 'wrmsr' tool does are not something the kernel expects, so it doesn't
do anything in this case.

What is happening instead is something completely different: it turns out that, to shave something like
50 nanoseconds off the deep C-state entry/exit latency, some Intel CPUs don't preserve the LBR stack
values across deep C-state entries.

The kernel PMU code even has some special code to work around this.

So, on an otherwise idle host, right after 'wrmsr' finishes executing, the CPU enters a low power state,
and 'poof', the LBR state is gone.

To see this for yourself, just disable C-states:

# cpupower idle-set --disable-by-latency 0

And suddenly rdmsr/wrmsr reads/writes of the LBR stack start to work as expected.

In particular, this also explains why I had no problems reading/writing the LBR stack MSRs on some older CPUs.
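
For anyone who wants to repeat the experiment without the 'rdmsr'/'wrmsr' tools, here is a minimal userspace sketch of a write/read-back check through the msr driver's /dev/cpu/N/msr interface (needs root and the msr module loaded). The helper names (msr_dev_path, msr_read, msr_write, msr_write_sticks) are made up for illustration, and the one second pause is only an assumption about how long an idle CPU needs to drop into a deep C-state:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_LBR_TOS 0x1c9 /* index of MSR_LBR_TOS, as in the kernel's msr-index.h */

/* Build the per-CPU msr device path, e.g. "/dev/cpu/0/msr". */
int msr_dev_path(char *buf, size_t len, int cpu)
{
	int n = snprintf(buf, len, "/dev/cpu/%d/msr", cpu);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read one MSR on the given CPU: the file offset is the MSR index. */
int msr_read(int cpu, uint32_t msr, uint64_t *val)
{
	char path[64];
	ssize_t n;
	int fd;

	if (msr_dev_path(path, sizeof(path), cpu))
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, val, sizeof(*val), msr);
	close(fd);
	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/* Write one MSR on the given CPU. */
int msr_write(int cpu, uint32_t msr, uint64_t val)
{
	char path[64];
	ssize_t n;
	int fd;

	if (msr_dev_path(path, sizeof(path), cpu))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = pwrite(fd, &val, sizeof(val), msr);
	close(fd);
	return n == (ssize_t)sizeof(val) ? 0 : -1;
}

/*
 * Write val, pause so that an idle CPU has a chance to enter a C-state,
 * then read the MSR back.
 * Returns 1 if the value survived, 0 if it was lost, -1 on access error.
 */
int msr_write_sticks(int cpu, uint32_t msr, uint64_t val)
{
	uint64_t back;

	if (msr_write(cpu, msr, val))
		return -1;
	sleep(1);
	if (msr_read(cpu, msr, &back))
		return -1;
	return back == val;
}
```

Calling msr_write_sticks(0, MSR_LBR_TOS, 0x7) before and after the cpupower command above should show the difference on an affected CPU, assuming LBRs are disabled in DEBUG_CTL and nothing else on the host is using them.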

> > 
> > Hi,
> > 
> > OK, this is a very good piece of the puzzle.
> > 
> > I didn't expect context switch to interfere with this because I thought that
> > perf code won't touch LBRs if they are not in use. 
> > rdmsr/wrmsr programs don't do much except doing the instruction in the kernel space.
> > 
> > Is it then possible that the fact that LBRs were left enabled by BIOS is the
> > culprit of the problem?
> > 
> > This particular test never enables LBRs, not anything in the system does this,
> 
> Ugh, but it does.  On writes to any LBR, including LBR_TOS, KVM creates a "virtual"
> LBR perf event.  KVM then relies on perf to context switch LBR MSRs, i.e. relies
> on perf to load the guest's values into hardware.  At least, I think that's what
> is supposed to happen.  AFAIK, the perf-based LBR support has never been properly
> documented[*].
> 
> Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the
> vCPU's LBR perf event is scheduled out or can't be created, the guest's value is
> effectively lost.  Again, I don't know the "rules" for the LBR perf event, but
> it wouldn't surprise me if your CI fails because something in the host conflicts
> with KVM's LBR perf event.

Actually, you are partially wrong here too (although the BIOS can be considered 'something on the host').

I was able to prove that the unit test fails *because* the BIOS left LBRs enabled:

First of all, setting the LBR bit manually in DEBUG_CTL does trigger this bug
(I am using a different machine now, one which doesn't have the BIOS bug):


# wrmsr -a 0x1d9 0x4001
# ./x86_64/vmx_pmu_caps_test 
Random seed: 0x6b8b4567
TAP version 13
1..6
# Starting 6 tests from 1 test cases.
#  RUN           vmx_pmu_caps.guest_wrmsr_perf_capabilities ...
#            OK  vmx_pmu_caps.guest_wrmsr_perf_capabilities
ok 1 vmx_pmu_caps.guest_wrmsr_perf_capabilities
#  RUN           vmx_pmu_caps.basic_perf_capabilities ...
#            OK  vmx_pmu_caps.basic_perf_capabilities
ok 2 vmx_pmu_caps.basic_perf_capabilities
#  RUN           vmx_pmu_caps.fungible_perf_capabilities ...
#            OK  vmx_pmu_caps.fungible_perf_capabilities
ok 3 vmx_pmu_caps.fungible_perf_capabilities
#  RUN           vmx_pmu_caps.immutable_perf_capabilities ...
#            OK  vmx_pmu_caps.immutable_perf_capabilities
ok 4 vmx_pmu_caps.immutable_perf_capabilities
#  RUN           vmx_pmu_caps.lbr_perf_capabilities ...
==== Test Assertion Failure ====
  x86_64/vmx_pmu_caps_test.c:202: r == v
  pid=8415 tid=8415 errno=0 - Success
     1	0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202
     2	 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
     3	 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194
     4	0x000000000040511a: __run_test at kselftest_harness.h:1240
     5	0x0000000000402b95: test_harness_run at kselftest_harness.h:1310
     6	 (inlined by) main at vmx_pmu_caps_test.c:246
     7	0x00007f56ba2295cf: ?? ??:0
     8	0x00007f56ba22967f: ?? ??:0
     9	0x0000000000402e44: _start at ??:?
  Set MSR_LBR_TOS to '0x7', got back '0xc'
# lbr_perf_capabilities: Test failed
#          FAIL  vmx_pmu_caps.lbr_perf_capabilities
not ok 5 vmx_pmu_caps.lbr_perf_capabilities
#  RUN           vmx_pmu_caps.perf_capabilities_unsupported ...
#            OK  vmx_pmu_caps.perf_capabilities_unsupported
ok 6 vmx_pmu_caps.perf_capabilities_unsupported
# FAILED: 5 / 6 tests passed.
# Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
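
For reference, the 0x4001 written to MSR 0x1d9 (DEBUG_CTL) above can be decoded with a small table. This is only a sketch: the bit positions are the ones I know from the kernel's msr-index.h, the table is deliberately incomplete, and debugctl_describe() is a made-up helper name:

```c
#include <stdint.h>
#include <stdio.h>

/* Bit positions as in arch/x86/include/asm/msr-index.h (partial list). */
#define DEBUGCTLMSR_LBR                (1ULL << 0)
#define DEBUGCTLMSR_BTF                (1ULL << 1)
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI (1ULL << 11)
#define DEBUGCTLMSR_FREEZE_IN_SMM      (1ULL << 14)

struct debugctl_flag {
	uint64_t mask;
	const char *name;
};

static const struct debugctl_flag debugctl_flags[] = {
	{ DEBUGCTLMSR_LBR,                "LBR" },
	{ DEBUGCTLMSR_BTF,                "BTF" },
	{ DEBUGCTLMSR_FREEZE_LBRS_ON_PMI, "FREEZE_LBRS_ON_PMI" },
	{ DEBUGCTLMSR_FREEZE_IN_SMM,      "FREEZE_IN_SMM" },
};

/*
 * Print the names of the known bits set in val and return the mask of
 * set bits that this (incomplete) table does not know about.
 */
uint64_t debugctl_describe(uint64_t val)
{
	size_t i;

	for (i = 0; i < sizeof(debugctl_flags) / sizeof(debugctl_flags[0]); i++) {
		if (val & debugctl_flags[i].mask) {
			printf("%s\n", debugctl_flags[i].name);
			val &= ~debugctl_flags[i].mask;
		}
	}
	return val;
}
```

With this table, debugctl_describe(0x4001) names LBR (bit 0) and FREEZE_IN_SMM (bit 14) and leaves no unknown bits, so bit 0 is the one that matters for the failure above.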


Secondly, I went over all the places in the kernel that write DEBUG_CTL, and all of them take care to preserve its value and only set/clear specific bits.

__intel_pmu_lbr_enable() and __intel_pmu_lbr_disable() are practically the only two places where the DEBUGCTLMSR_LBR bit is touched,
and the test doesn't trigger them, most likely because the test uses the special 'INTEL_FIXED_VLBR_EVENT' perf event
(see intel_pmu_create_guest_lbr_event()), which is not enabled while in host mode.

To double-check this, I traced all writes to the DEBUG_CTL MSR during this test, and the only write is done during the 'guest_wrmsr_perf_capabilities'
subtest, by vmx_vcpu_run(), which just restores the value that the MSR had prior to VM entry.

So, why does the value that the BIOS sets survive? Because, as I said, all code that touches DEBUG_CTL takes care to preserve all bits except
the one being changed, LBRs are never enabled on the host, and even the guest entry preserves the host's DEBUG_CTL.
Therefore the value written by the BIOS survives.

So we end up with the test writing to LBR_TOS while LBRs are unexpectedly enabled, so it's no surprise that when the test
reads back the value it wrote, it differs, and the test rightfully fails.

Since we have seen this in our CI, and you saw it in your CI too, I think this BIOS bug is not that rare, so I suggest sticking a
'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in the kernel boot code,
or at least clearing the DEBUGCTLMSR_LBR bit there.
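
To illustrate the argument, here is a small userspace model (not kernel code) of why the BIOS value survives read-modify-write updates, and of what the milder variant of the suggestion would do. All function names are hypothetical, and the MSR is modelled as a plain 64-bit value:

```c
#include <stdint.h>

#define DEBUGCTLMSR_LBR                (1ULL << 0)
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI (1ULL << 11)

/* Model of a typical writer: set some bits, preserve everything else. */
uint64_t debugctl_set_bits(uint64_t cur, uint64_t bits)
{
	return cur | bits;
}

/* Model of a typical writer: clear some bits, preserve everything else. */
uint64_t debugctl_clear_bits(uint64_t cur, uint64_t bits)
{
	return cur & ~bits;
}

/*
 * Model of the proposed early-boot fixup (milder variant): drop only the
 * LBR enable bit that the BIOS may have left set, keep the other bits.
 */
uint64_t debugctl_boot_sanitize(uint64_t bios_val)
{
	return debugctl_clear_bits(bios_val, DEBUGCTLMSR_LBR);
}
```

Starting from 0x4001, setting and later clearing an unrelated bit such as DEBUGCTLMSR_FREEZE_LBRS_ON_PMI gives 0x4001 again, so the LBR bit is never dropped by accident; debugctl_boot_sanitize(0x4001) gives 0x4000.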

I haven't found a really good place to put this, one where I can be sure that the x86 maintainers
won't reject it, so I am open to your suggestions.


Best regards,
	Maxim Levitsky


> 
> [*] https://lore.kernel.org/all/Y9RUOvJ5dkCU9J8C@google.com
> 





