linux-kernel - BUG: Occasional unexpected DR6 value seen with nested virtualization on x86

Message-ID: <CANDhNCq5_F3HfFYABqFGCA1bPd_+xgNj-iDQhH4tDk+wi8iZZg@mail.gmail.com>
Date: Tue, 21 Jan 2025 22:02:18 -0800
From: John Stultz <jstultz@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, 
	Peter Zijlstra <peterz@...radead.org>, Frederic Weisbecker <fweisbec@...il.com>
Cc: Andy Lutomirski <luto@...nel.org>, Borislav Petkov <bp@...e.de>, Jim Mattson <jmattson@...gle.com>, 
	Alex Bennée <alex.bennee@...aro.org>, 
	Will Deacon <will@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, LKML <linux-kernel@...r.kernel.org>, 
	kernel-team@...roid.com
Subject: BUG: Occasional unexpected DR6 value seen with nested virtualization
 on x86

For a while now, when testing Android in our virtualized x86
environment (cuttlefish), we've seen flaky failures (in ~1% of runs
or less) with the bionic sys_ptrace.watchpoint_stress test:
https://android.googlesource.com/platform/bionic/+/refs/heads/main/tests/sys_ptrace_test.cpp#221

The failure looks like:
bionic/tests/sys_ptrace_test.cpp:(194) Failure in test
sys_ptrace.watchpoint_stress
Expected equality of these values:
  4
  siginfo.si_code
    Which is: 1
sys_ptrace.watchpoint_stress exited with exitcode 1.

Basically we expect to get a SIGTRAP with si_code TRAP_HWBKPT (4), but
occasionally we get an si_code of TRAP_BRKPT (1).

I managed to reproduce the problem, and isolated it down to the call path:
[  173.185462] __send_signal_locked+0x3af/0x4b0
[  173.185563] send_signal_locked+0x16e/0x1b0
[  173.185649] force_sig_info_to_task+0x118/0x150
[  173.185759] force_sig_fault+0x60/0x80
[  173.185847] send_sigtrap+0x48/0x50
[  173.185933] noist_exc_debug+0xbe/0x100

Where we seem to be in exc_debug_user():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1067

Specifically here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1130
        icebp = !dr6;
        ...
        /* Add the virtual_dr6 bits for signals. */
        dr6 |= current->thread.virtual_dr6;
        if (dr6 & (DR_STEP | DR_TRAP_BITS) || icebp)
                send_sigtrap(regs, 0, get_si_code(dr6));

Where get_si_code() is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/traps.h#n28
static inline int get_si_code(unsigned long condition)
{
        if (condition & DR_STEP)
                return TRAP_TRACE;
        else if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3))
                return TRAP_HWBKPT;
        else
                return TRAP_BRKPT;
}

We seem to be hitting the case where dr6 is zero. Since icebp gets
set in that case, we still send the signal, but we call get_si_code()
with a zero value, which gives us TRAP_BRKPT instead of TRAP_HWBKPT.

The dr6 value passed to exc_debug_user() comes from
debug_read_clear_dr6() in the definition for
DEFINE_IDTENTRY_DEBUG_USER(exc_debug):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1147
Where debug_read_clear_dr6() is implemented here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n926

I then cut down and ported the bionic test out so it could build under
a standard debian environment:
https://github.com/johnstultz-work/bionic-ptrace-reproducer

Where I was able to reproduce the same problem in a debian VM (after
running the test in a loop for a short while).

Now, here's where it is odd. I could *not* reproduce the problem on
bare metal hardware, *nor* could I reproduce the problem in a virtual
environment.  I can *only* reproduce the problem with nested
virtualization (running the VM inside a VM).

I have reproduced this on my Intel i12 NUC using the same v6.12 kernel
in the metal + virt + nested environments.  It also reproduced on the
NUC with v5.15 (metal) + v6.1 (virt) + v6.1 (nested).

I've also reproduced it with both VMs using only 1 CPU, and with
qemu on the bare metal pinned to a single CPU via taskset, to rule out
any sort of issue with vCPUs migrating around.

Also setting enable_shadow_vmcs=0 on the metal host didn't seem to
affect the behavior.

I've tried to do some tracing in the arch/x86/kvm/x86.c logic, but
I've not yet directly correlated anything on the hosts to the point
where we read the zero DR6 value in the nested guest.

But I'm not very savvy around virtualization or ptrace watchpoints or
the low-level details of the Intel DR6 register, so I wanted to bring
this up on the list to see if folks have suggestions or ideas to
further narrow this down.  Happy to test things, as it's pretty simple
to reproduce here.

Many thanks to Alex Bennee and Jim Mattson for their testing
suggestions to help narrow this down so far.

thanks
-john