Message-ID: <aP0pkFfHcl4gA2gj@gourry-fedora-PF4VCD3F>
Date: Sat, 25 Oct 2025 15:48:32 -0400
From: Gregory Price <gourry@...rry.net>
To: "Richard W.M. Jones" <rjones@...hat.com>
Cc: John Stultz <jstultz@...gle.com>,
Richard Henderson <richard.henderson@...aro.org>,
Peter Zijlstra <peterz@...radead.org>,
Arnd Bergmann <arnd@...db.de>,
Naresh Kamboju <naresh.kamboju@...aro.org>,
Anders Roxell <anders.roxell@...aro.org>,
Daniel Díaz <daniel.diaz@...aro.org>,
Benjamin Copeland <ben.copeland@...aro.org>,
linux-kernel@...r.kernel.org, x86@...nel.org,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP
tracing testing.
On Tue, Aug 08, 2023 at 08:28:35AM +0100, Richard W.M. Jones wrote:
> > [ 21.375453] Call Trace:
> > [ 21.375453] <TASK>
> > [ 21.375453] ? die+0x2d/0x80
> > [ 21.375453] ? exc_int3+0xf3/0x100
> > [ 21.375453] ? asm_exc_int3+0x35/0x40
> > [ 21.375453] ? hrtimer_start_range_ns+0x1ab/0x3d0
> > [ 21.375453] ? hrtimer_start_range_ns+0x1ab/0x3d0
--- >8
>
> Yes, it should be fixed upstream. You will need these two commits:
>
> commit deba78709ae8ce103e2248413857747f804cd1ef
> Author: Richard Henderson <richard.henderson@...aro.org>
> Date: Thu Jul 6 17:55:48 2023 +0100
>
> accel/tcg: Always lock pages before translation
>
Apologies for reviving an ancient thread, but I believe there is another
corner case for this bug - and it's an extremely narrow race condition.
We're running QEMU machine type pc-q35-9.2 - so with all the fixes from
this thread.
We first noticed crashes in poke_int3_handler() stemming from stacks
that look like so:
__kmalloc_noprof+0x7e
__kmalloc_cache_noprof+0x34
__kmalloc_node_noprof+0x98
kmem_cache_alloc_lru_noprof+0x37
kmem_cache_alloc_noprof+0x3d
... etc ...
This led us to the static_branch code and, subsequently, to this thread.
What we're seeing is that QEMU/KVM injects the int3 exception with the
exception IP set to the address of the instruction following the
0xCC... and the int3 itself has already been removed:

    address_of_int3 + 5

The poke_int3_handler() code then Oopses because it can't find the int3.
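For reference, the address math the kernel handler relies on - this is a
rough sketch only, not the real poke_int3_handler() from
arch/x86/kernel/alternative.c, and patch_site below stands in for an
entry in the kernel's active text_poke_loc list:

#include <linux/ptrace.h>	/* struct pt_regs */
#include <linux/types.h>

static bool int3_matches_patch_site(struct pt_regs *regs,
				    unsigned long patch_site)
{
	/* #BP pushes the address *after* the 1-byte 0xCC */
	unsigned long ip = regs->ip - 1;

	/*
	 * If the reported RIP is really patch_site + 5, then ip is
	 * patch_site + 4, nothing matches, the handler declines the
	 * exception, and exc_int3() ends up in die().
	 */
	return ip == patch_site;
}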
We can't find *why* the exception RIP has been incremented past the int3
instruction - nothing in the KVM or QEMU emulation code immediately
suggests an obvious bug.
We spent a good amount of time inspecting the fixes in this thread,
as well as this thread:
https://lore.kernel.org/all/20220423021411.784383-6-seanjc@google.com/
And this thread:
https://lore.kernel.org/all/20250611113001.GC2273038@noisy.programming.kicks-ass.net/
We tried building a reproducer (guest_repro.c below) that simply
hammers on static_branch in the guest. This did not work by itself.
We went off to test/validate other things:
1) our guests are not configured to VMExit on int3

   svm.vmcb.control.intercepts[prog["INTERCEPT_EXCEPTION"]] & (1 << 3)
   (u32)0
2) svm_inject_exception DOES inject int3's, and we're fairly certain
this injection is what is causing the wrong RIP.
3) We attempted to get KVM/QEMU to emulate the int3's by executing
other instructions prior to the int3 that we know would cause VMExits
(cpuid, inb/outb), but this never caused a reproduction.
4) We finally traced the int3 injection to a VM exit for a nested page
fault. After the fault is handled, an int3 is also reported in
EXITINTINFO - this causes an int3 injection (see the field-decode
sketch just below).
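To make (4) concrete: EXITINTINFO uses the same encoding as EVENTINJ in
the AMD APM (vector in bits 7:0, type in bits 10:8, valid in bit 31).
A minimal decode sketch - illustrative only, not KVM code, and the
macro names are made up here:

#include <stdbool.h>
#include <stdint.h>

#define EVT_VEC_MASK	0xffu
#define EVT_TYPE_MASK	(7u << 8)
#define EVT_TYPE_EXCP	(3u << 8)	/* type 3 == exception */
#define EVT_VALID	(1u << 31)

/* true if a #BP (vector 3) was pending when the nested fault exited */
static bool exitintinfo_is_bp(uint32_t exit_int_info)
{
	return (exit_int_info & EVT_VALID) &&
	       (exit_int_info & EVT_TYPE_MASK) == EVT_TYPE_EXCP &&
	       (exit_int_info & EVT_VEC_MASK) == 3;
}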
That led me to think this might be the result of swap or NUMA balancing
causing a guest page to become unmapped in the host while the next
instruction to execute is an int3, which would cause a nested page
fault.
I wrote a script on the host to migrate a guest's memory to/from a
local/remote node, and finally got a reproduction (stack below).
KVM doesn't support emulating INT3 in protected mode, which means
that the emulated INT3 and subsequent issue must happen in QEMU...
and that is where I am stuck.
So ultimately, this bug looks a lot like the one discussed in this
thread: An int3 from a static_branch is modified on one thread as
another thread tries to execute that int3 - except now we have a
Nested-Page-Fault that produces the subsequent race.
Any thoughts on where we might look in QEMU or KVM to make further
progress would be helpful - but I figured where we've gotten to might be
of interest to the folks who originally fixed this static_branch race.
I appreciate any cycles you might spare to help,
~Gregory
--- reproduction stack
<TASK>
? __die+0x77/0xc0
? die+0x2b/0x50
? exc_int3+0x41/0x70
? asm_exc_int3+0x35/0x40
? cleanup_module+0x80/0x80 [sbint3]
? looper+0x29/0x80 [sbint3]
? looper+0x29/0x80 [sbint3]
? looper+0x1e/0x80 [sbint3]
kthread+0xb1/0xe0
? __kthread_parkme+0x70/0x70
ret_from_fork+0x30/0x40
? __kthread_parkme+0x70/0x70
ret_from_fork_asm+0x11/0x20
</TASK>
-------- guest_repro.c (heavily truncated for brevity)
DEFINE_STATIC_KEY_FALSE(int3_branch_key);	/* initially disabled */

/* 180 threads hammering on this */
static int looper(void *data)
{
	while (!kthread_should_stop()) {
		schedule();
		/* 40 nops force the static branch to use a 5-byte jump instr */
		if (static_branch_likely(&int3_branch_key))
			__asm__ volatile (".rept 40\n nop\n .endr\n");
	}
	return 0;
}

/* 1 thread hammering on this */
static int toggler(void *data)
{
	while (!kthread_should_stop()) {
		schedule();
		static_branch_enable(&int3_branch_key);
		static_branch_disable(&int3_branch_key);
	}
	return 0;
}
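The truncated parts are just the usual kthread plumbing; a minimal
sketch of what the missing boilerplate might look like (names and the
lack of error handling here are illustrative, not our exact code):

#include <linux/jump_label.h>
#include <linux/kthread.h>
#include <linux/module.h>

#define NR_LOOPERS	180

static struct task_struct *loopers[NR_LOOPERS];
static struct task_struct *toggle_task;

static int __init sbint3_init(void)
{
	int i;

	for (i = 0; i < NR_LOOPERS; i++)
		loopers[i] = kthread_run(looper, NULL, "sbint3-loop/%d", i);
	toggle_task = kthread_run(toggler, NULL, "sbint3-toggle");
	return 0;
}

static void __exit sbint3_exit(void)
{
	int i;

	kthread_stop(toggle_task);
	for (i = 0; i < NR_LOOPERS; i++)
		kthread_stop(loopers[i]);
}

module_init(sbint3_init);
module_exit(sbint3_exit);
MODULE_LICENSE("GPL");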
--- host_mover.c
/*
 * basically this, back and forth between node0 <-> node1,
 * 1GB at a time for each chunk in /proc/<pid>/maps
 */
int ret = move_pages(qemu_pid, chunk_size, pages, nodes, status, 0);
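For completeness, the shape of the loop around that call - a sketch
assuming the <numaif.h> move_pages() wrapper from libnuma (link with
-lnuma); the chunking constants and helper name are illustrative:

#include <numaif.h>
#include <stdlib.h>
#include <sys/types.h>

#define PAGE_SZ		4096UL
#define CHUNK_PAGES	((1UL << 30) / PAGE_SZ)	/* 1GB of 4K pages */

/* move one 1GB chunk of the guest's address space to the given node */
static long move_chunk(pid_t qemu_pid, unsigned long start, int node)
{
	void **pages = malloc(CHUNK_PAGES * sizeof(*pages));
	int *nodes = malloc(CHUNK_PAGES * sizeof(*nodes));
	int *status = malloc(CHUNK_PAGES * sizeof(*status));
	unsigned long i;
	long ret;

	for (i = 0; i < CHUNK_PAGES; i++) {
		pages[i] = (void *)(start + i * PAGE_SZ);
		nodes[i] = node;
	}
	/* outer loop (not shown) walks /proc/<pid>/maps, alternating node */
	ret = move_pages(qemu_pid, CHUNK_PAGES, pages, nodes, status, 0);

	free(pages);
	free(nodes);
	free(status);
	return ret;
}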