Message-ID: <aP0pkFfHcl4gA2gj@gourry-fedora-PF4VCD3F>
Date: Sat, 25 Oct 2025 15:48:32 -0400
From: Gregory Price <gourry@...rry.net>
To: "Richard W.M. Jones" <rjones@...hat.com>
Cc: John Stultz <jstultz@...gle.com>,
	Richard Henderson <richard.henderson@...aro.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Arnd Bergmann <arnd@...db.de>,
	Naresh Kamboju <naresh.kamboju@...aro.org>,
	Anders Roxell <anders.roxell@...aro.org>,
	Daniel Díaz <daniel.diaz@...aro.org>,
	Benjamin Copeland <ben.copeland@...aro.org>,
	linux-kernel@...r.kernel.org, x86@...nel.org,
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP
 tracing testing.
On Tue, Aug 08, 2023 at 08:28:35AM +0100, Richard W.M. Jones wrote:
> > [   21.375453] Call Trace:
> > [   21.375453]  <TASK>
> > [   21.375453]  ? die+0x2d/0x80
> > [   21.375453]  ? exc_int3+0xf3/0x100
> > [   21.375453]  ? asm_exc_int3+0x35/0x40
> > [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
> > [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
--- >8
> 
> Yes, it should be fixed upstream.  You will need these two commits:
> 
> commit deba78709ae8ce103e2248413857747f804cd1ef
> Author: Richard Henderson <richard.henderson@...aro.org>
> Date:   Thu Jul 6 17:55:48 2023 +0100
> 
>     accel/tcg: Always lock pages before translation
> 
Apologies for reviving an ancient thread, but I believe there is another
corner case for this bug - and it's an extremely narrow race condition.
Running in QEMU pc-q35-9.2 - so w/ all the fixes from this thread.
We first noticed crashes in poke_int3_handler() stemming from stacks
that look like so:
    __kmalloc_noprof+0x7e
    __kmalloc_cache_noprof+0x34
    __kmalloc_node_noprof+0x98
    kmem_cache_alloc_lru_noprof+0x37
    kmem_cache_alloc_noprof+0x3d
    ... etc ...
Which led us to the static_branch code and subsequently to this thread.
What we're seeing is that QEMU/KVM injects the int3 exception with the
IP of the instruction following the 0xCC... and the int3 has been
removed.
   > address_of_int3 + 5
The poke() code then Oopses because it can't find the int3.
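To make the failure mode concrete, here is a minimal user-space sketch of the lookup, assuming the handler derives the patch site by subtracting the one-byte int3 length from the reported RIP and then searches the set of sites currently being patched. The function name, the flat array, and the example addresses are hypothetical; this is a model of the idea, not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define INT3_INSN_SIZE 1 /* 0xCC is a single byte */

/*
 * Hypothetical user-space model, not kernel code. Given the RIP
 * reported with the #BP and the addresses currently being patched,
 * return the index of the matching site, or -1 if there is none.
 */
static int find_poke_site(uintptr_t trap_rip, const uintptr_t *sites,
                          size_t nr_sites)
{
    /* #BP is a trap: the reported RIP is one past the 0xCC byte. */
    uintptr_t ip = trap_rip - INT3_INSN_SIZE;

    for (size_t i = 0; i < nr_sites; i++) {
        if (sites[i] == ip)
            return (int)i;
    }
    return -1; /* no site matches: the int3 is treated as unexpected */
}
```

With a site at 0x1000, a trap RIP of 0x1001 matches, while a RIP of address_of_int3 + 5 (0x1005) finds nothing, which is the case that ends in the Oops.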
We can't find *why* the exception RIP has been incremented past the int3
instruction - nothing in the KVM or QEMU emulation code immediately
suggests an obvious bug.
We spent a good amount of time inspecting the fixes in this thread,
as well as this thread:
https://lore.kernel.org/all/20220423021411.784383-6-seanjc@google.com/
And this thread:
https://lore.kernel.org/all/20250611113001.GC2273038@noisy.programming.kicks-ass.net/
We tried building a reproducer (guest_repro.c below) that simply
hammers on static_branch in the guest.  This did not work by itself.
We went off to test/validate other things:
1) our guests are not configured to VMExit on int3
   svm.vmcb.control.intercepts[prog["INTERCEPT_EXCEPTION"]] & (1 << 3)
   (u32)0
2) svm_inject_exception DOES inject int3's, and we're fairly certain
   this injection is what is causing the wrong RIP.
3) We attempted to get KVM/QEMU to emulate the int3's by executing
   other instructions prior to the int3 that we know would cause VMExits
   (cpuid, inb/outb), but this never caused a reproduction.
4) We finally traced the int3 injection to a VM exit for a nested page
   fault.  After the fault handling there's also an int3 reported in
   EXITINTINFO - this causes an int3 injection.
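The two VMCB checks in items 1 and 4 can be sketched as plain bit tests. This is a minimal sketch, assuming the EXITINTINFO layout from the AMD APM (vector in bits 7:0, type in bits 10:8, valid in bit 31); the macro and function names are local to this example and are not KVM's, and the sample value in the note below is constructed for illustration, not captured from the guest.

```c
#include <stdbool.h>
#include <stdint.h>

#define BP_VECTOR 3 /* #BP, raised by int3 */

/* Field layout assumed from the AMD APM; names are local to this sketch. */
#define EXITINTINFO_VEC_MASK   0xffu
#define EXITINTINFO_TYPE_SHIFT 8
#define EXITINTINFO_TYPE_MASK  0x7u
#define EXITINTINFO_VALID      (1u << 31)

enum exitint_type {
    TYPE_INTR      = 0, /* external interrupt */
    TYPE_NMI       = 2,
    TYPE_EXCEPTION = 3,
    TYPE_SOFT_INT  = 4, /* software interrupt (INTn) */
};

/* Item 1: is the given exception vector set in the intercept word? */
static bool exception_intercepted(uint32_t intercepts, unsigned vector)
{
    return ((intercepts >> vector) & 1u) != 0;
}

/* Item 4: did the exit interrupt delivery of an event? */
static bool exitintinfo_valid(uint32_t info)
{
    return (info & EXITINTINFO_VALID) != 0;
}

static unsigned exitintinfo_vector(uint32_t info)
{
    return info & EXITINTINFO_VEC_MASK;
}

static unsigned exitintinfo_type(uint32_t info)
{
    return (info >> EXITINTINFO_TYPE_SHIFT) & EXITINTINFO_TYPE_MASK;
}
```

An intercept word of 0 gives exception_intercepted(0, BP_VECTOR) == false, matching the (u32)0 readout in item 1. A constructed value of 0x80000303 decodes as valid, type 3 (exception), vector 3.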
That led me to think this might be the result of swap or NUMA
balancing causing a guest page to become unmapped in the host, where
the next instruction is an int3, which would cause a nested fault.
I wrote a script on the host to migrate a guest's memory to/from a
local/remote node, and finally got a reproduction (stack below).
KVM doesn't support emulating INT3 in protected mode, which means
that the emulated INT3 and subsequent issue must happen in QEMU...
and that is where I am stuck.
So ultimately, this bug looks a lot like the one discussed in this
thread:  An int3 from a static_branch is modified on one thread as
another thread tries to execute that int3 - except now we have a
Nested-Page-Fault that produces the subsequent race.
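For readers who did not follow the original thread, the window can be modelled on a plain byte buffer. This is a rough sketch, assuming the usual three-step rewrite of a 5-byte site (plant the int3, write the tail, replace the int3); the helper names are hypothetical and the cross-CPU synchronisation the kernel performs between steps is only noted in comments.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xCC
#define SITE_SIZE   5 /* 5-byte nop or jmp rel32 */

/* Step 1: make the first byte trap so no CPU runs a half-written insn. */
static void poke_add_int3(uint8_t *site)
{
    site[0] = INT3_OPCODE;
    /* the kernel synchronises all CPUs here */
}

/* Step 2: write bytes 1..4 of the new instruction behind the int3. */
static void poke_write_tail(uint8_t *site, const uint8_t *new_insn)
{
    memcpy(site + 1, new_insn + 1, SITE_SIZE - 1);
    /* the kernel synchronises all CPUs here */
}

/* Step 3: replace the int3 with the first byte of the new instruction. */
static void poke_remove_int3(uint8_t *site, const uint8_t *new_insn)
{
    site[0] = new_insn[0];
    /* after this point no patch site is registered for the handler */
}

/* A #BP is only expected while the site still holds the int3. */
static bool site_has_int3(const uint8_t *site)
{
    return site[0] == INT3_OPCODE;
}
```

A vCPU that fetches the 0xCC between steps 1 and 3, but whose #BP is only delivered after step 3 (here, after the nested page fault has been handled), arrives at a handler that no longer knows about the site.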
Any thoughts on where we might look in QEMU or KVM to make further
progress would be helpful - but I figured where we've gotten to might be
of interest to the folks who originally fixed this static_branch race.
I appreciate any cycles you might spare to help,
~Gregory
--- reproduction stack
 <TASK>
 ? __die+0x77/0xc0
 ? die+0x2b/0x50
 ? exc_int3+0x41/0x70
 ? asm_exc_int3+0x35/0x40
 ? cleanup_module+0x80/0x80 [sbint3]
 ? looper+0x29/0x80 [sbint3]
 ? looper+0x29/0x80 [sbint3]
 ? looper+0x1e/0x80 [sbint3]
 kthread+0xb1/0xe0
 ? __kthread_parkme+0x70/0x70
 ret_from_fork+0x30/0x40
 ? __kthread_parkme+0x70/0x70
 ret_from_fork_asm+0x11/0x20
 </TASK>
-------- guest_repro.c   (heavily truncated for brevity)
DEFINE_STATIC_KEY_FALSE(int3_branch_key); // initially disabled
/* 180 threads hammering on this */
static int looper(void *data)
{
    while (!kthread_should_stop()) {
        schedule();
        /* nops produce 5-byte jump instr */
        if (static_branch_likely(&int3_branch_key))
            __asm__ volatile (".rept 40\n nop\n .endr\n");
    }
    return 0;
}
/* 1 thread hammering on this */
static int toggler(void *data)
{
    while (!kthread_should_stop()) {
        schedule();
        static_branch_enable(&int3_branch_key);
        static_branch_disable(&int3_branch_key);
    }
    return 0;
}
--- host_mover.c
/*
 * basically this, back and forth between node0<->node1,
 * 1GB at a time for each chunk in /proc/<pid>/maps
 */
int ret = move_pages(qemu_pid, chunk_size, pages, nodes, status, 0);
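The snippet above omits how the pages and nodes arrays are filled. A minimal sketch of that step follows, assuming one entry per page of a chunk taken from /proc/<pid>/maps; the helper name and its parameters are hypothetical. Note that move_pages(2) takes a count of pages as its second argument, so a chunk size in bytes would need converting (whether chunk_size above is bytes or pages is not stated).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fill the per-page arrays that move_pages(2)
 * expects for the range [start, start + len), sending every page to
 * target_node. Returns the number of entries written, which is the
 * page count to pass to move_pages().
 */
static size_t build_page_list(uintptr_t start, size_t len, size_t page_size,
                              int target_node, void **pages, int *nodes,
                              size_t max_pages)
{
    size_t n = 0;

    for (uintptr_t addr = start;
         addr < start + len && n < max_pages;
         addr += page_size) {
        pages[n] = (void *)addr; /* address in the target process */
        nodes[n] = target_node;  /* destination NUMA node */
        n++;
    }
    return n;
}
```

Calling this once per chunk with target_node alternating between 0 and 1, and passing the returned count to move_pages(), gives the back-and-forth migration described in the comment.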