Message-ID: <20241203100757.GG8562@noisy.programming.kicks-ass.net>
Date: Tue, 3 Dec 2024 11:07:57 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Seohyeon Maeng <msh1307@...ori.io>
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
	dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
	jgross@...e.com, leitao@...ian.org, linux-kernel@...r.kernel.org,
	bioloidgp@...il.com
Subject: Re: [Report] Race Condition in text_poke_bp_batch/poke_int3_handler

On Tue, Dec 03, 2024 at 04:58:50PM +0900, Seohyeon Maeng wrote:

> [   24.729808] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014

What QEMU version and setup are you using? 

There have been QEMU bugs in this area. Can you reproduce on real
hardware? Because I can't seem to trigger this...


> A kernel panic occurs when the following code is executed during live
> patching. In this scenario, an int3 trap can be triggered.
> 
> static inline void perf_event_task_sched_out(struct task_struct *prev,
> 					     struct task_struct *next)
> {
> 	[...]
> 	if (static_branch_unlikely(&perf_sched_events))
> 		__perf_event_task_sched_out(prev, next);
> }
> 
> noinstr int poke_int3_handler(struct pt_regs *regs)
> {
> 	[...]
> 	desc = try_get_desc();    // [1]
> 	if (!desc)
> 		return 0;
> 	[...]
> 	if (unlikely(desc->nr_entries > 1)) {
> 		tp = __inline_bsearch(ip, desc->vec, desc->nr_entries,
> 				      sizeof(struct text_poke_loc),
> 				      patch_cmp);
> 		if (!tp)
> 			goto out_put;
> 	} else {
> 		tp = desc->vec;
> 		if (text_poke_addr(tp) != ip)
> 			goto out_put;
> 	}
> 	[...]
> out_put:
> 	put_desc();
> 	return ret;
> }
> 
> While the interrupt handler (poke_int3_handler) is running, the
> patching function may be entered, leaving the reference count
> (refcount) in an improper state. The refcount can be set incorrectly,
> and bp_desc.vec and bp_desc.nr_entries are reinitialized, losing the
> information needed to handle the int3 trap.
> 
> static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
> {	
> 	[...]
> 	lockdep_assert_held(&text_mutex);
> 
> 	bp_desc.vec = tp;
> 	bp_desc.nr_entries = nr_entries;
> 
> 	/*
> 	 * Corresponds to the implicit memory barrier in try_get_desc() to
> 	 * ensure reading a non-zero refcount provides up to date bp_desc data.
> 	 */
> 	atomic_set_release(&bp_desc.refs, 1); // [2]
> 	[...]	
> 	/*
> 	 * Remove and wait for refs to be zero.
> 	 */
> 	if (!atomic_dec_and_test(&bp_desc.refs))   // [3]
> 		atomic_cond_read_acquire(&bp_desc.refs, !VAL);
> 	[...]
> }
> 
> As demonstrated above, bp_desc and its refcount can be modified while
> poke_int3_handler is executing, leading to unexpected behavior.
> 
> Consider a scenario where two CPUs concurrently execute the sequence
> [1] -> [2] -> [3] -> [1], with overlapping operations on the reference
> count.  When [3] is executed, the refcount may drop to zero. As a
> result, when [1] attempts to retrieve the descriptor, it fails,
> leading to a kernel panic.

I'm failing to see how this can happen. The text_poke_bp() caller should
hold text_mutex, so there SHOULD be no concurrency on [2]/[3].

So there is a single CPU doing text_poke_bp():

  mutex_lock(&text_mutex);
  text_poke_bp_batch()
    lockdep_assert_held(&text_mutex);
    atomic_set_release(&bp_desc.refs, 1);	[2]
    smp_wmb();

    poke-first-byte-to-INT3			[A]

    text_poke_sync();

    poke-tail-bytes

    text_poke_sync();

    poke-first-byte

    text_poke_sync();				[B]
							
    if (!atomic_dec_and_test(&bp_desc.refs))	[3]
      atomic_cond_read_acquire(&bp_desc.refs, !VAL);
  mutex_unlock(&text_mutex);


The only concurrency is multiple CPUs hitting the INT3, which only
exists between [A] and [B], and notably, in that range the reference
count should be very much >= 1.

And [3] very specifically waits for all pre-existing interrupt handlers
to complete; at point [B] the INT3 is gone and no new handlers can
possibly start.
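
To make that invariant concrete, here is a minimal user-space model of
the refcount protocol (a sketch in C11 atomics; the function names and
the busy-wait loop are mine, not the kernel's):

	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_int refs;

	/* patcher side, serialized by text_mutex: */
	static void patcher_arm(void)
	{
		atomic_store_explicit(&refs, 1, memory_order_release);	/* [2] */
		/* ... INT3 is live between [A] and [B]; refs >= 1 throughout ... */
	}

	static void patcher_disarm(void)
	{
		/* drop the initial ref, then wait for in-flight handlers: [3] */
		if (atomic_fetch_sub_explicit(&refs, 1, memory_order_release) != 1)
			while (atomic_load_explicit(&refs, memory_order_acquire))
				;
	}

	/* handler side, modelled on try_get_desc()/put_desc(): */
	static bool handler_try_get(void)
	{
		int old = atomic_load_explicit(&refs, memory_order_relaxed);

		do {
			if (!old)
				return false;	/* descriptor already torn down */
		} while (!atomic_compare_exchange_weak_explicit(&refs, &old, old + 1,
				memory_order_acquire, memory_order_relaxed));
		return true;
	}

	static void handler_put(void)
	{
		atomic_fetch_sub_explicit(&refs, 1, memory_order_release);
	}

Because the patcher's initial ref is only dropped at [3], any handler
that runs while the INT3 is reachable sees refs >= 1 and the
inc-not-zero succeeds; a handler that sees 0 bails out, but by then the
INT3 it trapped on is supposed to be gone.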

The INT3 handler (poke_int3_handler()) has the following cases:

 - the boring case: INT3 is observed right after [A]; it gets a ref, does
   the emulation, and completes before [3].

 - the tail case: INT3 is observed somewhere before [B]; it gets a ref and
   does the emulation but completes after [B], in which case [3] will wait
   for it.

Hmm, there *might* be an issue when:

 - INT3 triggers right before [B], but poke_int3_handler()'s try_get_desc()
   is delayed until after [3].

But that is not what you were describing, is it? I think that case is
made impossible by text_poke_sync() itself: it sends an IPI to all
CPUs, and completion of that IPI would block on the completion of the
INT3 that triggered right before [B].

And after the sync-IPI, that CPU must not observe the INT3 anymore.
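
For reference, text_poke_sync() is essentially just a synchronous IPI
broadcast; a sketch of the arch/x86/kernel/alternative.c code
(paraphrased, details may differ by kernel version):

	static void do_sync_core(void *info)
	{
		/* Serialize this CPU; flushes any stale decode of the patch site. */
		sync_core();
	}

	void text_poke_sync(void)
	{
		/* wait=1: do not return until every CPU has run do_sync_core(). */
		on_each_cpu(do_sync_core, NULL, 1);
	}

Since poke_int3_handler() runs with interrupts disabled, a CPU that is
inside the handler services the IPI only after the handler returns,
which is what orders in-flight handlers against [B].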


If you really think there is a problem here, please describe the code
flow in more detail. But given I can't trigger anything, nor actually
see a hole in the code, I'm going to assume you managed to tickle the
QEMU bug.


