Message-ID: <CAHDw0oGd0B4=uuv8NGqbUQ_ZVmSheU2bN70e4QhFXWvuAZdt2w@mail.gmail.com>
Date: Thu, 9 Oct 2025 15:01:25 +0100
From: Stephen Dolan <sdolan@...estreet.com>
To: Dave Hansen <dave.hansen@...el.com>, Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Eric Hagberg <ehagberg@...estreet.com>,
Nick Barnes <nbarnes@...estreet.com>
Subject: x86/mm: Finishing off the fix for a should_flush_tlb race
Back in May, I reported a bug in x86 TLB management, where a race
between a TLB shootdown and a context switch could leave a process
running with stale TLB entries.
This was reported off-list since it seemed like it might be
security-sensitive: in principle, the stale TLB could leak information
from another process. The details were disclosed in CVE-2025-37964. In
practice, the bug is hard to trigger, and when you hit it you generally
get your own stale state (e.g. from before a fork() or an mmap()) rather
than someone else's.
Dave Hansen wrote a fix in this commit:
commit fea4e317f9e7e1f449ce90dedc27a2d2a95bee5a
Author: Dave Hansen <dave.hansen@...ux.intel.com>
Date: Thu May 8 15:41:32 2025 -0700
x86/mm: Eliminate window where TLB flushes may be inadvertently skipped
which was backported to the 6.1, 6.6 and 6.12 stable trees.
However, there's a minor problem with the original fix and a major
problem with the backported versions, which means that the bug still
reproduces on 6.1, 6.6 and 6.12 stable kernels (as Eric Hagberg, cc'd,
has verified using the repro case at the end of this email), and
theoretically might on the latest tree.
The core of the fix is that:
- during shootdown, should_flush_tlb checks for LOADED_MM_SWITCHING
*after* updating tlb_gen
- during context switch, switch_mm_irqs_off sets LOADED_MM_SWITCHING
*before* reading tlb_gen
That way, either the shootdown sees LOADED_MM_SWITCHING and sends an
IPI, or switch_mm_irqs_off sees the updated tlb_gen (there's a
user-space sketch of this pattern after the list below). The problem in
both cases is about the *before*-ness in switch_mm_irqs_off:
- in the latest tree, there isn't enough fencing to enforce this
ordering.
- in the stable kernel trees (6.1, 6.6, 6.12), the code is in the
wrong order.
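For illustration only, here's a minimal user-space sketch of that
pattern, using C11 atomics in place of the kernel primitives (loaded_mm,
tlb_gen and SWITCHING below are just stand-ins for the real fields, not
kernel code):

/* Minimal user-space model of the required ordering (not kernel code). */
#include <stdatomic.h>
#include <stdint.h>

#define SWITCHING 1

static _Atomic uintptr_t loaded_mm; /* stand-in for cpu_tlbstate.loaded_mm */
static _Atomic uint64_t tlb_gen;    /* stand-in for mm->context.tlb_gen */

/* Context-switch side: publish "switching", *then* read tlb_gen. */
static uint64_t switcher(void)
{
    atomic_store_explicit(&loaded_mm, SWITCHING, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the smp_mb() at issue */
    return atomic_load_explicit(&tlb_gen, memory_order_relaxed);
}

/* Shootdown side: bump tlb_gen, *then* decide whether to send an IPI. */
static int flusher(void)
{
    atomic_fetch_add_explicit(&tlb_gen, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() implied by the atomic */
    return atomic_load_explicit(&loaded_mm, memory_order_relaxed) == SWITCHING;
}

With both full fences this is the classic store-buffering shape:
switcher() missing the tlb_gen bump while flusher() simultaneously
misses the SWITCHING store is forbidden. Weaken or misplace the fence on
the switch side and that outcome becomes allowed, which is exactly the
stale-TLB window.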
I've reposted below a patch by Peter Zijlstra and Ingo Molnar for the
first issue, and written a small patch for the stable trees to move the
code into the right place. Both patches and a repro case for the bug are
below.
Thanks,
Stephen
========================>
Subject: x86/mm: Fix SMP ordering in switch_mm_irqs_off()
From: Peter Zijlstra <peterz@...radead.org>
Date: Mon, 12 May 2025 11:11:34 +0200
Stephen noted that it is possible to not have an smp_mb() between
the loaded_mm store and the tlb_gen load in switch_mm(), meaning the
ordering against flush_tlb_mm_range() goes out the window, and it
becomes possible for switch_mm() to not observe a recent tlb_gen
update and fail to flush the TLBs.
Fixes: 209954cbc7d0 ("x86/mm/tlb: Update mm_cpumask lazily")
Reported-by: Stephen Dolan <sdolan@...estreet.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Signed-off-by: Ingo Molnar <mingo@...nel.org>
Cc: H. Peter Anvin <hpa@...or.com>
Cc: Andrew Cooper <andrew.cooper3@...rix.com>
Cc: Andy Lutomirski <luto@...nel.org>
Cc: Brian Gerst <brgerst@...il.com>
Cc: Dave Hansen <dave.hansen@...el.com>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Eric Hagberg <ehagberg@...estreet.com>
Cc: Juergen Gross <jgross@...e.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Nicholas Barnes <nbarnes@...estreet.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Rik van Riel <riel@...riel.com>
---
arch/x86/mm/tlb.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
Index: tip/arch/x86/mm/tlb.c
===================================================================
--- tip.orig/arch/x86/mm/tlb.c
+++ tip/arch/x86/mm/tlb.c
@@ -911,11 +911,31 @@ void switch_mm_irqs_off(struct mm_struct
* CR3 and cpu_tlbstate.loaded_mm are not all in sync.
*/
this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
- barrier();
- /* Start receiving IPIs and then read tlb_gen (and LAM below) */
+ /*
+ * Make sure this CPU is set in mm_cpumask() such that we'll
+ * receive invalidation IPIs.
+ *
+ * Rely on the smp_mb() implied by cpumask_set_cpu()'s atomic
+ * operation, or explicitly provide one. Such that:
+ *
+ *    switch_mm_irqs_off()                        flush_tlb_mm_range()
+ *      smp_store_release(loaded_mm, SWITCHING);    atomic64_inc_return(tlb_gen)
+ *      smp_mb(); // here                           // smp_mb() implied
+ *      atomic64_read(tlb_gen);                     this_cpu_read(loaded_mm);
+ *
+ * we properly order against flush_tlb_mm_range(), where the
+ * loaded_mm load can happen in native_flush_tlb_multi() ->
+ * should_flush_tlb().
+ *
+ * This way switch_mm() must see the new tlb_gen or
+ * flush_tlb_mm_range() must see the new loaded_mm, or both.
+ */
if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
cpumask_set_cpu(cpu, mm_cpumask(next));
+ else
+ smp_mb();
+
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
ns = choose_new_asid(next, next_tlb_gen);
========================>
Proposed patch for stable kernel trees below.
The issue here is simpler: two code blocks are in the wrong order due to
rebasing across some refactoring. The patch below is against 6.12.49,
but the change is the same on the 6.1 and 6.6 trees. (In these kernels
the cpumask_test_cpu change wasn't backported, so I don't think the
extra smp_mb() check is needed.)
From: Stephen Dolan <sdolan@...estreet.com>
Date: Thu, 9 Oct 2025 14:46:47 +0100
Subject: [PATCH] x86/mm: Fix check/use ordering in switch_mm_irqs_off()
To avoid racing with should_flush_tlb, setting loaded_mm to LOADED_MM_SWITCHING
must happen before reading tlb_gen.
---
arch/x86/mm/tlb.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8629d90fdcd9..e87ef47cfb09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -606,6 +606,14 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
cond_mitigation(tsk);
+ /*
+ * Indicate that CR3 is about to change. nmi_uaccess_okay()
+ * and others are sensitive to the window where mm_cpumask(),
+ * CR3 and cpu_tlbstate.loaded_mm are not all in sync.
+ */
+ this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+ barrier();
+
/*
* Stop remote flushes for the previous mm.
* Skip kernel threads; we never send init_mm TLB flushing IPIs,
@@ -623,14 +631,6 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
- /*
- * Indicate that CR3 is about to change. nmi_uaccess_okay()
- * and others are sensitive to the window where mm_cpumask(),
- * CR3 and cpu_tlbstate.loaded_mm are not all in sync.
- */
- this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
- barrier();
}
new_lam = mm_lam_cr3_mask(next);
========================>
/* Reproduction case for the bug */
#include <stdint.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mman.h>
_Atomic uint64_t *A, *B;
static void writer()
{
(*A)++;
(*B)++;
}
static void reader()
{
uint64_t b = *B;
uint64_t a = *A;
/* Since A is always incremented before B, and we read B first,
this should never trap */
if (a < b) __builtin_trap();
}
static void* reader_thread(void* unused)
{
/* This thread calls reader() repeatedly, but does a context-switch
to another process after every call */
int p[2], q[2];
char buf[1] = "!";
pipe(p);
pipe(q);
pid_t pid = fork();
if (pid != 0) {
while (1) {
reader();
write(p[1], buf, 1);
read(q[0], buf, 1);
}
} else {
while (1) {
read(p[0], buf, 1);
write(q[1], buf, 1);
}
}
}
int main() {
enum {PAGE = 4096};
A = mmap(NULL, PAGE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
B = mmap(NULL, PAGE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
pthread_t pth;
pthread_create(&pth, NULL, &reader_thread, NULL);
/* This calls writer() repeatedly, but before every call it forks.
This causes *A to be mapped to a COW page and so the write faults to
allocate a fresh page for *A. (*B is MAP_SHARED, so it does not
become COW).
The bug is that the reader may fail to see the copied A, and still
use the old A page while seeing the new B value. */
while (1) {
pid_t pid = fork();
if (pid == 0) {
_exit(0);
} else {
writer();
wait(NULL);
}
}
}
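(To try the reproducer: build it with cc -O2 -pthread; on an affected
kernel the __builtin_trap() in reader() should eventually fire.)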