linux-kernel - Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3

Message-ID: <xhsmh8qfzu22n.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Fri, 21 Nov 2025 11:12:16 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Andy Lutomirski <luto@...nel.org>, Linux Kernel Mailing List
 <linux-kernel@...r.kernel.org>, linux-mm@...ck.org, rcu@...r.kernel.org,
 the arch/x86 maintainers <x86@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, loongarch@...ts.linux.dev,
 linux-riscv@...ts.infradead.org, linux-arch@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 "H. Peter Anvin" <hpa@...or.com>, "Peter Zijlstra (Intel)"
 <peterz@...radead.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, Josh
 Poimboeuf <jpoimboe@...nel.org>, Paolo Bonzini <pbonzini@...hat.com>, Arnd
 Bergmann <arnd@...db.de>, Frederic Weisbecker <frederic@...nel.org>, "Paul
 E. McKenney" <paulmck@...nel.org>, Jason Baron <jbaron@...mai.com>, Steven
 Rostedt <rostedt@...dmis.org>, Ard Biesheuvel <ardb@...nel.org>, Sami
 Tolvanen <samitolvanen@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, Neeraj Upadhyay <neeraj.upadhyay@...nel.org>, Joel
 Fernandes <joelagnelf@...dia.com>, Josh Triplett <josh@...htriplett.org>,
 Boqun Feng <boqun.feng@...il.com>, Uladzislau Rezki <urezki@...il.com>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Mel Gorman
 <mgorman@...e.de>, Andrew Morton <akpm@...ux-foundation.org>, Masahiro
 Yamada <masahiroy@...nel.org>, Han Shen <shenhan@...gle.com>, Rik van Riel
 <riel@...riel.com>, Jann Horn <jannh@...gle.com>, Dan Carpenter
 <dan.carpenter@...aro.org>, Oleg Nesterov <oleg@...hat.com>, Juri Lelli
 <juri.lelli@...hat.com>, Clark Williams <williams@...hat.com>, Yair
 Podemsky <ypodemsk@...hat.com>, Marcelo Tosatti <mtosatti@...hat.com>,
 Daniel Wagner <dwagner@...e.de>, Petr Tesarik <ptesarik@...e.com>,
 Shrikanth Hegde <sshegde@...ux.ibm.com>
Subject: Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush
 immediately after a switch to kernel CR3

On 19/11/25 09:31, Andy Lutomirski wrote:
> Let's consider what we're worried about:
>
> 1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.
>
> 2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not about to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)
>
> 3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.
>
> 4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.
>
> 5. Writes to page table dirty bits.  I don't think we use these.
>
> In any case, the current implementation in your series is really, really,
> utterly horrifically slow.

Quite so :-)

> It's probably fine for a task that genuinely sits in usermode forever,
> but I don't think it's likely to be something that we'd be willing to
> enable for normal kernels and normal tasks.  And it would be really nice
> for the don't-interrupt-user-code still to move toward being always
> available rather than further from it.
>

Well, following Frederic's suggestion of using the "is NOHZ_FULL actually in
use" static key in the ASM bits, none of the ugly bits get involved unless
you do have 'nohz_full=' on the cmdline - not perfect, but it's something.

RHEL kernels ship with NO_HZ_FULL=y [1], so we do care about that not impacting
performance too much if it's just compiled-in and not actually used.

[1]: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/blob/main/redhat/configs/common/generic/CONFIG_NO_HZ_FULL
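
For anyone wanting to check which side of that static key a given box would
land on, here is a rough userspace sketch. It only looks for 'nohz_full=' on
the kernel command line; the helper names are made up for illustration, the
whitespace split ignores quoted parameters, and the kernel also needs
CONFIG_NO_HZ_FULL=y plus a valid CPU list for the key to actually be enabled,
so treat the result as an approximation.

```python
from pathlib import Path
from typing import Optional


def nohz_full_param(cmdline: str) -> Optional[str]:
    """Return the value of the first 'nohz_full=' token, or None if absent.

    Hypothetical helper: a plain whitespace split, which is an assumption
    and does not handle quoted kernel parameters.
    """
    prefix = "nohz_full="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text()


if __name__ == "__main__":
    cpus = nohz_full_param(read_cmdline())
    if cpus is None:
        print("no nohz_full= on the cmdline")
    else:
        print(f"nohz_full= requested for CPUs: {cpus}")
```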
>
> I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't
> need any of this.  Some Intel CPUs support RAR and will eventually be
> able to use RAR, possibly even for sync_core().

Yeah that INVLPGB thing looks really nice, and AFAICT arm64 is similarly
covered with TLBI VMALLE1IS.

My goal here is to poke around and find out what's the minimal amount of
ugly we can get away with to suppress those IPIs on existing fleets, but
there's still too much ugly :/