Message-ID: <xhsmhecpukowa.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 19 Nov 2025 16:44:53 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Andy Lutomirski <luto@...nel.org>, Linux Kernel Mailing List
 <linux-kernel@...r.kernel.org>, linux-mm@...ck.org, rcu@...r.kernel.org,
 the arch/x86 maintainers <x86@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, loongarch@...ts.linux.dev,
 linux-riscv@...ts.infradead.org, linux-arch@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 "H. Peter Anvin" <hpa@...or.com>, "Peter Zijlstra (Intel)"
 <peterz@...radead.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, Josh
 Poimboeuf <jpoimboe@...nel.org>, Paolo Bonzini <pbonzini@...hat.com>, Arnd
 Bergmann <arnd@...db.de>, Frederic Weisbecker <frederic@...nel.org>, "Paul
 E. McKenney" <paulmck@...nel.org>, Jason Baron <jbaron@...mai.com>, Steven
 Rostedt <rostedt@...dmis.org>, Ard Biesheuvel <ardb@...nel.org>, Sami
 Tolvanen <samitolvanen@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, Neeraj Upadhyay <neeraj.upadhyay@...nel.org>, Joel
 Fernandes <joelagnelf@...dia.com>, Josh Triplett <josh@...htriplett.org>,
 Boqun Feng <boqun.feng@...il.com>, Uladzislau Rezki <urezki@...il.com>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Mel Gorman
 <mgorman@...e.de>, Andrew Morton <akpm@...ux-foundation.org>, Masahiro
 Yamada <masahiroy@...nel.org>, Han Shen <shenhan@...gle.com>, Rik van Riel
 <riel@...riel.com>, Jann Horn <jannh@...gle.com>, Dan Carpenter
 <dan.carpenter@...aro.org>, Oleg Nesterov <oleg@...hat.com>, Juri Lelli
 <juri.lelli@...hat.com>, Clark Williams <williams@...hat.com>, Yair
 Podemsky <ypodemsk@...hat.com>, Marcelo Tosatti <mtosatti@...hat.com>,
 Daniel Wagner <dwagner@...e.de>, Petr Tesarik <ptesarik@...e.com>,
 Shrikanth Hegde <sshegde@...ux.ibm.com>
Subject: Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush
 immediately after a switch to kernel CR3

On 19/11/25 06:31, Andy Lutomirski wrote:
> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>> Deferring kernel range TLB flushes requires the guarantee that upon
>> entering the kernel, no stale entry may be accessed. The simplest way to
>> provide such a guarantee is to issue an unconditional flush upon switching
>> to the kernel CR3, as this is the pivoting point where such stale entries
>> may be accessed.
>>
>
> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>
> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to dereference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>

So v4 and earlier had the TLB flush faff done in C in the context_tracking
entry path, just like sync_core().
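
For anyone who has not followed the earlier revisions, the C-side deferral can be pictured with the userspace model below. It is only a sketch under assumptions: every name in it (ct_state, CT_IN_KERNEL, WORK_TLB_FLUSH, try_defer_work(), kernel_enter(), kernel_exit()) is made up for illustration, nothing is copied from the series, and a counter stands in for the actual flush.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define NR_CPUS        4
#define CT_IN_KERNEL   0x1u   /* set while the modelled CPU runs kernel code */
#define WORK_TLB_FLUSH 0x2u   /* deferred work bit: flush kernel-range TLB entries */

/* One word per CPU holding both the context state and the pending work bits. */
static atomic_uint ct_state[NR_CPUS];   /* zero: in userspace, no work pending */
static unsigned int flush_count[NR_CPUS]; /* stand-in for a real TLB flush */

/*
 * Requester side. Returns true if the work was queued for the target's next
 * kernel entry, false if the target is in the kernel and needs an IPI now.
 */
static bool try_defer_work(int cpu, unsigned int work)
{
    unsigned int old = atomic_load(&ct_state[cpu]);

    do {
        if (old & CT_IN_KERNEL)
            return false;
        /* On CAS failure 'old' is reloaded, so the check is repeated. */
    } while (!atomic_compare_exchange_weak(&ct_state[cpu], &old, old | work));

    return true;
}

/* Target side: kernel entry atomically flips the state and collects the work. */
static unsigned int kernel_enter(int cpu)
{
    unsigned int work = atomic_exchange(&ct_state[cpu], CT_IN_KERNEL) & ~CT_IN_KERNEL;

    if (work & WORK_TLB_FLUSH)
        flush_count[cpu]++;   /* a real kernel would flush the TLB here */

    return work;
}

/* Target side: returning to userspace clears the in-kernel bit. */
static void kernel_exit(int cpu)
{
    atomic_store(&ct_state[cpu], 0);
}
```

Keeping the state and the work bits in one word is what makes the model race-free: a requester either sees the CPU in the kernel and falls back to an IPI, or its work bit is guaranteed to be picked up by the next kernel_enter(). What the model cannot capture is the window between the hardware entry and the point where this C code runs, which is the subject of this thread.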

My biggest issue with it was that I couldn't figure out a way to instrument
memory accesses such that I would get an idea of where accesses to vmalloc'd
memory happen - even with a hackish thing just to survey the landscape. So
while I agree with your reasoning wrt entry noinstr code, I don't have any
way to prove it.
That's unlike the text_poke sync_core() deferral, for which I have all of
that nice objtool instrumentation.

Dave also pointed out that the whole stale entry flush deferral is a risky
move, and that the sanest thing would be to execute the deferred flush just
after switching to the kernel CR3.

See the thread surrounding:
  https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/

mainly Dave's reply and subthread:
  https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/

> We do need to watch out for NMI/MCE hitting before we flush.

