linux-kernel - Re: [PATCH v5 00/25] context_tracking,x86: Defer some IPIs until a user->kernel transition

Message-ID: <6c44fa0e-28ed-400e-aaf2-e0e0720d3811@intel.com>
Date: Fri, 2 May 2025 07:33:55 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Steven Rostedt <rostedt@...dmis.org>,
 Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org,
 virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
 loongarch@...ts.linux.dev, linux-riscv@...ts.infradead.org,
 linux-perf-users@...r.kernel.org, kvm@...r.kernel.org,
 linux-arch@...r.kernel.org, linux-modules@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, rcu@...r.kernel.org,
 linux-hardening@...r.kernel.org, linux-kselftest@...r.kernel.org,
 bpf@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
 Marcelo Tosatti <mtosatti@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>,
 Josh Poimboeuf <jpoimboe@...nel.org>, Daniel Wagner <dwagner@...e.de>,
 Petr Tesarik <ptesarik@...e.com>, Nicolas Saenz Julienne
 <nsaenz@...zon.com>, Frederic Weisbecker <frederic@...nel.org>,
 "Paul E. McKenney" <paulmck@...nel.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 Sean Christopherson <seanjc@...gle.com>, Juergen Gross <jgross@...e.com>,
 Ajay Kaher <ajay.kaher@...adcom.com>,
 Alexey Makhalov <alexey.amakhalov@...adcom.com>,
 Broadcom internal kernel review list
 <bcm-kernel-feedback-list@...adcom.com>, Russell King
 <linux@...linux.org.uk>, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>, Huacai Chen <chenhuacai@...nel.org>,
 WANG Xuerui <kernel@...0n.name>, Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>,
 Alexandre Ghiti <alex@...ti.fr>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
 Adrian Hunter <adrian.hunter@...el.com>,
 "Liang, Kan" <kan.liang@...ux.intel.com>,
 Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
 Paolo Bonzini <pbonzini@...hat.com>, Arnd Bergmann <arnd@...db.de>,
 Jason Baron <jbaron@...mai.com>, Ard Biesheuvel <ardb@...nel.org>,
 Luis Chamberlain <mcgrof@...nel.org>, Petr Pavlu <petr.pavlu@...e.com>,
 Sami Tolvanen <samitolvanen@...gle.com>, Daniel Gomez
 <da.gomez@...sung.com>, Naveen N Rao <naveen@...nel.org>,
 Anil S Keshavamurthy <anil.s.keshavamurthy@...el.com>,
 "David S. Miller" <davem@...emloft.net>,
 Masami Hiramatsu <mhiramat@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
 Joel Fernandes <joel@...lfernandes.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
 <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Kees Cook <kees@...nel.org>, Shuah Khan <shuah@...nel.org>,
 Masahiro Yamada <masahiroy@...nel.org>, Alice Ryhl <aliceryhl@...gle.com>,
 Miguel Ojeda <ojeda@...nel.org>, "Mike Rapoport (Microsoft)"
 <rppt@...nel.org>, Rong Xu <xur@...gle.com>,
 Rafael Aquini <aquini@...hat.com>, Song Liu <song@...nel.org>,
 Andrii Nakryiko <andrii@...nel.org>, Dan Carpenter
 <dan.carpenter@...aro.org>, Brian Gerst <brgerst@...il.com>,
 "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
 Benjamin Berg <benjamin.berg@...el.com>,
 Vishal Annapurve <vannapurve@...gle.com>,
 Randy Dunlap <rdunlap@...radead.org>, John Stultz <jstultz@...gle.com>,
 Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v5 00/25] context_tracking,x86: Defer some IPIs until a
 user->kernel transition

On 5/2/25 04:22, Peter Zijlstra wrote:
> On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
> 
>> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
>> You can go buy the Intel hardware off the shelf today.
> To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely
> sub-par compared to the AMD and ARM things.
> 
> Furthermore, the paper states it is a uarch feature for SPR with no
> guarantee future uarchs will get it (and to be fair, I'd prefer it if
> they didn't).

I don't think any of that is set in stone, fwiw. It should be entirely
possible to obtain a longer promise about its availability.

Or ask that AMD and Intel put their heads together in their fancy new
x86 advisory group and figure out a single way forward. If you're right
that RAR stinks and INVLPGB rocks, then it'll be an easy thing to advise.

> Furthermore, I suspect it will actually be slower than IPIs for anything
> with more than 64 logical CPUs due to reduced parallelism.

Maybe my brain is crusty and I need to go back and read the spec, but I
remember RAR using the normal old APIC programming that normal old TLB
flush IPIs use. So they have similar restrictions. If it's inefficient
to program a wide IPI, it's also inefficient to program a RAR operation.
So the (theoretical) pro is that you program it like an IPI and it slots
into the IPI code fairly easily. But the con is that it has the same
limitations as IPIs.

I was actually concerned that INVLPGB won't be scalable. Since it
doesn't have the ability to target specific CPUs in the ISA, it
fundamentally needs to either have a mechanism to reach all CPUs, or some
way to know which TLB entries each CPU might have.

Maybe AMD has something super duper clever to limit the broadcast scope.
But if they don't, then a small range flush on a small number of CPUs
might end up being pretty expensive, relatively.

I don't think this is a big problem in Rik's series because he had a
floor on the size of processes that get INVLPGB applied. Also, if it
turns out to be a problem, it's dirt simple to fall back to IPIs for
problematic TLB flushes.
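
To make the "floor plus fallback" idea concrete, here is a rough sketch
of what such a policy check could look like. It is not code from Rik's
series. Every name in it (pick_flush_method, struct flush_policy,
BROADCAST_MIN_CPUS) and the threshold value are made up for
illustration, and it assumes "size" means the number of CPUs a process
has run on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical floor: minimum number of target CPUs before a broadcast
 * flush is even considered. The value 4 is an arbitrary placeholder.
 */
#define BROADCAST_MIN_CPUS 4

enum flush_method { FLUSH_IPI, FLUSH_BROADCAST };

/* Hypothetical policy state, not a real kernel structure. */
struct flush_policy {
	bool broadcast_supported;	/* CPU offers a broadcast invalidate */
	bool broadcast_disabled;	/* escape hatch: always use IPIs */
};

/*
 * Pick a flush method for a flush that must reach nr_target_cpus CPUs.
 * Small targets stay on IPIs so a broadcast is not paid for a handful
 * of CPUs; the disable flag models reverting to IPIs wholesale.
 */
enum flush_method pick_flush_method(const struct flush_policy *p,
				    unsigned int nr_target_cpus)
{
	assert(p != NULL);

	if (!p->broadcast_supported || p->broadcast_disabled)
		return FLUSH_IPI;

	if (nr_target_cpus < BROADCAST_MIN_CPUS)
		return FLUSH_IPI;

	return FLUSH_BROADCAST;
}
```

The point of keeping the decision in one small function is that the
fallback stays cheap: flipping one flag (or raising the floor) sends
the problematic flushes back down the IPI path without touching the
callers.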

But I am deeply curious how the system will behave if there are a
boatload of processes doing modestly-sized INVLPGBs that only apply to a
handful of CPUs on a very large system.

AMD and Intel came at this from very different angles (go figure). The
designs are prioritizing different things for sure. I can't wait to see
both of them fighting it out under real workloads.