Message-ID: <019f6713-cfbd-466b-8fb5-dcd982cf8644@intel.com>
Date: Wed, 30 Apr 2025 11:07:35 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org,
virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
loongarch@...ts.linux.dev, linux-riscv@...ts.infradead.org,
linux-perf-users@...r.kernel.org, kvm@...r.kernel.org,
linux-arch@...r.kernel.org, linux-modules@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, rcu@...r.kernel.org,
linux-hardening@...r.kernel.org, linux-kselftest@...r.kernel.org,
bpf@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>,
Josh Poimboeuf <jpoimboe@...nel.org>, Daniel Wagner <dwagner@...e.de>,
Petr Tesarik <ptesarik@...e.com>, Nicolas Saenz Julienne
<nsaenz@...zon.com>, Frederic Weisbecker <frederic@...nel.org>,
"Paul E. McKenney" <paulmck@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Sean Christopherson <seanjc@...gle.com>, Juergen Gross <jgross@...e.com>,
Ajay Kaher <ajay.kaher@...adcom.com>,
Alexey Makhalov <alexey.amakhalov@...adcom.com>,
Broadcom internal kernel review list
<bcm-kernel-feedback-list@...adcom.com>, Russell King
<linux@...linux.org.uk>, Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>, Huacai Chen <chenhuacai@...nel.org>,
WANG Xuerui <kernel@...0n.name>, Paul Walmsley <paul.walmsley@...ive.com>,
Palmer Dabbelt <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>,
Alexandre Ghiti <alex@...ti.fr>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
"Liang, Kan" <kan.liang@...ux.intel.com>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>, Arnd Bergmann <arnd@...db.de>,
Jason Baron <jbaron@...mai.com>, Ard Biesheuvel <ardb@...nel.org>,
Luis Chamberlain <mcgrof@...nel.org>, Petr Pavlu <petr.pavlu@...e.com>,
Sami Tolvanen <samitolvanen@...gle.com>, Daniel Gomez
<da.gomez@...sung.com>, Naveen N Rao <naveen@...nel.org>,
Anil S Keshavamurthy <anil.s.keshavamurthy@...el.com>,
"David S. Miller" <davem@...emloft.net>,
Masami Hiramatsu <mhiramat@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
Uladzislau Rezki <urezki@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Kees Cook <kees@...nel.org>, Shuah Khan <shuah@...nel.org>,
Masahiro Yamada <masahiroy@...nel.org>, Alice Ryhl <aliceryhl@...gle.com>,
Miguel Ojeda <ojeda@...nel.org>, "Mike Rapoport (Microsoft)"
<rppt@...nel.org>, Rong Xu <xur@...gle.com>,
Rafael Aquini <aquini@...hat.com>, Song Liu <song@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>, Dan Carpenter
<dan.carpenter@...aro.org>, Brian Gerst <brgerst@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Benjamin Berg <benjamin.berg@...el.com>,
Vishal Annapurve <vannapurve@...gle.com>,
Randy Dunlap <rdunlap@...radead.org>, John Stultz <jstultz@...gle.com>,
Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v5 00/25] context_tracking,x86: Defer some IPIs until a
user->kernel transition

On 4/30/25 10:20, Steven Rostedt wrote:
> On Tue, 29 Apr 2025 09:11:57 -0700
> Dave Hansen <dave.hansen@...el.com> wrote:
>
>> I don't think we should do this series.
>
> Could you provide more rationale for your decision?

I talked about it a bit in here:

> https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/

But, basically, this series puts a new onus on the entry code: it can't
touch the vmalloc() area ... except the LDT ... and except the PEBS
buffers. If anyone touches vmalloc()'d memory (or anything else that
eventually gets deferred), they crash. They _only_ crash on these
NOHZ_FULL systems.
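
To make the failure mode concrete, here is a toy model in plain C. It is
only a sketch, not kernel code: toy_cpu, kernel_map_gen and the helper
names are all made up for illustration, and the model assumes the
isolated CPU is sitting in userspace when the mapping changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a kernel symbol. */
static unsigned long kernel_map_gen;	/* bumped when a vmalloc()-style mapping changes */

struct toy_cpu {
	unsigned long tlb_gen;	/* generation this CPU's TLB last synced to */
	bool nohz_full;		/* isolated CPU (assumed in userspace): flush is deferred */
};

/* A kernel mapping changes. Non-isolated CPUs are flushed right away (the IPI). */
static void change_kernel_mapping(struct toy_cpu *cpus, int nr)
{
	assert(nr > 0);
	kernel_map_gen++;
	for (int i = 0; i < nr; i++)
		if (!cpus[i].nohz_full)
			cpus[i].tlb_gen = kernel_map_gen;
}

/* The deferred flush, run at some point on the user->kernel path. */
static void run_deferred_flush(struct toy_cpu *cpu)
{
	cpu->tlb_gen = kernel_map_gen;
}

/* Would touching the changed memory use a current translation? */
static bool touch_is_safe(const struct toy_cpu *cpu)
{
	return cpu->tlb_gen == kernel_map_gen;
}
```

In this model, any entry code that runs before run_deferred_flush() on
the isolated CPU sees touch_is_safe() return false, while the same code
on every other CPU is fine. That asymmetry is the problem.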

Putting new restrictions on the entry code is really nasty. Let's say a
new hardware feature showed up that touched vmalloc()'d memory in the
entry code. Probably, nobody would notice until they got that new
hardware and tried to do a NOHZ_FULL workload. It might take years to
uncover, once that hardware was out in the wild.

I have a substantial number of gray hairs from dealing with corner cases
in the entry code.

You _could_ make it more debuggable. Could you make this work for all
tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be
inefficient, but would provide good debugging coverage.
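
As a rough sketch of that debugging idea (hypothetical names, plain C,
not taken from the series): the decision to defer could take a debug
knob that widens it from NOHZ_FULL CPUs to every CPU that is in
userspace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU view for this sketch; not a kernel structure. */
struct toy_cpu_state {
	bool in_user;	/* CPU is currently running userspace */
	bool nohz_full;	/* CPU is isolated with NOHZ_FULL */
};

/* Hypothetical debug knob: defer everywhere to widen test coverage. */
static bool debug_defer_everywhere;

/* Should a kernel-range flush aimed at this CPU be deferred? */
static bool should_defer_flush(const struct toy_cpu_state *s)
{
	assert(s);
	if (!s->in_user)
		return false;	/* a CPU in the kernel must be flushed now */
	return s->nohz_full || debug_defer_everywhere;
}
```

With the knob set, ordinary systems would hit the same entry-code
restrictions, so a violation would show up in everyday testing instead
of only on isolated CPUs.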

I also mentioned this earlier, but PTI could be leveraged here to ensure
that the TLB is flushed properly. You could have the rule that anything
mapped into the user page table can't have a deferred flush and then do
deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in
arch-specific assembly, but it's a million times easier to reason about
because the window where a deferred-flush allocation might bite you is
so small.

Look at the syscall code for instance:

> SYM_CODE_START(entry_SYSCALL_64)
> swapgs
> movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
> SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

You can _trivially_ audit this and know that swapgs doesn't touch memory
and that as long as PER_CPU_VAR()s and the process stack don't have
their mappings munged and flushes deferred that this would be correct.
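
The PTI-based rule above can be modeled in a few lines of C. This is a
single-threaded toy under stated assumptions, not an implementation:
cpu_model, request_flush() and the rest are invented names,
enter_kernel() stands in for the point where SWITCH_TO_KERNEL_CR3 runs,
and a real version would need atomics and would live in the entry
assembly.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU state for this toy model; not kernel structures. */
struct cpu_model {
	bool in_user;		/* CPU is running on the user page table */
	bool flush_pending;	/* a kernel-range flush was deferred */
	unsigned long flushes;	/* flushes actually executed on this CPU */
	unsigned long ipis;	/* flush interrupts this CPU had to take */
};

static struct cpu_model cpus[NR_CPUS];

/*
 * Remote side. A range that is also mapped in the user page table may
 * not be deferred, so it always costs an immediate flush (an IPI here).
 */
static void request_flush(int cpu, bool mapped_in_user_pgtable)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpus[cpu].in_user && !mapped_in_user_pgtable) {
		cpus[cpu].flush_pending = true;	/* defer, no interrupt */
		return;
	}
	cpus[cpu].ipis++;
	cpus[cpu].flushes++;
}

/* Stands in for SWITCH_TO_KERNEL_CR3: run any deferred flush first. */
static void enter_kernel(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpus[cpu].flush_pending) {
		cpus[cpu].flushes++;
		cpus[cpu].flush_pending = false;
	}
	cpus[cpu].in_user = false;
}

static void exit_to_user(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].in_user = true;
}
```

The only code that can run with a stale translation is whatever executes
between the hardware entry point and enter_kernel(), which in the
syscall path quoted above is two instructions.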

>> If folks want this functionality, they should get a new CPU that can
>> flush the TLB without IPIs.
>
> That's a pretty heavy handed response. I'm not sure that's always a
> feasible solution.
>
> From my experience in the world, software has always been around to fix the
> hardware, not the other way around ;-)

Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
You can go buy the Intel hardware off the shelf today.