Message-ID: <xhsmhr04xfow1.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Mon, 20 Jan 2025 17:09:34 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Uladzislau Rezki <urezki@...il.com>, Jann Horn <jannh@...gle.com>,
linux-kernel@...r.kernel.org, x86@...nel.org,
virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
loongarch@...ts.linux.dev, linux-riscv@...ts.infradead.org,
linux-perf-users@...r.kernel.org, xen-devel@...ts.xenproject.org,
kvm@...r.kernel.org, linux-arch@...r.kernel.org, rcu@...r.kernel.org,
linux-hardening@...r.kernel.org, linux-mm@...ck.org,
linux-kselftest@...r.kernel.org, bpf@...r.kernel.org,
bcm-kernel-feedback-list@...adcom.com, Juergen Gross <jgross@...e.com>,
Ajay Kaher <ajay.kaher@...adcom.com>, Alexey Makhalov
<alexey.amakhalov@...adcom.com>, Russell King <linux@...linux.org.uk>,
Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>, Paul
Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
Albert Ou <aou@...s.berkeley.edu>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave
Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>, Arnaldo Carvalho de Melo
<acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, Mark Rutland
<mark.rutland@....com>, Alexander Shishkin
<alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, Ian
Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>,
"Liang, Kan" <kan.liang@...ux.intel.com>, Boris Ostrovsky
<boris.ostrovsky@...cle.com>, Josh Poimboeuf <jpoimboe@...nel.org>, Pawan
Gupta <pawan.kumar.gupta@...ux.intel.com>, Sean Christopherson
<seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski
<luto@...nel.org>, Arnd Bergmann <arnd@...db.de>, Frederic Weisbecker
<frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>, Jason
Baron <jbaron@...mai.com>, Steven Rostedt <rostedt@...dmis.org>, Ard
Biesheuvel <ardb@...nel.org>, Neeraj Upadhyay
<neeraj.upadhyay@...nel.org>, Joel Fernandes <joel@...lfernandes.org>,
Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Lai Jiangshan
<jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>, Juri Lelli
<juri.lelli@...hat.com>, Clark Williams <williams@...hat.com>, Yair
Podemsky <ypodemsk@...hat.com>, Tomas Glozar <tglozar@...hat.com>, Vincent
Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Kees Cook <kees@...nel.org>, Andrew Morton
<akpm@...ux-foundation.org>, Christoph Hellwig <hch@...radead.org>, Shuah
Khan <shuah@...nel.org>, Sami Tolvanen <samitolvanen@...gle.com>, Miguel
Ojeda <ojeda@...nel.org>, Alice Ryhl <aliceryhl@...gle.com>, "Mike
Rapoport (Microsoft)" <rppt@...nel.org>, Samuel Holland
<samuel.holland@...ive.com>, Rong Xu <xur@...gle.com>, Nicolas Saenz
Julienne <nsaenzju@...hat.com>, Geert Uytterhoeven <geert@...ux-m68k.org>,
Yosry Ahmed <yosryahmed@...gle.com>, "Kirill A. Shutemov"
<kirill.shutemov@...ux.intel.com>, "Masami Hiramatsu (Google)"
<mhiramat@...nel.org>, Jinghao Jia <jinghao7@...inois.edu>, Luis
Chamberlain <mcgrof@...nel.org>, Randy Dunlap <rdunlap@...radead.org>,
Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer
flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably I am missing something and need to have a look at your patches,
>> > but how do you guarantee that no one maps the same area that you defer
>> > for TLB flushing?
>> >
>>
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we simply must not free such areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy threshold
> which can be exposed (if you need it) over sysfs for tuning. So, we can add it.
>
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, and so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.
I was thinking maybe we could subdivide the vmap space into two regions,
each with its own threshold. However, a task may allocate/vmap stuff while
on a HK CPU and then be moved to an isolated CPU afterwards; and besides, I
still don't have any strong guarantee about what accesses an isolated CPU
can do in its early entry code :(
> --
> Uladzislau Rezki