linux-kernel - Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ

Message-ID: <xhsmhzfjpfkky.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Fri, 17 Jan 2025 17:53:33 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Jann Horn <jannh@...gle.com>
Cc: linux-kernel@...r.kernel.org, x86@...nel.org,
 virtualization@...ts.linux.dev, linux-arm-kernel@...ts.infradead.org,
 loongarch@...ts.linux.dev, linux-riscv@...ts.infradead.org,
 linux-perf-users@...r.kernel.org, xen-devel@...ts.xenproject.org,
 kvm@...r.kernel.org, linux-arch@...r.kernel.org, rcu@...r.kernel.org,
 linux-hardening@...r.kernel.org, linux-mm@...ck.org,
 linux-kselftest@...r.kernel.org, bpf@...r.kernel.org,
 bcm-kernel-feedback-list@...adcom.com, Juergen Gross <jgross@...e.com>,
 Ajay Kaher <ajay.kaher@...adcom.com>, Alexey Makhalov
 <alexey.amakhalov@...adcom.com>, Russell King <linux@...linux.org.uk>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>, Paul
 Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
 Albert Ou <aou@...s.berkeley.edu>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave
 Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
 Peter Zijlstra <peterz@...radead.org>, Arnaldo Carvalho de Melo
 <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, Mark Rutland
 <mark.rutland@....com>, Alexander Shishkin
 <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, Ian
 Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>,
 "Liang, Kan" <kan.liang@...ux.intel.com>, Boris Ostrovsky
 <boris.ostrovsky@...cle.com>, Josh Poimboeuf <jpoimboe@...nel.org>, Pawan
 Gupta <pawan.kumar.gupta@...ux.intel.com>, Sean Christopherson
 <seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski
 <luto@...nel.org>, Arnd Bergmann <arnd@...db.de>, Frederic Weisbecker
 <frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>, Jason
 Baron <jbaron@...mai.com>, Steven Rostedt <rostedt@...dmis.org>, Ard
 Biesheuvel <ardb@...nel.org>, Neeraj Upadhyay
 <neeraj.upadhyay@...nel.org>, Joel Fernandes <joel@...lfernandes.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Lai Jiangshan <jiangshanlai@...il.com>,
 Zqiang <qiang.zhang1211@...il.com>, Juri Lelli <juri.lelli@...hat.com>,
 Clark Williams <williams@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>,
 Tomas Glozar <tglozar@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Kees Cook
 <kees@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, Christoph
 Hellwig <hch@...radead.org>, Shuah Khan <shuah@...nel.org>, Sami Tolvanen
 <samitolvanen@...gle.com>, Miguel Ojeda <ojeda@...nel.org>, Alice Ryhl
 <aliceryhl@...gle.com>, "Mike Rapoport (Microsoft)" <rppt@...nel.org>,
 Samuel Holland <samuel.holland@...ive.com>, Rong Xu <xur@...gle.com>,
 Nicolas Saenz Julienne <nsaenzju@...hat.com>, Geert Uytterhoeven
 <geert@...ux-m68k.org>, Yosry Ahmed <yosryahmed@...gle.com>, "Kirill A.
 Shutemov" <kirill.shutemov@...ux.intel.com>, "Masami Hiramatsu (Google)"
 <mhiramat@...nel.org>, Jinghao Jia <jinghao7@...inois.edu>, Luis
 Chamberlain <mcgrof@...nel.org>, Randy Dunlap <rdunlap@...radead.org>,
 Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer
 flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>

Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was running in NOHZ_FULL mode can be an
issue if said CPU accesses it during early entry.
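
To make the failure mode above concrete, here is a toy userspace model of
it. It is only a sketch: it is not kernel code, it models a single virtual
page and two CPUs, and every name in it (toy_cpu, toy_vmap(), toy_purge()
and so on) is made up for illustration rather than taken from mm/vmalloc.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one virtual page, no real kernel APIs. */
enum { NO_PAGE = -1 };

struct toy_cpu {
	int  tlb_pa;        /* cached translation for the single VA, or NO_PAGE */
	bool in_userspace;  /* isolated CPU currently outside the kernel */
	bool flush_pending; /* deferred flush to run at next kernel entry */
};

static int page_table_pa = NO_PAGE; /* authoritative mapping for the VA */

/* Map the VA to a physical page; no TLB maintenance is done here. */
static void toy_vmap(int pa) { page_table_pa = pa; }

/* Lazy unmap: clear the page table entry, leave all TLBs alone. */
static void toy_vunmap(void) { page_table_pa = NO_PAGE; }

/* Translate via the TLB, filling it from the page table on a miss. */
static int toy_access(struct toy_cpu *cpu)
{
	if (cpu->tlb_pa == NO_PAGE)
		cpu->tlb_pa = page_table_pa;
	return cpu->tlb_pa;
}

/* Batched flush: immediate for CPUs in the kernel, deferred otherwise. */
static void toy_purge(struct toy_cpu *cpus, int n)
{
	for (int i = 0; i < n; i++) {
		if (cpus[i].in_userspace)
			cpus[i].flush_pending = true;
		else
			cpus[i].tlb_pa = NO_PAGE;
	}
}

/* Kernel entry: the deferred flush runs here, after any early entry code. */
static void toy_run_deferred_flush(struct toy_cpu *cpu)
{
	if (cpu->flush_pending) {
		cpu->tlb_pa = NO_PAGE;
		cpu->flush_pending = false;
	}
	cpu->in_userspace = false;
}

/* Returns the PA the isolated CPU sees in the danger zone after VA reuse. */
static int toy_stale_scenario(void)
{
	struct toy_cpu cpus[2] = {
		{ NO_PAGE, false, false }, /* housekeeping CPU */
		{ NO_PAGE, false, false }, /* isolated CPU, in the kernel for now */
	};

	toy_vmap(100);
	toy_access(&cpus[1]);        /* isolated CPU caches VA -> 100 */
	cpus[1].in_userspace = true; /* ...then returns to userspace */

	toy_vunmap();
	toy_purge(cpus, 2);          /* flush is deferred for the isolated CPU */
	toy_vmap(200);               /* same VA reused for a new allocation */

	return toy_access(&cpus[1]); /* early entry access, before the flush */
}
```

In this model toy_stale_scenario() returns 100, i.e. the isolated CPU still
translates the reused address to the old page. Only once
toy_run_deferred_flush() has run does toy_access() pick up the new mapping.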

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>

On the bright side of things, arm64 is not as bad as x86 when it comes to
IPI'ing isolated CPUs :-) I'll add that to my notes, thanks!

> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)
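
For reference, the heuristic behind lazy_max_pages() can be sketched in
userspace as below. This is written from memory of mm/vmalloc.c, where the
threshold is fls(num_online_cpus()) multiplied by 32MiB worth of pages, so
please check it against the tree you are working on. toy_fls() and
toy_lazy_max_pages() are hypothetical stand-ins, with the CPU count and
page size passed in as plain parameters.

```c
#include <assert.h>

/* Stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static unsigned int toy_fls(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r; /* 0 when x == 0 */
}

/*
 * Number of lazily freed pages allowed to accumulate before a purge (and
 * its batched TLB flush) is triggered. Scales with log2 of the CPU count.
 */
static unsigned long toy_lazy_max_pages(unsigned int online_cpus,
					unsigned long page_size)
{
	return toy_fls(online_cpus) * (32UL * 1024 * 1024 / page_size);
}
```

With 4KiB pages this gives 8192 pages (32MiB) for a single CPU and 57344
pages (224MiB) for 64 CPUs, so the threshold grows less than linearly with
the number of CPUs.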