linux-kernel - Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhsmhecztj4c9.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 19 Feb 2025 17:18:14 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Joel Fernandes <joelagnelf@...dia.com>
Cc: Jann Horn <jannh@...gle.com>, linux-kernel@...r.kernel.org,
 x86@...nel.org, virtualization@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org, loongarch@...ts.linux.dev,
 linux-riscv@...ts.infradead.org, linux-perf-users@...r.kernel.org,
 xen-devel@...ts.xenproject.org, kvm@...r.kernel.org,
 linux-arch@...r.kernel.org, rcu@...r.kernel.org,
 linux-hardening@...r.kernel.org, linux-mm@...ck.org,
 linux-kselftest@...r.kernel.org, bpf@...r.kernel.org,
 bcm-kernel-feedback-list@...adcom.com, Juergen Gross <jgross@...e.com>,
 Ajay Kaher <ajay.kaher@...adcom.com>, Alexey Makhalov
 <alexey.amakhalov@...adcom.com>, Russell King <linux@...linux.org.uk>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>, Paul
 Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
 Albert Ou <aou@...s.berkeley.edu>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave
 Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
 Peter Zijlstra <peterz@...radead.org>, Arnaldo Carvalho de Melo
 <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, Mark Rutland
 <mark.rutland@....com>, Alexander Shishkin
 <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, Ian
 Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>,
 "Liang, Kan" <kan.liang@...ux.intel.com>, Boris Ostrovsky
 <boris.ostrovsky@...cle.com>, Josh Poimboeuf <jpoimboe@...nel.org>, Pawan
 Gupta <pawan.kumar.gupta@...ux.intel.com>, Sean Christopherson
 <seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski
 <luto@...nel.org>, Arnd Bergmann <arnd@...db.de>, Frederic Weisbecker
 <frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>, Jason
 Baron <jbaron@...mai.com>, Steven Rostedt <rostedt@...dmis.org>, Ard
 Biesheuvel <ardb@...nel.org>, Neeraj Upadhyay
 <neeraj.upadhyay@...nel.org>, Joel Fernandes <joel@...lfernandes.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Lai Jiangshan <jiangshanlai@...il.com>,
 Zqiang <qiang.zhang1211@...il.com>, Juri Lelli <juri.lelli@...hat.com>,
 Clark Williams <williams@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>,
 Tomas Glozar <tglozar@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Kees Cook
 <kees@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, Christoph
 Hellwig <hch@...radead.org>, Shuah Khan <shuah@...nel.org>, Sami Tolvanen
 <samitolvanen@...gle.com>, Miguel Ojeda <ojeda@...nel.org>, Alice Ryhl
 <aliceryhl@...gle.com>, "Mike Rapoport (Microsoft)" <rppt@...nel.org>,
 Samuel Holland <samuel.holland@...ive.com>, Rong Xu <xur@...gle.com>,
 Nicolas Saenz Julienne <nsaenzju@...hat.com>, Geert Uytterhoeven
 <geert@...ux-m68k.org>, Yosry Ahmed <yosryahmed@...gle.com>, "Kirill A.
 Shutemov" <kirill.shutemov@...ux.intel.com>, "Masami Hiramatsu (Google)"
 <mhiramat@...nel.org>, Jinghao Jia <jinghao7@...inois.edu>, Luis
 Chamberlain <mcgrof@...nel.org>, Randy Dunlap <rdunlap@...radead.org>,
 Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer
 flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

On 19/02/25 10:05, Joel Fernandes wrote:
> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>> On 17/01/25 16:52, Jann Horn wrote:
>> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >
>> > flush_cache_vmap() is about stuff like flushing data caches on
>> > architectures with virtually indexed caches; that doesn't do TLB
>> > maintenance. When you look for its definition on x86 or arm64, you'll
>> > see that they use the generic implementation which is simply an empty
>> > inline function.
>> >
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >
>> > The design idea wrt TLB flushes in the vmap code is that you don't do
>> > TLB flushes when you unmap stuff or when you map stuff, because doing
>> > TLB flushes across the entire system on every vmap/vunmap would be a
>> > bit costly; instead you just do batched TLB flushes in between, in
>> > __purge_vmap_area_lazy().
>> >
>> > In other words, the basic idea is that you can keep calling vmap() and
>> > vunmap() a bunch of times without ever doing TLB flushes until you run
>> > out of virtual memory in the vmap region; then you do one big TLB
>> > flush, and afterwards you can reuse the free virtual address space for
>> > new allocations again.
>> >
>> > So if you "defer" that batched TLB flush for CPUs that are not
>> > currently running in the kernel, I think the consequence is that those
>> > CPUs may end up with incoherent TLB state after a reallocation of the
>> > virtual address space.
>> >
>>
>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>> CPU accesses it during early entry;
>
> So the issue is:
>
> CPU1: unmappes vmalloc page X which was previously mapped to physical page
> P1.
>
> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
> threshold and sending out IPIs. It then goes ahead and does an allocation
> that maps the same virtual page X to physical page P2.
>
> CPU3 is isolated and executes some early entry code before receving said IPIs
> which are supposedly deferred by Valentin's patches.
>
> It does not receive the IPI becuase it is deferred, thus access by early
> entry code to page X on this CPU results in a UAF access to P1.
>
> Is that the issue?
>

Pretty much so yeah. That is, *if* there such a vmalloc'd address access in
early entry code - testing says it's not the case, but I haven't found a
way to instrumentally verify this.