Message-ID: <xhsmhwmetfk9d.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Fri, 17 Jan 2025 18:00:30 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Jann Horn <jannh@...gle.com>, linux-kernel@...r.kernel.org,
 x86@...nel.org, virtualization@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org, loongarch@...ts.linux.dev,
 linux-riscv@...ts.infradead.org, linux-perf-users@...r.kernel.org,
 xen-devel@...ts.xenproject.org, kvm@...r.kernel.org,
 linux-arch@...r.kernel.org, rcu@...r.kernel.org,
 linux-hardening@...r.kernel.org, linux-mm@...ck.org,
 linux-kselftest@...r.kernel.org, bpf@...r.kernel.org,
 bcm-kernel-feedback-list@...adcom.com, Juergen Gross <jgross@...e.com>,
 Ajay Kaher <ajay.kaher@...adcom.com>, Alexey Makhalov
 <alexey.amakhalov@...adcom.com>, Russell King <linux@...linux.org.uk>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>, Paul
 Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
 Albert Ou <aou@...s.berkeley.edu>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave
 Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>,
 Peter Zijlstra <peterz@...radead.org>, Arnaldo Carvalho de Melo
 <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, Mark Rutland
 <mark.rutland@....com>, Alexander Shishkin
 <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, Ian
 Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>,
 "Liang, Kan" <kan.liang@...ux.intel.com>, Boris Ostrovsky
 <boris.ostrovsky@...cle.com>, Josh Poimboeuf <jpoimboe@...nel.org>, Pawan
 Gupta <pawan.kumar.gupta@...ux.intel.com>, Sean Christopherson
 <seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski
 <luto@...nel.org>, Arnd Bergmann <arnd@...db.de>, Frederic Weisbecker
 <frederic@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>, Jason
 Baron <jbaron@...mai.com>, Steven Rostedt <rostedt@...dmis.org>, Ard
 Biesheuvel <ardb@...nel.org>, Neeraj Upadhyay
 <neeraj.upadhyay@...nel.org>, Joel Fernandes <joel@...lfernandes.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Lai Jiangshan <jiangshanlai@...il.com>,
 Zqiang <qiang.zhang1211@...il.com>, Juri Lelli <juri.lelli@...hat.com>,
 Clark Williams <williams@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>,
 Tomas Glozar <tglozar@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Kees Cook
 <kees@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, Christoph
 Hellwig <hch@...radead.org>, Shuah Khan <shuah@...nel.org>, Sami Tolvanen
 <samitolvanen@...gle.com>, Miguel Ojeda <ojeda@...nel.org>, Alice Ryhl
 <aliceryhl@...gle.com>, "Mike Rapoport (Microsoft)" <rppt@...nel.org>,
 Samuel Holland <samuel.holland@...ive.com>, Rong Xu <xur@...gle.com>,
 Nicolas Saenz Julienne <nsaenzju@...hat.com>, Geert Uytterhoeven
 <geert@...ux-m68k.org>, Yosry Ahmed <yosryahmed@...gle.com>, "Kirill A.
 Shutemov" <kirill.shutemov@...ux.intel.com>, "Masami Hiramatsu (Google)"
 <mhiramat@...nel.org>, Jinghao Jia <jinghao7@...inois.edu>, Luis
 Chamberlain <mcgrof@...nel.org>, Randy Dunlap <rdunlap@...radead.org>,
 Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer
 flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

On 17/01/25 17:11, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>>
> Probably I am missing something and need to have a look at your patches,
> but how do you guarantee that no one maps the same area that you defer
> for TLB flushing?
>

That's the cool part: I don't :')

For deferring instruction patching IPIs, I (well Josh really) managed to
get instrumentation to back me up and catch any problematic area.

I looked into getting something similar for vmalloc region access in
.noinstr code, but I didn't get anywhere. I even tried using emulated
watchpoints on QEMU to watch the whole vmalloc range, but that went about
as well as you could expect.

That left me with staring at code. AFAICT the only vmap'd thing that is
accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
itself cannot be freed until the task exits - thus can't be subject to
invalidation when a task is entering kernelspace.
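
To make the deferral idea concrete, here is a minimal userspace sketch of
it. It is a toy model under stated assumptions, not the code from this
series: every name in it (struct toy_cpu, toy_flush_kernel_range(),
toy_kernel_entry(), toy_user_entry()) is hypothetical, and a "flush" is
just a counter. A CPU marked as running in userspace gets a pending flag
instead of an IPI and runs the flush on its next kernel entry; whatever
executes between that entry and the flush is the "danger zone".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU bookkeeping for this toy model. */
struct toy_cpu {
	bool in_userspace;	/* CPU is currently running user code */
	bool flush_pending;	/* a kernel-range flush was deferred */
	unsigned int ipis;	/* IPIs this CPU had to take */
	unsigned int flushes;	/* flushes this CPU actually ran */
};

static struct toy_cpu cpus[NR_CPUS];

/* Mark a CPU as having returned to userspace. */
static void toy_user_entry(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].in_userspace = true;
}

/*
 * Model a kernel-range flush requested by CPU 'src': flush locally,
 * IPI the CPUs that are in the kernel, and only mark a pending flush
 * on the CPUs that are in userspace.
 */
static void toy_flush_kernel_range(int src)
{
	assert(src >= 0 && src < NR_CPUS);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == src) {
			cpus[cpu].flushes++;
		} else if (cpus[cpu].in_userspace) {
			/* Deferred: stale entries remain until kernel entry. */
			cpus[cpu].flush_pending = true;
		} else {
			cpus[cpu].ipis++;
			cpus[cpu].flushes++;
		}
	}
}

/*
 * Model kernel entry: run the deferred flush, if any. Returns true if
 * a deferred flush was executed, i.e. the CPU had possibly stale
 * entries up to this point.
 */
static bool toy_kernel_entry(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].in_userspace = false;
	if (!cpus[cpu].flush_pending)
		return false;
	cpus[cpu].flush_pending = false;
	cpus[cpu].flushes++;
	return true;
}
```

For example, after toy_user_entry(1) and toy_flush_kernel_range(0), CPU 1
has taken no IPI and only has flush_pending set, while CPU 2 (still in the
kernel in this model) took one IPI. The model does not capture the hard
part discussed here: proving that nothing touched before the deferred
flush runs can hit a stale mapping.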

If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).

> As noted by Jann, we already defer TLB flushing by backing freed areas
> until a certain threshold, and only after we cross it do we do a flush.
>
> --
> Uladzislau Rezki

