linux-kernel - Re: [PATCH v5 05/16] x86/cpu: Defer CR pinning setup until after EFI initialization

Message-ID: <34dd023d-3ed5-4655-88be-14a7a300b91e@intel.com>
Date: Tue, 29 Oct 2024 15:52:56 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>, "Mehta, Sohil"
 <sohil.mehta@...el.com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Andy Lutomirski <luto@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
 Dave Hansen <dave.hansen@...ux.intel.com>, "x86@...nel.org"
 <x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
 Peter Zijlstra <peterz@...radead.org>, Ard Biesheuvel <ardb@...nel.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, Josh Poimboeuf
 <jpoimboe@...nel.org>, Xiongwei Song <xiongwei.song@...driver.com>,
 "Li, Xin3" <xin3.li@...el.com>, "Mike Rapoport (IBM)" <rppt@...nel.org>,
 Brijesh Singh <brijesh.singh@....com>, Michael Roth <michael.roth@....com>,
 "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
 Alexey Kardashevskiy <aik@....com>
Cc: Jonathan Corbet <corbet@....net>, Ingo Molnar <mingo@...nel.org>,
 Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
 Daniel Sneddon <daniel.sneddon@...ux.intel.com>,
 "Huang, Kai" <kai.huang@...el.com>, Sandipan Das <sandipan.das@....com>,
 Breno Leitao <leitao@...ian.org>,
 "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 Alexei Starovoitov <ast@...nel.org>, Hou Tao <houtao1@...wei.com>,
 Juergen Gross <jgross@...e.com>, Vegard Nossum <vegard.nossum@...cle.com>,
 Kees Cook <kees@...nel.org>, Eric Biggers <ebiggers@...gle.com>,
 Jason Gunthorpe <jgg@...pe.ca>,
 "Masami Hiramatsu (Google)" <mhiramat@...nel.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Luis Chamberlain <mcgrof@...nel.org>, Yuntao Wang <ytcoode@...il.com>,
 Rasmus Villemoes <linux@...musvillemoes.dk>,
 Christophe Leroy <christophe.leroy@...roup.eu>, Tejun Heo <tj@...nel.org>,
 Changbin Du <changbin.du@...wei.com>,
 Huang Shijie <shijie@...amperecomputing.com>,
 Geert Uytterhoeven <geert+renesas@...der.be>,
 Namhyung Kim <namhyung@...nel.org>,
 Arnaldo Carvalho de Melo <acme@...hat.com>,
 "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-efi@...r.kernel.org" <linux-efi@...r.kernel.org>
Subject: Re: [PATCH v5 05/16] x86/cpu: Defer CR pinning setup until after EFI
 initialization

On 10/29/24 15:26, Luck, Tony wrote:
>>  	/*
>>  	 * This needs to follow the FPU initialization, since EFI depends on it.
>> +	 * It also needs to precede the CR pinning setup, because we need to be
>> +	 * able to temporarily clear the CR4.LASS bit in order to execute the
>> +	 * set_virtual_address_map call, which resides in lower addresses and
>> +	 * would trip LASS if enabled.
>>  	 */
> 
> Why are the temporary mappings used to patch kernel code in the lower half
> of the virtual address space? 

I was just asking myself the same thing.  The upper half is always
mapped uniformly.  When you create an MM you copy the 256->511th pgd
entries verbatim from the init_mm's pgd.

If you map something in a <=255th pgd entry, it isn't (by default)
visible to other mm's.  That's why a new mm also tends to get you a new
process.
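
To make the pgd split concrete, here is a toy user-space model in C.  It
is only a sketch: it assumes 4-level paging with 512 pgd entries and a
39-bit pgd shift, and every name in it (toy_pgd, toy_pgd_index,
toy_pgd_init_new_mm) is made up for illustration and is not kernel API.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: 4-level paging, 512 pgd slots, each covering 2^39 bytes. */
#define TOY_PTRS_PER_PGD   512
#define TOY_KERNEL_PGD_MIN 256   /* first upper-half slot */
#define TOY_PGDIR_SHIFT    39

/* Hypothetical stand-in for a top-level page table. */
typedef struct {
	uint64_t entry[TOY_PTRS_PER_PGD];
} toy_pgd;

/* Which pgd slot does a virtual address fall into? */
static unsigned int toy_pgd_index(uint64_t addr)
{
	return (unsigned int)((addr >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1));
}

/*
 * Build the pgd for a new mm: the lower half (slots 0..255) starts
 * empty and is private to this mm; the upper half (slots 256..511)
 * is copied verbatim from the init pgd, so it looks the same everywhere.
 */
static void toy_pgd_init_new_mm(toy_pgd *new_pgd, const toy_pgd *init_pgd)
{
	memset(new_pgd, 0, sizeof(*new_pgd));
	memcpy(&new_pgd->entry[TOY_KERNEL_PGD_MIN],
	       &init_pgd->entry[TOY_KERNEL_PGD_MIN],
	       (TOY_PTRS_PER_PGD - TOY_KERNEL_PGD_MIN) * sizeof(uint64_t));
}
```

In this model, anything installed in slot 5 of one toy_pgd stays
invisible to every other toy_pgd, while slot 300 is shared by
construction.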

> But couldn't we map into upper half and do some/all of:
> 
> 1) Trust that there aren't stupid bugs that dereference random pointers into the
> temporary mapping?
> 2) Make a "this CPU only" mapping
> 3) Avoid preemption while patching so there is no need for TLB shootdown
> by other CPUs when the temporary mapping is torn down, just flush local TLB.

It's about enforcing W^X semantics.  We should limit the time and scope
where mappings have some data both writeable and executable.

If we poke text in the upper half of the address space, any kernel
thread might be exploited to write to what will soon be executable.

If we do it in the lower half in its own mm, you have to compromise the
thread doing the text poking after the mapping is created but before it
is invalidated.  With LASS you *ALSO* need to do it in the STAC/CLAC
window which is smaller than the window when the TLB is valid.
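
A rough way to picture the extra LASS requirement is the toy check below.
It is a deliberately simplified model, not the architectural definition:
it assumes a supervisor-mode access, treats "lower half" as bit 63 clear,
lets a data access through only when EFLAGS.AC is set (the STAC/CLAC
window), and always blocks instruction fetches from the lower half.  All
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical access kinds for the toy model. */
enum toy_access { TOY_DATA, TOY_FETCH };

/* Lower half of the address space: bit 63 clear. */
static bool toy_is_lower_half(uint64_t addr)
{
	return (addr >> 63) == 0;
}

/*
 * Would a supervisor-mode access be blocked in this simplified model?
 *  - LASS off, or an upper-half address: never blocked here.
 *  - Lower-half instruction fetch: always blocked.
 *  - Lower-half data access: blocked unless AC is set (inside STAC/CLAC).
 */
static bool toy_lass_blocks(uint64_t addr, enum toy_access kind,
			    bool lass_enabled, bool ac_set)
{
	if (!lass_enabled || !toy_is_lower_half(addr))
		return false;
	if (kind == TOY_FETCH)
		return true;
	return !ac_set;
}
```

Under these assumptions a write through a lower-half poking address only
succeeds while ac_set is true, which is the narrower window described
above.  The fetch case is why the quoted patch comment wants CR4.LASS
clear while calling set_virtual_address_map.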

*IF* we switched things to do text poking in the upper half of the
address space, we'd probably want to find a completely unused PGD entry.
I'm not sure off the top of my head if we have a good one for that or
if it's worth the trouble.