linux-kernel - [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling

Message-Id: <20250123172428.D6D8C8D9@davehans-spike.ostc.intel.com>
Date: Thu, 23 Jan 2025 09:24:28 -0800
From: Dave Hansen <dave.hansen@...ux.intel.com>
To: linux-kernel@...r.kernel.org
Cc: x86@...nel.org,tglx@...utronix.de,bp@...en8.de,joro@...tes.org,luto@...nel.org,peterz@...radead.org,kirill.shutemov@...ux.intel.com,rick.p.edgecombe@...el.com,jgross@...e.com,Dave Hansen <dave.hansen@...ux.intel.com>
Subject: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling

tl;dr: 32-bit PAE page table handling differs depending on whether
PTI is on or off. Making the handling uniform removes a good amount
of code at the cost of no longer sharing kernel PMDs. The downside of
this simplification is bloating non-PTI PAE kernels by ~2 pages
per process.

Anyone who cares about security on 32-bit is running with PTI and
PAE because PAE has the No-eXecute page table bit. They are already
paying the 2-page penalty. Anyone who cares more about memory
footprint than security is probably already running a !PAE kernel
and will not be affected by this.

--

There are two 32-bit x86 hardware page table formats: a 2-level one
with 32-bit pte_t's and a 3-level one with 64-bit pte_t's called PAE.
But the PAE one is wonky. It effectively loses a bit of addressing
radix per level since its PTEs are twice as large. It makes up for
that by adding the third level, but with only 4 entries in the level.
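
The arithmetic behind that can be sketched in a few lines of C. This is
only an illustration, not kernel code: the helper names are made up for
this example, and it assumes 4 KiB pages and 32-bit virtual addresses.

```c
#include <assert.h>

#define PAGE_SHIFT 12u                 /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define VA_BITS    32u                 /* 32-bit virtual addresses */

/* Index bits resolved by one page-sized table of entry_size-byte entries. */
static unsigned int bits_per_full_level(unsigned int entry_size)
{
	unsigned int entries = PAGE_SIZE / entry_size;
	unsigned int bits = 0;

	while (entries > 1) {
		entries >>= 1;
		bits++;
	}
	return bits;
}

/* Entries the top level needs when full_levels page-sized levels sit below it. */
static unsigned int top_level_entries(unsigned int entry_size,
				      unsigned int full_levels)
{
	unsigned int used = PAGE_SHIFT +
			    full_levels * bits_per_full_level(entry_size);

	return 1u << (VA_BITS - used);
}

/* Bytes occupied by the top-level table. */
static unsigned int top_level_bytes(unsigned int entry_size,
				    unsigned int full_levels)
{
	return top_level_entries(entry_size, full_levels) * entry_size;
}

/* Hypothetical self-check comparing the two 32-bit formats. */
static void check_paging_layouts(void)
{
	/* 2-level: 4-byte entries, 10 bits per level, page-sized top level. */
	assert(bits_per_full_level(4) == 10);
	assert(top_level_entries(4, 1) == 1024);
	assert(top_level_bytes(4, 1) == PAGE_SIZE);

	/* PAE: 8-byte entries, 9 bits per level, 2 bits left for the top. */
	assert(bits_per_full_level(8) == 9);
	assert(top_level_entries(8, 2) == 4);
	assert(top_level_bytes(8, 2) == 32);
}
```

Doubling the entry size halves the entries per page (1024 to 512), so
two full levels cover only 9+9+12 = 30 bits, leaving 2 bits (4 entries)
for the top level.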

This leads to all kinds of fun because this level only needs 32 bytes
instead of a whole page. Also, since it has only 4 entries in the top
level, the hardware just always caches the entire thing aggressively.
Modifying a PAE pgd_t ends up needing different rules than the
other x86 paging modes and probably every other architecture too.

PAE support got even weirder when Xen came along. Xen wants to trap
into the hypervisor on page table writes and so it protects the guest
page tables with paging protections. It can't protect a 32-byte
object with paging protections so it bloats the 32-byte object out
to a page. Xen also didn't support sharing kernel PMD pages. This
is mostly moot now because support for running as a 32-bit Xen guest
was ripped out, but there are still remnants around.

PAE also interacts with PTI in fun and exciting ways. Since pgd
updates are so fraught, the PTI PAE implementation just chose to
avoid pgd updates by preallocating all the PMDs up front since
there are only 4 instead of 512 or 1024 in the other x86 paging
modes.

Make PAE less weird:
 * Always allocate a page for PAE PGDs. This brings them in line
   with the other 2 paging modes. It was done for Xen and for
   PTI already and nobody screamed, so just do it everywhere.
 * Never share kernel PMD pages. This brings PAE in line with
   32-bit !PAE and 64-bit.
 * Always preallocate all PAE PMD pages. This basically makes
   all PAE kernels behave like PTI ones. It might waste a page
   of memory, but all 4 pages probably get allocated in the common
   case anyway.
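
As a rough back-of-the-envelope model of the per-process cost, here is
a sketch in C. It is not taken from the patches. The function names
are hypothetical, and it assumes 4 KiB pages and that three of the four
PGD slots map userspace (the usual 3G/1G split). It ignores PTI's own
extra allocations and only compares a non-PTI PAE kernel before and
after the changes listed above.

```c
#include <assert.h>

#define PAGE_SIZE       4096u                    /* assumption: 4 KiB pages */
#define PAE_PGD_ENTRIES 4u
#define PAE_PGD_BYTES   (PAE_PGD_ENTRIES * 8u)   /* 4 x 64-bit entries */

/*
 * Model of non-PTI PAE before the series: a 32-byte PGD, user PMD
 * pages allocated on demand, and a kernel PMD page shared by all
 * processes (so it costs nothing per process).
 */
static unsigned int pae_bytes_shared_kernel_pmd(unsigned int user_pmds_in_use)
{
	return PAE_PGD_BYTES + user_pmds_in_use * PAGE_SIZE;
}

/*
 * Model after the series: a page-sized PGD and all four PMD pages
 * preallocated per process, none shared.
 */
static unsigned int pae_bytes_uniform(void)
{
	return PAGE_SIZE + PAE_PGD_ENTRIES * PAGE_SIZE;
}

/* Extra per-process bytes the uniform scheme costs in this model. */
static unsigned int pae_extra_bytes(unsigned int user_pmds_in_use)
{
	return pae_bytes_uniform() -
	       pae_bytes_shared_kernel_pmd(user_pmds_in_use);
}
```

In this model a process already using all three user PMD pages pays
pae_extra_bytes(3) = 8160 bytes, just under 2 pages: one for the
page-sized PGD and one for the no-longer-shared kernel PMD. A process
using only two user PMD pages pays one more page, which is the "might
waste a page" case.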

--

 include/asm/pgtable-2level_types.h |    2
 include/asm/pgtable-3level_types.h |    4 -
 include/asm/pgtable_64_types.h     |    2
 mm/pat/set_memory.c                |    2
 mm/pgtable.c                       |  104 +++++--------------------------------
 5 files changed, 18 insertions(+), 96 deletions(-)