linux-kernel - Re: [RFC PATCH v5 00/18] pkeys-based page table hardening

Message-ID: <6dc0b5c8-b485-4fe1-b85b-7dcd00214d1b@arm.com>
Date: Wed, 1 Oct 2025 14:41:58 +0200
From: Kevin Brodsky <kevin.brodsky@....com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "yang@...amperecomputing.com" <yang@...amperecomputing.com>,
 "linux-hardening@...r.kernel.org" <linux-hardening@...r.kernel.org>
Cc: "maz@...nel.org" <maz@...nel.org>, "luto@...nel.org" <luto@...nel.org>,
 "willy@...radead.org" <willy@...radead.org>,
 "mbland@...orola.com" <mbland@...orola.com>,
 "david@...hat.com" <david@...hat.com>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "rppt@...nel.org" <rppt@...nel.org>, "joey.gouly@....com"
 <joey.gouly@....com>, "akpm@...ux-foundation.org"
 <akpm@...ux-foundation.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "catalin.marinas@....com" <catalin.marinas@....com>,
 "Weiny, Ira" <ira.weiny@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
 "pierre.langlois@....com" <pierre.langlois@....com>,
 "jeffxu@...omium.org" <jeffxu@...omium.org>,
 "linus.walleij@...aro.org" <linus.walleij@...aro.org>,
 "lorenzo.stoakes@...cle.com" <lorenzo.stoakes@...cle.com>,
 "kees@...nel.org" <kees@...nel.org>,
 "ryan.roberts@....com" <ryan.roberts@....com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "jannh@...gle.com" <jannh@...gle.com>,
 "peterz@...radead.org" <peterz@...radead.org>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>, "will@...nel.org" <will@...nel.org>,
 "qperret@...gle.com" <qperret@...gle.com>,
 "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "broonie@...nel.org" <broonie@...nel.org>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH v5 00/18] pkeys-based page table hardening

On 18/09/2025 19:31, Edgecombe, Rick P wrote:
> On Thu, 2025-09-18 at 16:15 +0200, Kevin Brodsky wrote:
>> This is where I have to apologise to Rick for not having studied his
>> series more thoroughly, as patch 17 [2] covers this issue very well in
>> the commit message.
>>
>> It seems fair to say there is no ideal or simple solution, though.
>> Rick's patch reserves enough (PTE-mapped) memory for fully splitting the
>> linear map, which is relatively simple but not very pleasant. Chatting
>> with Ryan Roberts, we figured another approach, improving on solution 1
>> mentioned in [2]. It would rely on allocating all PTPs from a special
>> pool (without using set_memory_pkey() in pagetable_*_ctor), along those
>> lines:
> Oh I didn't realize ARM split the direct map now at runtime. IIRC it used to
> just map at 4k if there were any permissions configured.

Until recently the linear map was always PTE-mapped on arm64 if
rodata=full (default) or in other situations (e.g. DEBUG_PAGEALLOC), so
that it never needed to be split at runtime. Since [1b] landed though,
there is support for setting permissions at the block level and
splitting, meaning that the linear map can be block-mapped in most cases
(see force_pte_mapping() in patch 3 for details). This is only enabled
on systems with the BBML2_NOABORT feature though.

[1b]
https://lore.kernel.org/all/20250917190323.3828347-1-yang@os.amperecomputing.com/
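As a rough illustration of the decision described above, here is a minimal sketch. It is not the actual force_pte_mapping() from patch 3: the struct, its fields and the function name are hypothetical, and the real logic may take more conditions into account.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs for illustration; these are not kernel symbols. */
struct linear_map_cfg {
	bool rodata_full;      /* rodata=full, the default */
	bool debug_pagealloc;  /* e.g. DEBUG_PAGEALLOC is in use */
	bool bbml2_noabort;    /* system has the BBML2_NOABORT feature */
};

/* Returns true if the linear map would have to be PTE-mapped up front. */
static bool sketch_force_pte(struct linear_map_cfg cfg)
{
	/* Page-granular permissions are only needed in these situations. */
	bool needs_page_perms = cfg.rodata_full || cfg.debug_pagealloc;

	if (!needs_page_perms)
		return false;

	/* With BBML2_NOABORT, blocks can be split at runtime instead. */
	return !cfg.bbml2_noabort;
}

/* Usage example: the default configuration with and without the feature. */
static void sketch_force_pte_examples(void)
{
	struct linear_map_cfg old_hw = { .rodata_full = true };
	struct linear_map_cfg new_hw = { .rodata_full = true,
					 .bbml2_noabort = true };

	assert(sketch_force_pte(old_hw));   /* must be PTE-mapped */
	assert(!sketch_force_pte(new_hw));  /* can stay block-mapped */
}
```

The point of the sketch is only that BBML2_NOABORT turns "PTE-map everything in advance" into "split on demand", which is what makes running out of PTPs during a split a concern in the first place.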

>> 1. 2 pages are reserved at all times (with the appropriate pkey)
>> 2. Try to allocate a 2M block. If needed, use a reserved page as PMD to
>> split a PUD. If successful, set its pkey - the entire block can now be
>> used for PTPs. Replenish the reserve from the block if needed.
>> 3. If no block is available, make an order-2 allocation (4 pages). If
>> needed, use 1-2 reserved pages to split PUD/PMD. Set the pkey of the 4
>> pages, take 1-2 pages to replenish the reserve if needed.
> Oh, good idea!
>
>> This ensures that we never run out of PTPs for splitting. We may get
>> into an OOM situation more easily due to the order-2 requirement, but
>> the risk remains low compared to requiring a 2M block. A bigger concern
>> is concurrency - do we need a per-CPU cache? Reserving a 2M block per
>> CPU could be very much overkill.
>>
>> No matter which solution is used, this clearly increases the complexity
>> of kpkeys_hardened_pgtables. Mike Rapoport has posted a number of RFCs
>> [3][4] that aim at addressing this problem more generally, but no
>> consensus seems to have emerged and I'm not sure they would completely
>> solve this specific problem either.
>>
>> For now, my plan is to stick to solution 3 from [2], i.e. force the
>> linear map to be PTE-mapped. This is easily done on arm64 as it is the
>> default, and is required for rodata=full, unless [1] is applied and the
>> system supports BBML2_NOABORT. See [1] for the potential performance
>> improvements we'd be missing out on (~5% ballpark).
>>
> I continue to be surprised that allocation time pkey conversion is not a
> performance disaster, even with the directmap pre-split.
>
>> I'm not quite sure
>> what the picture looks like on x86 - it may well be more significant as
>> Rick suggested.
> I think having more efficient direct map permissions is a solvable problem, but
> each usage is just a little too small to justify the infrastructure for a good
> solution. And each simple solution is a little too much overhead to justify the
> usage. So there is a long tail of blocked usages:
>  - pkeys usages (page tables and secret protection)
>  - kernel shadow stacks
>  - More efficient executable code allocations (BPF, kprobe trampolines, etc)
>
> Although the BPF folks started doing their own thing for this. But I don't think
> there are any fundamentally unsolvable problems for a generic solution. It's a
> question of a leading killer usage to justify the infrastructure. Maybe it will
> be kernel shadow stack.

That seems to be exactly the situation, yes. Given Will's feedback, I'll
try to implement such a dedicated allocator one more time (based on the
scheme I suggested above) and see how it goes. Hopefully that will
create more momentum for a generic infrastructure :)

- Kevin
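For reference, the bookkeeping of the three-step scheme quoted earlier (a 2-page reserve, a 2M block if possible, an order-2 allocation otherwise) can be modelled in userspace roughly as follows. This is only a sketch under assumptions: all names are hypothetical, pages are plain counters, and neither setting the pkey nor the actual splitting of the linear map is modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define RESERVE_TARGET   2    /* step 1: pages kept in reserve at all times */
#define PAGES_PER_BLOCK  512  /* a 2M block of 4K pages */
#define PAGES_PER_ORDER2 4    /* an order-2 allocation */

/* Hypothetical pool state; both counters are pkey-protected pages. */
struct ptp_pool {
	int reserve;  /* held back for splitting PUD/PMD entries */
	int free;     /* available to hand out as PTPs */
};

enum refill_result { REFILL_BLOCK, REFILL_ORDER2, REFILL_FAILED };

/*
 * Refill the pool. split_pages is the number of reserved pages consumed
 * to split the linear map around the new memory: at most 1 for a 2M
 * block (one PMD page), at most 2 for an order-2 allocation.
 */
static enum refill_result ptp_pool_refill(struct ptp_pool *pool,
					  bool block_available,
					  bool order2_available,
					  int split_pages)
{
	enum refill_result res;
	int gained, max_split, topup;

	if (block_available) {            /* step 2 */
		gained = PAGES_PER_BLOCK;
		max_split = 1;
		res = REFILL_BLOCK;
	} else if (order2_available) {    /* step 3 */
		gained = PAGES_PER_ORDER2;
		max_split = 2;
		res = REFILL_ORDER2;
	} else {
		return REFILL_FAILED;     /* pool left untouched */
	}

	assert(split_pages >= 0 && split_pages <= max_split);
	assert(split_pages <= pool->reserve);
	pool->reserve -= split_pages;

	/* Replenish the reserve from the newly protected pages. */
	topup = RESERVE_TARGET - pool->reserve;
	pool->reserve += topup;
	pool->free += gained - topup;
	return res;
}
```

In this model the reserve is back at 2 pages after every successful refill, and even the worst case (order-2 allocation with two splits) still nets 2 usable PTPs. Concurrency, i.e. the per-CPU cache question, is deliberately left out.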