linux-kernel - Re: [PATCH 04/21] x86/mm/asi: set up asi_nonsensitive

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <DD7W6FACMRPU.BXJCZU93UMVQ@google.com>
Date: Thu, 02 Oct 2025 14:05:08 +0000
From: Brendan Jackman <jackmanb@...gle.com>
To: Dave Hansen <dave.hansen@...el.com>, Brendan Jackman <jackmanb@...gle.com>, 
	Andy Lutomirski <luto@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Yuanchu Xie <yuanchu@...gle.com>, 
	Roman Gushchin <roman.gushchin@...ux.dev>
Cc: <peterz@...radead.org>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>, 
	<mingo@...hat.com>, <tglx@...utronix.de>, <akpm@...ux-foundation.org>, 
	<david@...hat.com>, <derkling@...gle.com>, <junaids@...gle.com>, 
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>, <reijiw@...gle.com>, 
	<rientjes@...gle.com>, <rppt@...nel.org>, <vbabka@...e.cz>, <x86@...nel.org>, 
	<yosry.ahmed@...ux.dev>
Subject: Re: [PATCH 04/21] x86/mm/asi: set up asi_nonsensitive_pgd

On Wed Oct 1, 2025 at 8:28 PM UTC, Dave Hansen wrote:
> On 9/24/25 07:59, Brendan Jackman wrote:
>> Create the initial shared pagetable to hold all the mappings that will
>> be shared among ASI domains.
>> 
>> Mirror the physmap into the ASI pagetables, but with a maximum
>> granularity that's guaranteed to allow changing pageblock sensitivity
>> without having to allocate pagetables, and with everything as
>> non-present.
>
> Could you also talk about what this granularity _actually_ is and why it
> has the property of never requiring page table alloc

Ack, will expand on this (I think from your other comments that you
understand it now, and you're just asking me to improve the commit
message, let me know if I misread that).

>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index e98e85cf15f42db669696ba8195d8fc633351b26..7e0471d46767c63ceade479ae0d1bf738f14904a 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -7,6 +7,7 @@
>>   *  Copyright (C) 2002,2003 Andi Kleen <ak@...e.de>
>>   */
>>  
>> +#include <linux/asi.h>
>>  #include <linux/signal.h>
>>  #include <linux/sched.h>
>>  #include <linux/kernel.h>
>> @@ -746,7 +747,8 @@ phys_pgd_init(pgd_t *pgd_page, unsigned long paddr_start, unsigned long paddr_en
>>  {
>>  	unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
>>  
>> -	*pgd_changed = false;
>> +	if (pgd_changed)
>> +		*pgd_changed = false;
>
> This 'pgd_changed' hunk isn't mentioned in the changelog.

Oops, will add a note about that. The alternative would just be to
squash this into the commit that introduces phys_pgd_init(), let me know
if you have a preference.

>> @@ -797,6 +800,24 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
>>  
>>  	paddr_last = phys_pgd_init(init_mm.pgd, paddr_start, paddr_end, page_size_mask,
>>  				   prot, init, &pgd_changed);
>> +
>> +	/*
>> +	 * Set up ASI's unrestricted physmap. This needs to mapped at minimum 2M
>> +	 * size so that regions can be mapped and unmapped at pageblock
>> +	 * granularity without requiring allocations.
>> +	 */
>
> This took me a minute to wrap my head around.
>
> Here, I think you're trying to convey that:
>
>   1. There's a higher-level design decision that all sensitivity will be
>      done at a 2M granularity. A 2MB physical region is either sensitive
>      or not.
>   2. Because of #1, 1GB mappings are not cool because splitting a 1GB
>      mapping into 2MB needs to allocate a page table page.
>   3. 4k mappings are OK because they can also have their permissions
>      changed at a 2MB granularity. It's just more laborious.
>
> The "minimum 2M size" comment really threw me off because that, to me,
> also includes 1G which is a no-no here.

Er yeah sorry that's just wrong, it should say "maximum size".

> I also can't help but wonder if it would have been easier and more
> straightforward to just start this whole exercise at 4k: force all the
> ASI tables to be 4k. Then, later, add the 2MB support and tie to
> pageblocks on after.

This would lead to a much smaller patchset, but I think it creates some
pretty yucky technical debt and complexity of its own. If you're
imagining a world where we just leave most of the allocator as-is, and
just inject "map into ASI" or "unmap from ASI" at the right moments...

I think to make this work you have to do one of:

- Say all free pages are unmapped from the restricted address space, we
  map them on-demand in allocation (if !__GFP_SENSITIVE), and unmap them
  again when they are freed. Because you can't flush the TLB
  synchronously in the free path, you need an async worker to take care
  of that for you.

  This is what we did in the Google implementation (where "don't change
  the page allocator more than you have to" kinda trumps everything) and
  it's pretty nasty. We have lots of knobs we can turn to try and make
  it perform well but in the end it's eventually gonna block deployment
  to some environment or other.

- Say free pages are mapped into the restricted address space. So if you
  get a __GFP_SENSITIVE alloc you unmap the pages and do the TLB flush
  synchronously there, unless we think the caller might be atomic, in
  which case.... I guess we'd have to have a sort of special atomic
  reserve for this? Which... seems like a weaker and more awkward
  version of the thing I'm proposing in this patchset.

  Then when you free the page you need to map it back again, which means
  you need to zero it.

I might have some tunnel-vision on this so please challenge me if it
sounds like I'm missing something.

>> +	if (asi_nonsensitive_pgd) {
>> +		/*
>> +		 * Since most memory is expected to end up sensitive, start with
>> +		 * everything unmapped in this pagetable.
>> +		 */
>> +		pgprot_t prot_np = __pgprot(pgprot_val(prot) & ~_PAGE_PRESENT);
>> +
>> +		VM_BUG_ON((PAGE_SHIFT + pageblock_order) < page_level_shift(PG_LEVEL_2M));
>> +		phys_pgd_init(asi_nonsensitive_pgd, paddr_start, paddr_end, 1 << PG_LEVEL_2M,
>> +			      prot_np, init, NULL);
>> +	}
>
> I'm also kinda wondering what the purpose is of having a whole page
> table full of !_PAGE_PRESENT entries. It would be nice to know how this
> eventually gets turned into something useful.

If you are thinking of the fact that just clearing P doesn't really do
anything for Meltdown/L1TF.. yeah that's true! We'll actually need to
munge the PFN or something too, but here I wanted do just focus on the
broad strokes of integration without worrying too much about individual
CPU mitigations. Flippping _PAGE_PRESENT is already supported by
set_memory.c and IIRC it's good enough for everything newer than
Skylake.

Other than that, these pages being unmapped is the whole point.. later
on, the subset of memory that we don't need to protect will get flipped
to being present. Everything else will trigger a pagefault if touched
and we'll switch address spaces, do the flushing etc.

Sorry if I'm missing your point here...