Message-ID: <DD7S12CF1V9G.3KGT3KYBLZ7F2@google.com>
Date: Thu, 02 Oct 2025 10:50:03 +0000
From: Brendan Jackman <jackmanb@...gle.com>
To: David Hildenbrand <david@...hat.com>, Brendan Jackman <jackmanb@...gle.com>, <peterz@...radead.org>, 
	<bp@...en8.de>, <dave.hansen@...ux.intel.com>, <mingo@...hat.com>, 
	<tglx@...utronix.de>
Cc: <akpm@...ux-foundation.org>, <derkling@...gle.com>, <junaids@...gle.com>, 
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>, <reijiw@...gle.com>, 
	<rientjes@...gle.com>, <rppt@...nel.org>, <vbabka@...e.cz>, <x86@...nel.org>, 
	<yosry.ahmed@...ux.dev>, Patrick Roy <roypat@...zon.co.uk>, Zi Yan <ziy@...dia.com>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)

On Thu Oct 2, 2025 at 7:45 AM UTC, David Hildenbrand wrote:
>> I won't re-hash the details of the problem here (see [1]) but in short: file
>> pages aren't mapped into the physmap as seen from ASI's restricted address space.
>> This causes a major overhead when e.g. read()ing files. The solution we've
>> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
>> year) was to simply stop read() etc from touching the physmap.
>> 
>> This is achieved in this prototype by a mechanism that I've called the "ephmap".
>> The ephmap is a special region of the kernel address space that is local to the
>> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
>> allocate a subregion of this, and provide pages that get mapped into their
>> subregion. These subregions are CPU-local. This means that it's cheap to tear
>> these mappings down, so they can be removed immediately after use (eph =
>> "ephemeral"), eliminating the need for complex/costly tracking data structures.
>> 
>> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
>> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
>> 
>> The ephmap can then be used for accessing file pages. It's also a generic
>> mechanism for accessing sensitive data; for example, it could be used for
>> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
>> 
>
> At some point we discussed how to make secretmem pages movable so we 
> end up having fewer unmovable pages in the system.
>
> Secretmem pages have their directmap removed once allocated, and 
> restored once freed (truncated from the page cache).
>
> In order to migrate them we would have to temporarily map them, and we 
> obviously don't want to temporarily map them into the directmap.
>
> Maybe the ephmap could be used for that use case, too.

The way I've implemented it here, you can only use the ephmap while
preemption is disabled. (A lot about the implementation I posted here is
just stupid prototype stuff, but the preemption-off thing is
deliberate). Does that still work here? I guess it's only needed for the
brief moment while we are actually copying the data, right? In that case
then yeah, this seems like a good use case.
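
For concreteness, the kind of thing I have in mind is roughly the sketch
below. ephmap_map_page()/ephmap_unmap_page() are names I'm inventing here
for illustration (not the actual API in the prototype), and I'm assuming
the migration target is a freshly allocated page that's still present in
the direct map:

  /*
   * Rough sketch only: copy a secretmem page for migration via a
   * short-lived, CPU-local ephmap mapping of the source.
   */
  static void secretmem_migrate_copy(struct page *dst, struct page *src)
  {
          void *from, *to;

          preempt_disable();              /* ephmap mappings are CPU-local */
          from = ephmap_map_page(src);    /* hypothetical helper */
          to = kmap_local_page(dst);      /* dst is still in the direct map */
          copy_page(to, from);
          kunmap_local(to);
          ephmap_unmap_page(from);        /* hypothetical helper */
          preempt_enable();
  }

So the preemption-off window is just the copy itself, which I think
matches what you're describing.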

> Another similar use case would be guest_memfd, with a similar approach to 
> the one secretmem took: removing the direct map. While guest_memfd does not 
> support page migration yet, there are some prototypes that allow 
> migrating pages for non-CoCo (IOW: ordinary) VMs.
>
> Maybe the ephmap could be used here too.

Yeah, I think overall, the pattern of "I have tried to remove stuff from
my address space, but actually I need to exceptionally access it anyway,
we are not actually a microkernel" is gonna be a pretty common one. So
if we can find a way to solve it generically that seems worthwhile. I'm
not confident that this design is a generic solution but it seems like
it might be a reasonable starting point.
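
As a strawman for what "generic" could look like: a small wrapper that
hides the map/access/unmap dance behind a callback, so callers (secretmem
migration, zeroing sensitive pages, ...) don't need to know about the
CPU-local/no-preemption rules. All names below are made up, not an
existing API:

  /*
   * Strawman: run a short, non-sleeping operation on a page that is
   * deliberately absent from the restricted address space.
   */
  static void with_ephemeral_mapping(struct page *page,
                                     void (*fn)(void *addr, void *arg),
                                     void *arg)
  {
          void *addr;

          preempt_disable();              /* mappings are CPU-local */
          addr = ephmap_map_page(page);   /* hypothetical helper */
          fn(addr, arg);
          ephmap_unmap_page(addr);        /* hypothetical helper */
          preempt_enable();
  }

  /* e.g. zeroing a sensitive page without touching the direct map: */
  static void zero_page_cb(void *addr, void *arg)
  {
          memset(addr, 0, PAGE_SIZE);
  }

The callback keeps the mapping lifetime trivially bounded, which is what
makes the cheap CPU-local teardown possible in the first place.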

> I guess an interesting question would be: which MM to use when we are 
> migrating a page out of random context: memory offlining, page 
> compaction, memory-failure, alloc_contig_pages, ...
>
> [...]
>
>> 
>> Despite my title these numbers are kinda disappointing to be honest - it's not
>> where I wanted to be by now,
>
> "ASI is faster again" :)
>
>> but it's still an order of magnitude better than
>> where we were for native FIO a few months ago. I believe almost all of this
>> remaining slowdown is due to unnecessary ASI exits, the key areas being:
>> 
>> - On every context_switch(). Google's internal implementation has fixed this (we
>>    only really need it when switching mms).
>> 
>> - Whenever zeroing sensitive pages from the allocator. This could potentially be
>>    solved with the ephmap but requires a bit of care to avoid opening CPU attack
>>    windows.
>> 
>> - In copy-on-write for user pages. The ephmap could also help here but the
>>    current implementation doesn't support it (it only allows one allocation at a
>>    time per context).
>> 
>
> But only the first point would actually be relevant for the FIO 
> benchmark I assume, right?

Yeah, that's a good point - I was thinking more of kernel compile when I
wrote this; I don't remember having a specific theory about the FIO
degradation. The other thing I didn't mention here that might be hitting
FIO is filesystem metadata. For example, if you run this on ext4 you
would need to get the superblock into the restricted address space to
make it fast. I'm not sure if there would be anything like that in
shmem though...

> So how confident are you that this is really going to be solvable? 

I feel pretty good about solvability right now - the numbers we see now
are kinda where we were at internally 2 or 3 years ago, and then it was
a few optimisation steps from there to GCE prod (IIRC the
context_switch() one was a pretty big one for that use case; I can't
remember if any of the TLB flushing optimisations made a big
difference).

I can't deny the risk that these few steps might be much harder for
native workloads than VM ones, but it just seems like a game of
whack-a-mole now, not an "I'm not sure this thing is ever gonna work".
The only question is how many moles there are to whack...

> Or to 
> ask from another angle: long-term how much slowdown do you expect and 
> target?

In the vast majority of cases, we've been able to keep degradations from
ASI below 1% of whatever anyone's measuring. When things go above that
we need to grovel a bit; if anything gets to 5% we don't even bother
asking.

But also, note that in lots of these cases we're switching ASI on while
leaving other mitigations in place too. If we had a complete "denylist"
(i.e. the holes in the restricted address space) that we were confident
covered everything, we'd be able to make a lot of these degradations
negative. So we might just be making life unnecessarily hard for
ourselves by not doing that in the first place. The idea is to retrace
our steps later and start switching off old mitigations and bragging
triumphantly about our perf wins once we are totally certain there's no
security regression.

So yeah, I can't be 100% confident for the reasons I mentioned above, but
the target, which I think is realistic, is for ASI to be faster than the
existing mitigations in all the interesting cases ("interesting" meaning
we have to do kernel work instead of just flipping a bit in the CPU).
