Message-ID: <600f9ca1-580e-46c9-94ad-f9b4b8c3cf97@redhat.com>
Date: Thu, 2 Oct 2025 13:21:17 +0200
From: David Hildenbrand <david@...hat.com>
To: Brendan Jackman <jackmanb@...gle.com>, peterz@...radead.org,
 bp@...en8.de, dave.hansen@...ux.intel.com, mingo@...hat.com,
 tglx@...utronix.de
Cc: akpm@...ux-foundation.org, derkling@...gle.com, junaids@...gle.com,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, reijiw@...gle.com,
 rientjes@...gle.com, rppt@...nel.org, vbabka@...e.cz, x86@...nel.org,
 yosry.ahmed@...ux.dev, Patrick Roy <roypat@...zon.co.uk>,
 Zi Yan <ziy@...dia.com>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)

On 02.10.25 12:50, Brendan Jackman wrote:
> On Thu Oct 2, 2025 at 7:45 AM UTC, David Hildenbrand wrote:
>>> I won't re-hash the details of the problem here (see [1]) but in short: file
>>> pages aren't mapped into the physmap as seen from ASI's restricted address space.
>>> This causes a major overhead when e.g. read()ing files. The solution we've
>>> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
>>> year) was to simply stop read() etc from touching the physmap.
>>>
>>> This is achieved in this prototype by a mechanism that I've called the "ephmap".
>>> The ephmap is a special region of the kernel address space that is local to the
>>> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
>>> allocate a subregion of this, and provide pages that get mapped into their
>>> subregion. These subregions are CPU-local. This means that it's cheap to tear
>>> these mappings down, so they can be removed immediately after use (eph =
>>> "ephemeral"), eliminating the need for complex/costly tracking data structures.
>>>
>>> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
>>> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
>>>
>>> The ephmap can then be used for accessing file pages. It's also a generic
>>> mechanism for accessing sensitive data, for example it could be used for
>>> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
>>>
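
Side note for anyone skimming: the usage pattern is essentially
kmap_local_page()-like. A hand-wavy sketch of what a caller could look
like; every identifier below is invented for illustration, it is not
the API from the actual patches:

/*
 * Sketch only: allocate a subregion, map a page into it, use the
 * mapping, and tear it down immediately. eph_alloc(), eph_map() and
 * eph_free() are hypothetical names.
 */
static void eph_read_page(struct page *page, void *buf)
{
	struct eph_ctx *ctx;
	void *va;

	preempt_disable();		/* subregions are CPU-local */
	ctx = eph_alloc(1);		/* reserve a one-page subregion */
	va = eph_map(ctx, page);	/* map the caller's page into it */
	memcpy(buf, va, PAGE_SIZE);
	eph_free(ctx);			/* immediate, cheap teardown */
	preempt_enable();
}

The immediate teardown is what removes the need for the tracking
structures mentioned above.
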
>>
>> At some point we discussed how to make secretmem pages movable so we
>> end up having less unmovable pages in the system.
>>
>> Secretmem pages have their directmap removed once allocated, and
>> restored once freed (truncated from the page cache).
>>
>> In order to migrate them we would have to temporarily map them, and we
>> obviously don't want to temporarily map them into the directmap.
>>
>> Maybe the ephmap could be used for that use case, too.
> 
> The way I've implemented it here, you can only use the ephmap while
> preemption is disabled. (A lot of the implementation I posted here is
> just stupid prototype stuff, but the preemption-off thing is
> deliberate). Does that still work here? I guess it's only needed for the
> brief moment while we are actually copying the data, right? In that case
> then yeah this seems like a good use case.

Yes, that's my expectation: we only need access for a brief moment in 
time, when actually copying page content.
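
Something like the following would be all the migration path needs
(purely illustrative; eph_map_page()/eph_unmap_page() are made-up names
for a single-page convenience wrapper around the ephmap). The source
page has no direct map entry, but the freshly allocated migration
target is still direct-mapped, so one ephemeral mapping at a time
would do:

/*
 * Sketch only: copy a direct-map-removed page into a still
 * direct-mapped migration target.
 */
static void secretmem_migrate_copy(struct page *dst, struct page *src)
{
	void *from;

	preempt_disable();		/* ephmap slots are CPU-local */
	from = eph_map_page(src);	/* hypothetical ephmap wrapper */
	copy_page(page_address(dst), from);
	eph_unmap_page(from);
	preempt_enable();
}

That would also fit the one-allocation-per-context restriction you
mention below.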

> 
>> Another, similar use case would be guest_memfd with an approach like
>> the one secretmem took: removing the direct map. While guest_memfd does
>> not support page migration yet, there are some prototypes that allow
>> migrating pages for non-CoCo (IOW: ordinary) VMs.
>>
>> Maybe the ephmap could be used here too.
> 
> Yeah, I think overall, the pattern of "I have tried to remove stuff from
> my address space, but actually I need to exceptionally access it anyway,
> we are not actually a microkernel" is gonna be a pretty common one. So
> if we can find a way to solve it generically that seems worthwhile. I'm
> not confident that this design is a generic solution but it seems like
> it might be a reasonable starting point.
> 
>> I guess an interesting question would be: which MM to use when we are
>> migrating a page out of random context: memory offlining, page
>> compaction, memory-failure, alloc_contig_pages, ...

In both contexts that ^ would be an interesting question. Maybe we'd
just need some dummy MM to achieve that, not sure.

>>
>> [...]
>>
>>>
>>> Despite my title these numbers are kinda disappointing to be honest; it's not
>>> where I wanted to be by now,
>>
>> "ASI is faster again" :)
>>
>>> but it's still an order of magnitude better than
>>> where we were for native FIO a few months ago. I believe almost all of this
>>> remaining slowdown is due to unnecessary ASI exits, the key areas being:
>>>
>>> - On every context_switch(). Google's internal implementation has fixed this (we
>>>     only really need it when switching mms).
>>>
>>> - Whenever zeroing sensitive pages from the allocator. This could potentially be
>>>     solved with the ephmap but requires a bit of care to avoid opening CPU attack
>>>     windows.
>>>
>>> - In copy-on-write for user pages. The ephmap could also help here but the
>>>     current implementation doesn't support it (it only allows one allocation at a
>>>     time per context).
>>>
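
The zeroing case sounds like the same pattern again; a sketch, with the
same invented helper names as in my migration example above:

/*
 * Sketch only: zero a sensitive page through an ephemeral mapping
 * instead of the unrestricted direct map.
 */
static void eph_zero_page(struct page *page)
{
	void *va;

	preempt_disable();
	va = eph_map_page(page);	/* hypothetical ephmap wrapper */
	clear_page(va);
	eph_unmap_page(va);
	preempt_enable();
}

Whether that opens the CPU attack windows you mention is the part that
needs care, of course.
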
>>
>> But only the first point would actually be relevant for the FIO
>> benchmark I assume, right?
> 
> Yeah that's a good point, I was thinking more of kernel compile when I
> wrote this, I don't remember having a specific theory about the FIO
> degradation. The other thing I didn't mention here that might be hitting
> FIO is filesystem metadata. For example if you run this on ext4 you
> would need to get the superblock into the restricted address space to
> make it fast. I'm not sure if there would be anything like that in
> shmem though...
> 
>> So how confident are you that this is really going to be solvable?
> 
> I feel pretty good about solvability right now - the numbers we see now
> are kinda where we were at internally 2 or 3 years ago, and then it was
> a few optimisation steps from there to GCE prod (IIRC the
> context_switch() one was a pretty big one for that use case, I can't
> remember if any of the TLB flushing optimisations made a big
> difference).
> 
> I can't deny the risk that these few steps might be much harder for
> native workloads than VM ones, but it just seems like a game of
> whack-a-mole now, not an "I'm not sure this thing is ever gonna work".
> The only question is how many moles there are to whack...
> 
>> Or to
>> ask from another angle: long-term how much slowdown do you expect and
>> target?
> 
> In the vast majority of cases, we've been able to keep degradations from
> ASI below 1% of whatever anyone's measuring. When things go above that
> we need to grovel a bit; if anything gets to 5% we don't even bother
> asking.
> 
> But also, note in lots of these cases we're switching ASI on while
> leaving other mitigations in place too. If we had a complete "denylist"
> (i.e. the holes in the restricted address space) that we were confident
> covered everything, we'd be able to make a lot of these degradations
> negative. So we might just be making life unnecessarily hard for
> ourselves by not doing that in the first place. The idea is to retrace
> our steps later and start switching off old mitigations and bragging
> triumphantly about our perf wins once we are totally certain there's no
> security regression.
> 
> So yeah I can't be 100% confident for the reasons I mentioned above but
> the target, which I think is realistic, is for ASI to be faster than the
> existing mitigations in all the interesting cases ("interesting" meaning
> we have to do kernel work instead of just flipping a bit in the CPU).

Got it, thanks for all that information. I suggest you use some of it
when moving forward with this project, because the roadmap matters much
more than the current limitations do.

-- 
Cheers

David / dhildenb

