Message-ID: <cdccc1a6-c348-4cae-ab70-92c5bd3bd9fd@lucifer.local>
Date: Thu, 21 Aug 2025 09:55:27 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Brendan Jackman <jackmanb@...gle.com>
Cc: peterz@...radead.org, bp@...en8.de, dave.hansen@...ux.intel.com,
mingo@...hat.com, tglx@...utronix.de, akpm@...ux-foundation.org,
david@...hat.com, derkling@...gle.com, junaids@...gle.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, reijiw@...gle.com,
rientjes@...gle.com, rppt@...nel.org, vbabka@...e.cz, x86@...nel.org,
yosry.ahmed@...ux.dev, Matthew Wilcox <willy@...radead.org>,
Liam Howlett <liam.howlett@...cle.com>,
"Kirill A. Shutemov" <kas@...nel.org>,
Harry Yoo <harry.yoo@...cle.com>, Jann Horn <jannh@...gle.com>,
Pedro Falcato <pfalcato@...e.de>, Andy Lutomirski <luto@...nel.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, Kees Cook <kees@...nel.org>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)
+cc Matthew for page cache side
+cc Other memory mapping folks for mapping side
+cc various x86 folks for x86 side
+cc Kees for security side of things
On Tue, Aug 12, 2025 at 05:31:09PM +0000, Brendan Jackman wrote:
> .:: Intro
>
> Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
> branch that demonstrates a technique for solving the page cache performance
> devastation I described in [1]. The branch is at [5].
I've looked through your branch at [5]. Note that the exit_mmap() code is
changing very soon, see [ljs0]. Also, with regard to PGD syncing, Harry
recently posted a hotfix series [ljs1] that generalises the PGD sync code to
address issues in this area, which may be usefully relevant to your series.
[ljs0]:https://lore.kernel.org/linux-mm/20250815191031.3769540-1-Liam.Howlett@oracle.com/
[ljs1]:https://lore.kernel.org/linux-mm/20250818020206.4517-1-harry.yoo@oracle.com/
>
> The goal of this prototype is to increase confidence that ASI is viable as a
> broad solution for CPU vulnerabilities. (If the community still has to develop
> and maintain new mitigations for every individual vuln, because ASI only works
> for certain use-cases, then ASI isn't super attractive given its complexity
> burden).
>
> The biggest gap for establishing that confidence was that Google's deployment
> still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
> page cache turned out to be a massive issue that Google just hasn't run up
> against yet internally.
>
> .:: The "ephmap"
>
> I won't re-hash the details of the problem here (see [1]) but in short: file
> pages aren't mapped into the physmap as seen from ASI's restricted address space.
> This causes a major overhead when e.g. read()ing files. The solution we've
> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
> year) was to simply stop read() etc from touching the physmap.
>
> This is achieved in this prototype by a mechanism that I've called the "ephmap".
> The ephmap is a special region of the kernel address space that is local to the
> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
> allocate a subregion of this, and provide pages that get mapped into their
> subregion. These subregions are CPU-local. This means that it's cheap to tear
> these mappings down, so they can be removed immediately after use (eph =
> "ephemeral"), eliminating the need for complex/costly tracking data structures.
OK, I had a bunch of questions here, but then I looked at the code :)
So the idea is that, for each process, we have a per-CPU buffer equal in size
to the largest possible folio.
I wonder, by the way, if we can cache the page tables rather than allocating
on bring-up and freeing on tear-down? Or just zap the entries? That could
help things.
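Just to check my understanding of the shape of the API, I'm picturing
something roughly like the below (all of these names are made up by me for
illustration, they are not what your branch actually uses):

  /* Illustrative sketch only - hypothetical names, not the branch's API. */
  struct eph_ctx *ctx;
  void *vaddr;

  ctx = eph_begin();                  /* claim this CPU's per-mm subregion */
  vaddr = eph_map_folio(ctx, folio);  /* install PTEs for the folio        */

  /* ... touch the data through vaddr instead of the physmap ... */

  eph_end(ctx);                       /* tear down PTEs + local TLB flush  */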
>
> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
I do wonder if we need a separate kmap thing, or whether we can just adjust
what already exists?
Presumably we will restrict ASI support to 64-bit kernels only (starting with,
and perhaps only for, x86-64), so we can avoid the highmem bs.
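On 64-bit, kmap_local_page() basically just boils down to page_address(), so
conceptually the ephmap is "kmap_local_page(), except it actually installs a
mapping even without highmem". The existing calling convention it would need
to preserve is just:

  /* Existing kmap_local_page() usage; on 64-bit today this is effectively
   * free since kaddr is just the physmap address. */
  void *kaddr = kmap_local_page(page);

  memcpy(buf, kaddr + offset, len);

  kunmap_local(kaddr);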
>
> The ephmap can then be used for accessing file pages. It's also a generic
> mechanism for accessing sensitive data, for example it could be used for
> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
>
> .:: State of the branch
>
> The branch contains:
>
> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
> to "mm/page_alloc: Add support for ASI-unmapping pages")
> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
> cmdline flag")
> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
> ASI page faults")
> - A prototype of the new performance improvements (the remainder of the
> branch).
>
> There's a gradient of quality where the earlier patches are closer to "complete"
> and the later ones are increasingly messy and hacky. Comments and commit message
> describe lots of the hacky elements but the most important things are:
>
> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
> This is just a shortcut to make its behaviour obvious. Since tmpfs is the
> most extreme case of the read/write slowdown this should give us some idea of
> the performance improvements but it obviously hides a lot of important
> complexity wrt how this would be integrated "for real".
Right, at what level do you plan to put the 'real' stuff?
generic_file_read_iter() + equivalent or something like this? But then you'd
miss some fs obv., so I guess filemap_read()?
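To illustrate what I mean, filemap_read() currently copies via
copy_folio_to_iter(), which reaches the folio through the physmap. I'm
imagining an ephmap-backed equivalent hooked in there, something vaguely like
the below (the eph_*() and ephmap_copy_folio_to_iter() names are ones I've
just invented):

  /* Purely illustrative replacement for the physmap access that
   * copy_folio_to_iter() does today in filemap_read()'s copy loop. */
  static size_t ephmap_copy_folio_to_iter(struct folio *folio, size_t offset,
                                          size_t bytes, struct iov_iter *iter)
  {
          struct eph_ctx *ctx = eph_begin();        /* hypothetical */
          void *vaddr = eph_map_folio(ctx, folio);  /* hypothetical */
          size_t copied;

          copied = copy_to_iter(vaddr + offset, bytes, iter);

          eph_end(ctx);                             /* hypothetical */
          return copied;
  }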
>
> 2. The ephmap implementation is extremely stupid. It only works for the simple
> shmem usecase. I don't think this is really important though, whatever we end
> up with needs to be very simple, and it's not even clear that we actually
> want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> kmap_local_page() itself).
Right, just testing stuff out, fair enough. Obviously not an upstreamable
thing, but more of a test case, right?
>
> 3. For software correctness, the ephmap only needs to be TLB-flushed on the
> local CPU. But for CPU vulnerability mitigation, flushes are needed on other
> CPUs too. I believe these flushes should only be needed very infrequently.
> "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
> how these flushes could be implemented, but it's a bit of a simplistic
> implementation. The commit message has some more details.
Yeah, I'm no security/x86 expert, so you'll need insight from those with a
better understanding of both, but I think it's worth taking the time to have
this do the minimum we can prove is necessary in any real-world scenario.
It's good to start super conservative, though.
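To check I've understood the split: in the hot path the unmap only ever needs
a local flush of the subregion, and the cross-CPU flush only has to happen on
some much coarser, paranoia-driven cadence. Roughly (the eph_*() helpers and
EPHMAP_START/EPHMAP_END bounds below are made-up names, the flush calls are
the existing x86 ones):

  /* Hot path (sketch): tear down this CPU's subregion only - enough for
   * software correctness. */
  static void eph_unmap_local(unsigned long start, unsigned long end)
  {
          unsigned long addr;

          /* ... clear the subregion's PTEs here ... */

          for (addr = start; addr < end; addr += PAGE_SIZE)
                  flush_tlb_one_kernel(addr);       /* local CPU only */
  }

  /* Cold path (sketch): infrequent flush hitting every CPU, so no stale
   * ephmap translation survives long enough to be exploitable. */
  static void eph_flush_global(void)
  {
          flush_tlb_kernel_range(EPHMAP_START, EPHMAP_END);
  }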
>
> .:: Performance
>
> This data was gathered using the scripts at [4]. This is running on a Sapphire
> Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
> asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
> know of any vulns that would necessitate this IBPB, so this is basically a weird
> selectively-paranoid configuration of ASI. It doesn't really make sense from a
> security perspective. A few years from now (once the security researchers have
> had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
> that it turns out to be exactly an IBPB like this, but it's reasonably likely to
> be something with a vaguely similar performance overhead.
I mean, this all sounds like you should drop this :)
What do the numbers look like without it?
>
> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
> +---------+---------+-----------+---------+-----------+---------------+
> | variant | samples | mean | min | max | delta mean |
> +---------+---------+-----------+---------+-----------+---------------+
> | asi-off | 10 | 1,003,102 | 981,813 | 1,036,142 | |
> | asi-on | 10 | 871,928 | 848,362 | 885,622 | -13.1% |
> +---------+---------+-----------+---------+-----------+---------------+
>
> Native kernel compilation time:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples | mean | min | max | delta mean |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off | 3 | 34.84s | 34.42s | 35.31s | |
> | asi-on | 3 | 37.50s | 37.39s | 37.58s | 7.6% |
> +---------+---------+--------+--------+--------+-------------+
>
> Kernel compilation in a guest VM:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples | mean | min | max | delta mean |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off | 3 | 52.73s | 52.41s | 53.15s | |
> | asi-on | 3 | 55.80s | 55.51s | 56.06s | 5.8% |
> +---------+---------+--------+--------+--------+-------------+
(tiny nit, but I think the bottom two are meant to be negative, or the first
positive :P)
>
> Despite my title these numbers are kinda disappointing to be honest, it's not
> where I wanted to be by now, but it's still an order-of-magnitude better than
> where we were for native FIO a few months ago. I believe almost all of this
> remaining slowdown is due to unnecessary ASI exits, the key areas being:
Nice, this broad approach does seem simple.
Obviously we really do need to see these numbers come down significantly for
this to be reasonably workable, as this kind of perf impact could really add up
at scale.
But from all you say, it seems very plausible that we can in fact
significantly reduce this.
I'm guessing the below are general issues that are holding back ASI as a
whole perf-wise?
>
> - On every context_switch(). Google's internal implementation has fixed this (we
> only really need it when switching mms).
How did you guys fix this?
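(I'm guessing the shape of the fix is roughly the below in the context switch
path - total guesswork on my part as to what "only when switching mms" means
in practice, asi_exit() as in your branch:)

  /* Pure guesswork at the shape of the fix, not Google's actual code:
   * only leave the restricted address space when the address space we
   * are running in actually changes. */
  if (prev->active_mm != next->active_mm)
          asi_exit();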
>
> - Whenever zeroing sensitive pages from the allocator. This could potentially be
> solved with the ephmap but requires a bit of care to avoid opening CPU attack
> windows.
Right, it seems that having a per-CPU mapping is a generally useful thing. I
wonder if we can actually generalise this beyond ASI...
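e.g. for the sensitive-page zeroing case I'd imagine something roughly like
this (again, the eph_*() names are invented):

  /* Rough sketch: zero a page through a per-CPU ephemeral mapping, so it
   * never needs to be reachable via the restricted physmap. */
  struct eph_ctx *ctx = eph_begin();            /* hypothetical */
  void *vaddr = eph_map_page(ctx, page);        /* hypothetical */

  clear_page(vaddr);

  eph_end(ctx);                                 /* teardown + local flush */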
By the way, a random thought: we really do need some generic page table code.
There's mm/pagewalk.c, which has install_pte(), but David and I have spoken
quite a few times about generalising past this (watch this space).
I also intend to add install_pmd() and install_pud() for the purposes of one
of my currently many pending series :P
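(For context, what exists today is the ->install_pte callback on struct
mm_walk_ops; the sketch below just shows where the new callbacks would slot
in - install_pmd()/install_pud() don't exist yet and my_install_*() are
placeholder names:)

  /* Sketch: the pagewalk code can already invoke an install_pte callback;
   * install_pmd()/install_pud() would be the additions mentioned above. */
  static const struct mm_walk_ops ops = {
          .install_pte = my_install_pte,
          /* .install_pmd = my_install_pmd,      (future) */
          /* .install_pud = my_install_pud,      (future) */
  };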
>
> - In copy-on-write for user pages. The ephmap could also help here but the
> current implementation doesn't support it (it only allows one allocation at a
> time per context).
Hmm, CoW is generally a pain. Could you go into more detail about what the
issue is here?
>
> .:: Next steps
>
> Here's where I'd like to go next:
>
> 1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
> sight" to a version of ASI that's viable for sandboxing native workloads. I
> don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
> out of the "but what about the page cache" black hole. It seems provably
> solvable now.
Yes, I agree.
Obviously it'd be great to get some insight from the x86 guys, but it strikes
me that we're still broadly in mm territory here.
I do think the next step is to take the original ASI series, make it fully
upstreamable, and simply introduce the
CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION flag (defaulting to N, of course),
without the ephmap work in place yet; rather, a minimal implementation.
And in the config/docs/commit messages etc. you can indicate its limitations
and perf overhead.
I think, given the numerous RFCs and talks, we're good for you to just send
that as a normal series and get some proper review going, ideally also with
some bots running with ASI switched on (all* + random configs should do that
for free), plus some syzbot action.
That way we have the roots in place and can build further upon that, but nobody
is impacted unless they decide to consciously opt in despite the documented
overhead + limitations.
>
> 2. Once we have some x86 maintainers saying "yep, it looks like this can work
> and it's something we want", I can start turning my page_alloc RFC [3] into a
> proper patchset (or maybe multiple if I can find a way to break things down
> further).
>
> Note what I'm NOT proposing is to carry on working on this branch until ASI is
> as fast as I am claiming it eventually will be. I would like to avoid doing that
> since I believe the biggest unknowns on that path are now solved, and it would
> be more useful to start getting down to nuts and bolts, i.e. reviewing real,
> PATCH-quality code and merging precursor stuff. I think this will lead to more
> useful discussions about the overall design, since so far all my postings have
> been so long and rarefied that it's been hard to really get a good conversation
> going.
Yes, absolutely agreed.
Send the ASI core series as a normal series and let's get the base stuff in
tree and some serious review going.
>
> .:: Conclusion
>
> So, x86 folks: Does this feel like "line of sight" to you? If not, what would
> that look like, what experiments should I run?
From an mm point of view, I think obviously the ephmap stuff you have now is
hacky (as you point out clearly in [5] yourself :) but the general approach
seems sensible.
>
> ---
>
> [0] https://lore.kernel.org/lkml/DAJ0LUX8F2IW.Q95PTFBNMFOI@google.com/
> [1] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/
> [2] https://lore.kernel.org/linux-mm/20190612170834.14855-1-mhillenb@amazon.de/
> [3] https://lore.kernel.org/lkml/20250313-asi-page-alloc-v1-0-04972e046cea@google.com/
> [4] https://github.com/bjackman/nixos-flake/commit/be42ba326f8a0854deb1d37143b5c70bf301c9db
> [5] https://github.com/bjackman/linux/tree/asi/6.16
>
Cheers, Lorenzo