Message-ID: <05c32a14-805c-4603-9afc-80e8f29b7957@lucifer.local>
Date: Fri, 22 Aug 2025 15:22:04 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Brendan Jackman <jackmanb@...gle.com>
Cc: peterz@...radead.org, bp@...en8.de, dave.hansen@...ux.intel.com,
mingo@...hat.com, tglx@...utronix.de, akpm@...ux-foundation.org,
david@...hat.com, derkling@...gle.com, junaids@...gle.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, reijiw@...gle.com,
rientjes@...gle.com, rppt@...nel.org, vbabka@...e.cz, x86@...nel.org,
yosry.ahmed@...ux.dev, Matthew Wilcox <willy@...radead.org>,
Liam Howlett <liam.howlett@...cle.com>,
"Kirill A. Shutemov" <kas@...nel.org>,
Harry Yoo <harry.yoo@...cle.com>, Jann Horn <jannh@...gle.com>,
Pedro Falcato <pfalcato@...e.de>, Andy Lutomirski <luto@...nel.org>,
Josh Poimboeuf <jpoimboe@...nel.org>, Kees Cook <kees@...nel.org>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)
On Thu, Aug 21, 2025 at 12:15:04PM +0000, Brendan Jackman wrote:
> > OK I had a bunch of questions here but looked at the code :)
> >
> > So the idea is we have a per-CPU buffer that is equal to the size of the largest
> > possible folio, for each process.
> >
> > I wonder by the way if we can cache page tables rather than alloc on bring
> > up/tear down? Or just zap? That could help things.
>
> Yeah if I'm catching your gist correctly, we have done a bit of this in
> the Google-internal version. In cases where it's fine to fail to map
> stuff (as is the case for ephmap users in this branch) you can just have
> a little pool of pre-allocated pagetables that you can allocate from in
> arbitrary contexts. Maybe the ALLOC_TRYLOCK stuff could also be useful
> here, I haven't explored that.
Yeah nice, seems like an easy win!
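Just to make sure we mean the same thing, the shape I had in mind is
something like the below - pure sketch, every name here is invented, and
presumably the ALLOC_TRYLOCK stuff Brendan mentions could slot into the
"get" side as a further fallback:

	/* Indicative includes: linux/mm.h, linux/percpu.h, linux/gfp.h */
	#define ASI_PT_POOL_SIZE	8

	struct asi_pt_pool {
		struct page	*pages[ASI_PT_POOL_SIZE];
		unsigned int	nr;
	};
	static DEFINE_PER_CPU(struct asi_pt_pool, asi_pt_pool);

	/* Any context: NULL just means the caller falls back (e.g. asi_exit()). */
	static struct page *asi_pt_pool_get(void)
	{
		struct asi_pt_pool *pool;
		struct page *page = NULL;
		unsigned long flags;

		local_irq_save(flags);
		pool = this_cpu_ptr(&asi_pt_pool);
		if (pool->nr)
			page = pool->pages[--pool->nr];
		local_irq_restore(flags);
		return page;
	}

	/* Refill from somewhere sleepable, e.g. on return to userspace. */
	static void asi_pt_pool_refill(void)
	{
		struct page *page;
		unsigned long flags;
		bool added;

		while ((page = alloc_page(GFP_KERNEL | __GFP_ZERO))) {
			struct asi_pt_pool *pool;

			local_irq_save(flags);
			pool = this_cpu_ptr(&asi_pt_pool);
			added = pool->nr < ASI_PT_POOL_SIZE;
			if (added)
				pool->pages[pool->nr++] = page;
			local_irq_restore(flags);
			if (!added) {
				__free_page(page);
				break;
			}
		}
	}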
>
> >>
> >> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> >> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
> >
> > I do wonder if we need to have a separate kmap thing or whether we can just
> > adjust what already exists?
>
> Yeah, I also wondered this. I think we could potentially just change the
> semantics of kmap_local_page() to suit ASI's needs, but I'm not really
> clear if that's consistent with the design or if there are perf
> concerns regarding its existing usecase. I am hoping once we start to
> get the more basic ASI stuff in, this will be a topic that will interest
> the right people, and I'll be able to get some useful input...
I think Matthew again might have some thoughts here.
>
> > Presumably we will restrict ASI support to 64-bit kernels only (starting with
> > and perhaps only for x86-64), so we can avoid the highmem bs.
>
> Yep.
Cool. If only we could move the rest of the kernel to this :)
>
> >>
> >> The ephmap can then be used for accessing file pages. It's also a generic
> >> mechanism for accessing sensitive data, for example it could be used for
> >> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
> >>
> >> .:: State of the branch
> >>
> >> The branch contains:
> >>
> >> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
> >> to "mm/page_alloc: Add support for ASI-unmapping pages")
> >> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
> >> cmdline flag")
> >> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
> >> ASI page faults")
> >> - A prototype of the new performance improvements (the remainder of the
> >> branch).
> >>
> >> There's a gradient of quality where the earlier patches are closer to "complete"
> >> and the later ones are increasingly messy and hacky. Comments and commit message
> >> describe lots of the hacky elements but the most important things are:
> >>
> >> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
> >> This is just a shortcut to make its behaviour obvious. Since tmpfs is the
> >> most extreme case of the read/write slowdown this should give us some idea of
> >> the performance improvements but it obviously hides a lot of important
> >> complexity wrt how this would be integrated "for real".
> >
> > Right, at what level do you plan to put the 'real' stuff?
> >
> > generic_file_read_iter() + equivalent or something like this? But then you'd
> > miss some fs obv., so I guess filemap_read()?
>
> Yeah, just putting it into these generic stuff seemed like the most
> obvious way, but I was also hoping there could be some more general way
> to integrate it into the page cache or even something like the iov
> system. I did not see anything like this yet, but I don't think I've
> done the full quota of code-gazing that I'd need to come up with the
> best idea here. (Also maybe the solution becomes obvious if I can find
> the right pair of eyes).
I think you'd need filemap_read() and possibly filemap_splice_read()? Not
sure the iterator stuff is the right level of abstraction at all, as this
should be explicitly about the page cache - but then maybe we just want to
use this _generally_? Probably a combination of:
- Checking what every filesystem ultimately uses
- Empirically testing different approaches
is the way to go.
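For the filemap_read() option, the shape I'd imagine is roughly the below
(pure sketch - ephmap_map()/ephmap_unmap() are the hypothetical per-CPU
mapping API, and this assumes the whole folio is mapped contiguously):

	/* Indicative includes: linux/uio.h, linux/mm.h */
	static size_t asi_copy_folio_to_iter(struct folio *folio, size_t offset,
					     size_t bytes, struct iov_iter *iter)
	{
		void *kaddr = ephmap_map(folio);	/* per-CPU local mapping */
		size_t copied;

		if (!kaddr)
			/* No ephmap slot free: fall back to the asi_exit() path. */
			return copy_folio_to_iter(folio, offset, bytes, iter);

		copied = copy_to_iter(kaddr + offset, bytes, iter);
		ephmap_unmap(kaddr);			/* local TLB flush only */
		return copied;
	}

i.e. keep the existing copy_folio_to_iter() as the fallback whenever a
local mapping can't be had.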
>
> Anyway, my hope is that the number of filesystems that are both a) very
> special implementation-wise and b) dear to the hearts of
> performance-sensitive users is quite small, so maybe just injecting into
> the right pre-existing filemap.c helpers, plus one or two
> filesystem-specific additions, already gets us almost all the way there.
Yeah, I think the bulk of them use some form of generic_*().
>
> >>
> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
> >> shmem usecase. I don't think this is really important though, whatever we end
> >> up with needs to be very simple, and it's not even clear that we actually
> >> want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> >> kmap_local_page() itself).
> >
> > Right, just testing stuff out, fair enough. Obviously not an upstreamable thing
> > but sort of a test case, right?
>
> Yeah exactly.
>
> Maybe worth adding here that I explored just using vmalloc's allocator
> for this. My experience was that despite looking quite nicely optimised
> re avoiding synchronisation, just the simple fact of traversing its data
> structures is too slow for this usecase (at least, it did poorly on my
> super-sensitive FIO benchmark setup).
Yeah, I think honestly vmalloc is fairly unoptimised in many ways; while
Ulad is doing fantastic work, there's a lot of legacy cruft and
duplication there.
>
> >> 3. For software correctness, the ephmap only needs to be TLB-flushed on the
> >> local CPU. But for CPU vulnerability mitigation, flushes are needed on other
> >> CPUs too. I believe these flushes should only be needed very infrequently.
> >> "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
> >> how these flushes could be implemented, but it's a bit of a simplistic
> >> implementation. The commit message has some more details.
> >
> > Yeah, I am no security/x86 expert so you'll need insight from those with a
> > better understanding of both, but I think it's worth taking the time to have
> > this do the minimum possible that we can prove is necessary in any real-world
> > scenario.
>
> I can also add a bit of colour here in case it piques any interest.
>
> What I think we can do is an mm-global flush whenever there's a
> possibility that the process is losing logical access to a physical
> page. So basically I think that's whenever we evict from the page cache,
> or the user closes a file.
>
> ("Logical access" = we would let them do a read() that gives them the
> contents of the page).
>
> The key insight is that a) those events are relatively rare and b)
> already often involve big TLB flushes. So doing global flushes there is
> not that bad, and this allows us to forget about all the particular
> details of which pages might have TLB entries on which CPUs and just say
> "_some_ CPU in this MM might have _some_ stale TLB entry", which is
> simple and efficient to track.
I guess it's rare to get truncation mid-way through a read(); closing the
file mid-way through would be... a bug, surely? :P
I may be missing context here, however.
But yes, we can probably not worry at all about the perf of _that_.
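If I've understood the tracking you describe, it could be as dumb as a
single per-mm flag - sketch below, where mm->asi_tlb_dirty and the helpers
are invented names:

	/* Indicative includes: linux/mm_types.h, asm/tlbflush.h */
	static inline void asi_note_local_map(struct mm_struct *mm)
	{
		/* Racy 0 -> 1 is fine; set on first ephmap use in this mm. */
		if (!READ_ONCE(mm->asi_tlb_dirty))
			WRITE_ONCE(mm->asi_tlb_dirty, true);
	}

	/* Call on page-cache eviction / file close, i.e. loss of logical access. */
	static void asi_flush_mm_if_dirty(struct mm_struct *mm)
	{
		if (!READ_ONCE(mm->asi_tlb_dirty))
			return;
		/* Clear first: a racing new mapping just re-dirties it. */
		WRITE_ONCE(mm->asi_tlb_dirty, false);
		flush_tlb_mm(mm);	/* rides the existing IPI machinery */
	}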
>
> So yeah actually this doesn't really require too much security
> understanding, it's mostly just a job of making sure we don't forget a
> place where the flush would be needed, and then tying it nicely with the
> existing TLB infrastructure so that we can aggregate the flushes and
> avoid redundant IPIs. It's fiddly, but in a fun way. So I think this is
> "the easy bit".
>
Cool.
I guess starting conservative is sensible for security though.
> > It's good to start super conservative though.
> >
> >>
> >> .:: Performance
> >>
> >> This data was gathered using the scripts at [4]. This is running on a Sapphire
> >> Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
> >> asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
> >> know of any vulns that would necessitate this IBPB, so this is basically a weird
> >> selectively-paranoid configuration of ASI. It doesn't really make sense from a
> >> security perspective. A few years from now (once the security researchers have
> >> had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
> >> that it turns out to be exactly an IBPB like this, but it's reasonably likely to
> >> be something with a vaguely similar performance overhead.
> >
> > I mean, this all sounds like you should drop this :)
> >
> > What do the numbers look like without it?
>
> Sure, let's see...
>
> (Minor note: I said above that setcpuid=retbleed triggered this IBPB but
> I just noticed that's wrong, in the code I've posted the IBPB is
> hard-coded. So to disable it I'm setting clearcpuid=ibpb).
>
> metric: compile-kernel_elapsed (ns) | test: compile-kernel_host
> +---------+---------+--------+--------+--------+------+
> | variant | samples | mean | min | max | Δμ |
> +---------+---------+--------+--------+--------+------+
> | asi-off | 0 | 35.10s | 35.00s | 35.16s | |
> | asi-on | 0 | 36.85s | 36.77s | 37.00s | 5.0% |
> +---------+---------+--------+--------+--------+------+
>
> My first guess at the main source of that 5% would be the address space
> switches themselves. At the moment you'll see that __asi_enter() and
> asi_exit() always clear the noflush bit in CR3 meaning they trash the
> TLB. This is not particularly difficult to address, it just means
> extending all the existing stuff in tlb.c etc to deal with an additional
> address space (this is done in Google's internal version).
Cool, sounds like it would just be a bit fiddly then.
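For anyone following along, the bit in question is X86_CR3_PCID_NOFLUSH
(bit 63 of CR3). Very roughly the below, deferring to the real
build_cr3()/build_cr3_noflush() helpers in arch/x86/mm/tlb.c, which also
handle LAM etc.:

	/* Indicative includes: asm/processor-flags.h, linux/mm.h */
	/*
	 * Illustration only: with the noflush bit set, the CPU keeps the
	 * non-global TLB entries for the target PCID; always clearing it
	 * on ASI transitions is what trashes the TLB today.
	 */
	static inline unsigned long asi_build_cr3(pgd_t *pgd, u16 pcid, bool noflush)
	{
		unsigned long cr3 = __pa(pgd) | pcid;

		if (noflush)
			cr3 |= X86_CR3_PCID_NOFLUSH;	/* keep TLB entries */
		return cr3;
	}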
>
> (But getting rid of the asi_exits() completely is the higher-priority
> optimisation. On most CPUs that TLB trashing is gonna be less
> significant than the actual security flushes, which can't be avoided if
> we do transition. This is why I introduced the IBPB, since otherwise
> Sapphire Rapids makes things look a bit too easy. See the bullet points
> below for what I think is needed to eliminate most of the transitions).
>
Ack.
> >> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
> >> +---------+---------+-----------+---------+-----------+---------------+
> >> | variant | samples | mean | min | max | delta mean |
> >> +---------+---------+-----------+---------+-----------+---------------+
> >> | asi-off | 10 | 1,003,102 | 981,813 | 1,036,142 | |
> >> | asi-on | 10 | 871,928 | 848,362 | 885,622 | -13.1% |
> >> +---------+---------+-----------+---------+-----------+---------------+
> >>
> >> Native kernel compilation time:
> >> +---------+---------+--------+--------+--------+-------------+
> >> | variant | samples | mean | min | max | delta mean |
> >> +---------+---------+--------+--------+--------+-------------+
> >> | asi-off | 3 | 34.84s | 34.42s | 35.31s | |
> >> | asi-on | 3 | 37.50s | 37.39s | 37.58s | 7.6% |
> >> +---------+---------+--------+--------+--------+-------------+
> >>
> >> Kernel compilation in a guest VM:
> >> +---------+---------+--------+--------+--------+-------------+
> >> | variant | samples | mean | min | max | delta mean |
> >> +---------+---------+--------+--------+--------+-------------+
> >> | asi-off | 3 | 52.73s | 52.41s | 53.15s | |
> >> | asi-on | 3 | 55.80s | 55.51s | 56.06s | 5.8% |
> >> +---------+---------+--------+--------+--------+-------------+
> >
> > (tiny nit but I think the bottom two are meant to be negative or the first
> > positive :P)
>
> The polarities are correct - more FIO IOPS is better, more kernel
> compilation duration is worse. (Maybe I should make my scripts aware of
> which direction is better for each metric!)
>
Ahhh, right. I just read it as a raw directional delta, where you decide
whether +ve or -ve is good. But that makes sense!
> >> Despite my title these numbers are kinda disappointing to be honest, it's not
> >> where I wanted to be by now, but it's still an order-of-magnitude better than
> >> where we were for native FIO a few months ago. I believe almost all of this
> >> remaining slowdown is due to unnecessary ASI exits, the key areas being:
> >
> > Nice, this broad approach does seem simple.
> >
> > Obviously we really do need to see these numbers come down significantly for
> > this to be reasonably workable, as this kind of perf impact could really add up
> > at scale.
> >
> > But from all you say it seems very plausible that we can in fact significantly
> > reduce this.
> >
> > Am guessing the below are general issues that are holding back ASI as a whole
> > perf-wise?
> >
> >>
> >> - On every context_switch(). Google's internal implementation has fixed this (we
> >> only really need it when switching mms).
> >
> > How did you guys fix this?
>
> The only issue here is that it makes CR3 unstable in places where it was
> formerly stable: if you're in the restricted address space, an interrupt
> might show up and cause an asi_exit() at any time. (CR3 is already
> unstable when preemption is on because the PCID can get recycled). So we
> just had to updated the CR3 accessor API and then hunt for places that
> access CR3 directly.
Ack. Doesn't seem... too egregious?
>
> Other than that, we had to fiddle around with the lifetime of struct asi
> a bit (this doesn't really add complexity TBH, we just made it live as
> long as the mm_struct). Then we can stay in the restricted address space
> across context_switch() within the same mm, including to a kthread and
> back.
>
Ugh, mm lifetime is already a bit horrendous with the various forking
stuff, and exit_mmap() is a nightmare - ref. Liam's recent RFC on this.
So we need to tread carefully :)
> >> - Whenever zeroing sensitive pages from the allocator. This could potentially be
> >> solved with the ephmap but requires a bit of care to avoid opening CPU attack
> >> windows.
> >
> > Right, seems that having a per-CPU mapping is a generally useful thing. I wonder
> > if we can actually generalise this past ASI...
> >
> > By the way a random thought, but we really do need some generic page table code,
> > there's mm/pagewalk.c which has install_pte(), but David and I have spoken quite
> > few times about generalising past this (watch this space).
>
> OK good to know, Yosry and I both did some fiddling around trying to
> come up with cute ways to write this kinda code but in the end I think
> the best way is quite dependent on maintainer preference.
>
Yeah, the whole situation is a bit of a mess still tbh. Let's see on review.
> > I do intend to add install_pmd() and install_pud() also for the purposes of one
> > of my currently many pending series :P
> >
> >>
> >> - In copy-on-write for user pages. The ephmap could also help here but the
> >> current implementation doesn't support it (it only allows one allocation at a
> >> time per context).
> >
> > Hmm, CoW generally a pain. Could you go into more detail as to what's the issue
> > here?
>
> It's just that you have two user pages that you wanna touch at once
> (src, dst). This crappy ephmap implementation doesn't support two
> mappings at once in the same context, so the second allocation fails, so
> you always get an asi_exit().
Right... well, can we just have space for 2 then? ;) It's mappings, not
actually allocating pages, so... :)
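Something like a fixed two-slot window per CPU, say - sketch only, every
name here invented, and assuming the caller has migration disabled as with
kmap_local_page():

	/* Indicative includes: linux/percpu.h, linux/bitops.h, asm/page.h */
	#define EPHMAP_SLOTS	2
	static DEFINE_PER_CPU(unsigned long, ephmap_used);	/* slot bitmap */

	static void *ephmap_map_page(struct page *page)
	{
		unsigned long *used = this_cpu_ptr(&ephmap_used);
		unsigned int slot = find_first_zero_bit(used, EPHMAP_SLOTS);

		if (slot == EPHMAP_SLOTS)
			return NULL;			/* both slots busy */
		__set_bit(slot, used);
		return ephmap_install(slot, page);	/* hypothetical: write the PTE */
	}

	static int asi_cow_copy(struct page *src, struct page *dst)
	{
		void *from = ephmap_map_page(src);
		void *to;

		if (!from)
			return -EAGAIN;		/* fall back to asi_exit() path */
		to = ephmap_map_page(dst);
		if (!to) {
			ephmap_unmap_page(from);
			return -EAGAIN;
		}
		copy_page(to, from);
		ephmap_unmap_page(to);
		ephmap_unmap_page(from);
		return 0;
	}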
>
> >>
> >> .:: Next steps
> >>
> >> Here's where I'd like to go next:
> >>
> >> 1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
> >> sight" to a version of ASI that's viable for sandboxing native workloads. I
> >> don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
> >> out of the "but what about the page cache" black hole. It seems provably
> >> solvable now.
> >
> > Yes I agree.
> >
> > Obviously it'd be great to get some insight from x86 guys, but strikes me we're
> > still broadly in mm territory here.
>
> Implementation wise, certainly. It's just that I'd prefer not to take
> up loads of everyone's time hashing out implementation details if
> there's a risk that the x86 guys NAK it when we get to their part.
I think it's better to just go ahead with the series; everybody's super
busy, so a discussion thread like this is less likely to get meaningful
responses than a real series would.
>
> > I do think the next step is to take the original ASI series, make it fully
> > upstreamable, and simply introduce the CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
> > flag, default to N of course, without the ephmap work yet in place, rather a
> > minimal implementation.
>
> I think even this would actually be too big, reviewing all that at once
> would be quite unpleasant even in the absolutely minimal case. But yes I
> think we can get a series-of-series that does this :)
Well, generally I mean we should just get going with some iterative series :)
>
> > And in the config/docs/commit msgs etc. you can indicate its limitations and
> > perf overhead.
> >
> > I think with numerous RFCs and talks we're good for you to just send that as a
> > normal series and get some proper review going and ideally some bots running
> > with ASI switched on also (all* + random configs should do that for free) + some
> > syzbot action.
> >
> > That way we have the roots in place and can build further upon that, but nobody
> > is impacted unless they decide to consciously opt in despite the documented
> > overhead + limitations.
> >
> >>
> >> 2. Once we have some x86 maintainers saying "yep, it looks like this can work
> >> and it's something we want", I can start turning my page_alloc RFC [3] into a
> >> proper patchset (or maybe multiple if I can find a way to break things down
> >> further).
> >>
> >> Note what I'm NOT proposing is to carry on working on this branch until ASI is
> >> as fast as I am claiming it eventually will be. I would like to avoid doing that
> >> since I believe the biggest unknowns on that path are now solved, and it would
> >> be more useful to start getting down to nuts and bolts, i.e. reviewing real,
> >> PATCH-quality code and merging precursor stuff. I think this will lead to more
> >> useful discussions about the overall design, since so far all my postings have
> >> been so long and rarefied that it's been hard to really get a good conversation
> >> going.
> >
> > Yes absolutely agreed.
> >
> > Send the ASI core series as normal series and let's get the base stuff in tree
> > and some serious review going.
> >
> >>
> >> .:: Conclusion
> >>
> >> So, x86 folks: Does this feel like "line of sight" to you? If not, what would
> >> that look like, what experiments should I run?
> >
> > From an mm point of view, I think obviously the ephmap stuff you have now is
> > hacky (as you point out clearly in [5] yourself :) but the general approach
> > seems sensible.
>
> Great, thanks so much for taking a look!
No problem :)
Cheers, Lorenzo