[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZEM5Zq8oo+xnApW9@google.com>
Date:   Fri, 21 Apr 2023 18:33:26 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Chao Peng <chao.p.peng@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        dhildenb@...hat.com, Quentin Perret <qperret@...gle.com>,
        tabba@...gle.com, Michael Roth <michael.roth@....com>,
        wei.w.wang@...el.com, Mike Rapoport <rppt@...nel.org>,
        Liam Merwick <liam.merwick@...cle.com>,
        Isaku Yamahata <isaku.yamahata@...il.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
        Christian Brauner <brauner@...nel.org>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
 KVM: mm: fd-based approach for supporting KVM)
+Christian and Hugh
On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > An alternative that we haven't considered since the very early days is making the
> > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
> > code for "pure shim" implementation[3], that's actually not that crazy of an idea.
> 
> Yes.
> 
> > 
> > I don't know that I love the idea of burying this in KVM, but there are benefits
> > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> > that I created).
> 
> Yes, it's all better than jumping through hoops to come up with a bad name
> like "restrictedmem".
> 
> > 
> > The big benefit is that the layer of indirection goes away.  That simplifies things
> > like enhancing restrictedmem to allow host userspace access for debug purposes,
> > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> > access, likely the whole "share with a device" story if/when we get there, etc.
> > 
> > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> > that's likely to be true in practice regardless of where the code lands, and
> 
> Yes.
> 
> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
> 
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.
> 
> > 
> > If we can get Gunyah on board and they don't need substantial changes to the
> > restrictedmem implementation, then I'm all for continuing on the path we're on.
> > But if Gunyah wants to do their own thing, and the lightweight shim approach is
> > viable, then it's awfully tempting to put this all behind a KVM ioctl().
> 
> Right. Or we still succeed in finding a name that's not as bad as what we
> had so far.
Okie dokie.  So as not to bury the lede:
I think we should provide a KVM ioctl() and have KVM provide its own low-level
implementation, i.e. not wrap shmem.  As much as I don't want to have KVM serving
up memory to userspace, by trying to keep KVM out of the memory management business,
we're actually making things *more* complex and harder to maintain (and merge).
Hugh said something to this effect quite early on[1], it just unfortunately took
me a long time to understand the benefits.
 : QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
 : the fallocate syscall interface itself) to allocate and free the memory,
 : ioctl for initializing some of it too.  KVM in control of whether that
 : fd can be read or written or mmap'ed or whatever, no need to prevent it
 : in shmem.c, no need for flags, seals, notifications to and fro because
 : KVM is already in control and knows the history.  If shmem actually has
 : value, call into it underneath - somewhat like SysV SHM, and /dev/zero
 : mmap, and i915/gem make use of it underneath.  If shmem has nothing to
 : add, just allocate and free kernel memory directly, recorded in your
 : own xarray.
Christian also suggested that we stop trying to be lazy and create a proper API[2].
 : I find this hard to navigate tbh and it feels like taking a shortcut to
 : avoid building a proper api. If you only care about a specific set of
 : operations specific to memfd restricte that needs to be available to
 : in-kernel consumers, I wonder if you shouldn't just go one step further
 : then your proposal below and build a dedicated minimal ops api. Idk,
 : sketching like a madman on a drawning board here with no claim to
 : feasibility from a mm perspective whatsoever
The key point from Hugh is that, except for a few minor things that are trivially
easy to replicate, the things that make shmem "shmem" don't provide any value for
the KVM use case:
  - We have no plans to support swap, and migration support is dubious at best.
    Swap in particular brings in a lot of complexity for no benefit (to us).  That
    complexity doesn't mean depending on shmem is inherently bad, but it does mean
    rolling our own implementation is highly unlikely to result in reinventing
    shmem's wheel.
  - Sharing a backing file between arbitrary process is _unwanted_ for the initial
    use cases.  There may come a time when mutually trusted VMs can share "private"
    data, but (a) that's a distant future problem and (b) it'll likely require even
    more KVM control over the memory.
  - All of the interfaces for read/write/mmap/etc. are dead weight from our
    perspective.  Even worse, we have to actively avoid those interfaces.  That
    can kinda sorta be done without polluting shmem code by using a shim, but
    that has problems of its own (see below).
And Christian pointed out several flaws with wrapping shmem:
  - Implementing a partial overlay filesystem leads to inconsistencies because
    only some of the ops are changed, e.g. poking at the inode_operations or
    super_operations will show shmem stuff, whereas address_space_operations and
    file_operations will show restrictedmem stuff.
  - Usurping just f_ops and a_ops without creating a full blown filesystem
    avoids the partial overlay issues, but partially overriding ops isn't really
    supported (because it's weird and brittle), e.g. blindly forwarding a
    fallocate() after restrictedmem does it's thing "works", but only because
    we very carefully and deliberately designed restrictedmem on top of shmem.
On top of the points raised by Hugh and Christian, wrapping shmem isn't really
any simpler, just different.  E.g. there's a very subtle bug in the shim variant:
by passing SGP_WRITE, shmem skips zeroing the page because restrictemem is telling
shmem that it (restrictedmem) will immediately write the page.  For TDX and SNP,
that's a "feature" because in those cases trusted firmware will zero (or init)
memory when it's assigned to the guest, but it's a nasty flaw for other use cases.
I'm not saying that we'll magically avoid such bugs by avoiding shmem, just pointing
out that using shmem requires understanding exactly how shmem works, i.e. using
shmem isn't necessarily any easier than building directly on filemap and/or folio
APIs.  And I gotta imagine it will be a similar story if/when we want to add
hugetlbfs support.  Building on filemap/folio will definitely have its own challenges,
but after prototyping an implementation I can confidently say that none of the
problems will be insurmountable.  KVM has _way_ more complex code in its memslot
and MMU implementations.
And another benefit of building on filemap and/or folio APIs is that, because we
would be reinventing the wheel to some extent, when we do inevitablly run into
problems, it will be easier to get help solving those problems because (a) we won't
be doing weird things no one wants to deal with and (b) because the problems will
likely be things others have already dealt with.
The other angle I've been looking at is whether or not having KVM provide its
own implementation will lead to maintenance problems in the future, specifically
if we get to the point where we want to support "fancy" things like swap and
migration.  For migration, I'm quite confident that a dedicated KVM ioctl() versus
wrapping shmem would be at worst a wash, and more than likely simpler if KVM owns
everything.  E.g. migrating pages under TDX and SNP requires invoking magic
instructions, and so we'd be overriding ->migrate_folio() no matter what.
As for swap, I think we should put a stake in the ground and say that KVM will
never support swap for KVM's ioctl().  Sooo much of the infrastructure around
swap/reclaim is tied to userspace mappings, e.g. knowing which pages are LRU and/or
cold.  I poked around a bit to see how we could avoid reinventing all of that
infrastructure for fd-only memory, and the best idea I could come up with is
basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
allow "mapping" fd-only memory, but ensure that memory is never actually present
from hardware's perspective.
But on top of the various problems with that approach, the only use cases I can
think of for using fd-only to back non-confidential VMs is to guard against spurious
writes/reads to guest memory and/or avoid memory overhead for mapping guest
memory into the user page tables.  Avoiding memory overhead is completely defeated
if the page tables are populated PROT_NONE, which just leaves the "harden against
guest data corruption use case".  And for that specific use case, _if_ swap is
desirable, it would be far, far easier to teach the kernel to allow KVM to follow
PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel and/or
KVM how to swap fd-only memory.
In other words, fd-only memory is purely for slice-of-hardware VMs.  If someone
wants to overcommit VMs, then they use KVM's traditional API for mapping memory
into the guest.
Regarding putting this in KVM, as mentioned (way) above in the quote, the worst
case scenario of making this a KVM ioctl() seems to be that KVM will end up with
an ioctl() that redirects to common kernel code.  On the flip side, implementing
this in KVM gives us a much clearer path to getting all of this merged.  There will
be a few small non-KVM patches, but they will either be trivial, e.g. exporting
APIs (that aren't contentious) or not strictly required, e.g. adding a flag to
mark an address space as completely unmigratable (for improved performance).  I.e.
the non-KVM patches will be small and be very actionable for their maintainers,
and other than that we control our own destiny.
And on the topic of control, making this a KVM ioctl() and implementation gives
KVM a _lot_ of control.  E.g. we can make the file size immutable to simplify the
implementation, bind the fd to a single VM at creation time, add KVM-defined flags
for controlling hugepage behavior, etc.
Last but not least, I think forcing us KVM folks to get our hands at least a little
dirty with MM and FS stuff would be a *good* thing.  KVM has had more than a few
bugs that we missed in no small part because most of the people that work on KVM
have almost zero knowledge of MM and FS, and especially at the boundaries between
those two.
Note, I implemented the POC on top of the filemap APIs because that was much faster
and less error prone than re-implementing xarray management myself.  We may or may
not want to actually make the kernel at large aware of these allocations, i.e. it
may be better to follow Hugh's suggestion and use the folio APIs directly instead
of piggybacking filemap+folios.  E.g. I _think_ migration becomes a complete non-issue
if the pages are "raw" allocations and never get marked as LRU-friendly.  What I'm
not so sure about is if there is anything substantial that we'd lose out on by not
reusing the filemap stuff.
The POC also doesn't play nice with Ackerley's patches to allocate files on a
user-defined mount.  AIUI, that's largely a non-issue as something like fbind()
would provide a superset of that functionality and more than one maintainer has
expressed (handwavy) support for a generic fbind().
Code is available here if folks want to take a look before any kind of formal
posting:
	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
[1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
[2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
[3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com
Powered by blists - more mailing lists
 
