Message-ID: <20230522135036.wnvsmryhkvstwvw2@amd.com>
Date: Mon, 22 May 2023 08:50:36 -0500
From: Michael Roth <michael.roth@....com>
To: Sean Christopherson <seanjc@...gle.com>
CC: David Hildenbrand <david@...hat.com>,
Chao Peng <chao.p.peng@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
"Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
Vlastimil Babka <vbabka@...e.cz>,
Vishal Annapurve <vannapurve@...gle.com>,
Yu Zhang <yu.c.zhang@...ux.intel.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
<dhildenb@...hat.com>, Quentin Perret <qperret@...gle.com>,
<tabba@...gle.com>, <wei.w.wang@...el.com>,
Mike Rapoport <rppt@...nel.org>,
Liam Merwick <liam.merwick@...cle.com>,
Isaku Yamahata <isaku.yamahata@...il.com>,
Jarkko Sakkinen <jarkko@...nel.org>,
Ackerley Tng <ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, Hugh Dickins <hughd@...gle.com>,
Christian Brauner <brauner@...nel.org>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
KVM: mm: fd-based approach for supporting KVM)
On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> On Thu, May 11, 2023, Michael Roth wrote:
> > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > >
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > Hi Sean,
> >
> > I've been working on getting the SNP patches ported to this but I'm having
> > some trouble working out a reasonable scheme for how to work the
> > RMPUPDATE hooks into the proposed design.
> >
> > One of the main things is kvm_gmem_punch_hole(): this can free pages
> > back to the host whenever userspace feels like it. Pages that are still
> > marked private in the RMP table will blow up the host if they aren't returned
> > to the normal state before handing them back to the kernel. So I'm trying to
> > add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> > e.g.:
> >
> > static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> > 				loff_t len)
> > {
> > 	struct kvm_gmem *gmem = file->private_data;
> > 	pgoff_t start = offset >> PAGE_SHIFT;
> > 	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > 	struct kvm *kvm = gmem->kvm;
> > 
> > 	/*
> > 	 * Bindings must be stable across invalidation to ensure the start+end
> > 	 * are balanced.
> > 	 */
> > 	filemap_invalidate_lock(file->f_mapping);
> > 	kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> > 
> > 	/* Handle arch-specific cleanups before releasing pages */
> > 	kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> > 	truncate_inode_pages_range(file->f_mapping, offset, offset + len);
> > 
> > 	kvm_gmem_invalidate_end(kvm, gmem, start, end);
> > 	filemap_invalidate_unlock(file->f_mapping);
> > 
> > 	return 0;
> > }
> >
> > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> > the page in its intended state in the RMP table prior to mapping it into the
> > guest's NPT.
>
> IMO, this approach is wrong. kvm->mem_attr_array is the source of truth for whether
> userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
> array has zero insight into the _host_ physical pages. I.e. SNP shouldn't hook
> kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
> wrong.
>
> A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
> memslots, but it's a good analogy for KVM internals). KVM needs to zap all mappings
> for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
> Same for changing attributes; unmap, but never map.
>
> As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
> NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
> establishing a new NPT mapping. And mmu_invalidate_in_progress will reach '0' only
> after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
> can create new mappings only when both kvm->mem_attr_array and any relevant
> guest_mem bindings have reached steady state.
>
> That leaves the question of when/where to do RMP updates. Off the cuff, I think
> RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
> the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim)
> or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
> Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
> installing a shared mapping should _never_ need to touch the RMP because pages
> coming from the normal (non-guest_mem) pool must already be host owned.
Hi Sean, thanks for the suggestions.
I reworked things based on this approach and things seem to work out
pretty nicely for SNP.
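On the invalidation (hole punch / truncation) side, the SNP hook ends up
doing roughly the following (simplified sketch; the exact signature is
still in flux, and snp_lookup_rmpentry()/rmp_make_shared() come from the
SNP host-support patches):

void kvm_arch_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end)
{
	kvm_pfn_t pfn;

	for (pfn = start; pfn < end; pfn++) {
		bool assigned;
		int rc, level;

		rc = snp_lookup_rmpentry(pfn, &assigned, &level);
		if (rc || !assigned)
			continue;

		/*
		 * Transition the page back to the shared/hypervisor-owned
		 * state before it gets freed back to the host, otherwise a
		 * later host write would trigger an RMP #PF.
		 */
		if (rmp_make_shared(pfn, PG_LEVEL_4K))
			pr_warn("Failed to restore pfn 0x%llx to shared state\n", pfn);
	}
}

That way the pages released by truncate_inode_pages_range() are always
safe for the host to reuse.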
I needed to add the hook for preparing the initial private state to
kvm_gmem_get_pfn() instead of kvm_gmem_get_folio(), because SNP needs to
know the GFN in order to mark the page as private in the RMP table, but
otherwise I think things are the same as what you had in mind. One
downside to this approach is that since the hook always gets called
during kvm_gmem_get_pfn(), we need to do an extra RMP lookup to determine
whether or not the page has already been set to the private state, rather
than being able to assume it's already in the expected state. It's only a
memory access though, so not a huge overhead. Not sure whether that would
be a concern or not on the TDX side.
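For reference, the preparation hook ends up looking something like this
(again just a sketch; kvm_arch_gmem_prepare() is a placeholder name here,
and rmp_make_private()/sev_get_asid() come from the SEV/SNP patches):

int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
{
	bool assigned;
	int rc, level;

	/*
	 * The extra RMP lookup mentioned above: the hook runs on every
	 * kvm_gmem_get_pfn(), so skip the RMPUPDATE if the page was
	 * already made private earlier.
	 */
	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
	if (rc)
		return rc;
	if (assigned)
		return 0;

	/* Transition the host-owned page to guest-owned/private in the RMP. */
	return rmp_make_private(pfn, gfn_to_gpa(gfn), PG_LEVEL_4K,
				sev_get_asid(kvm), false);
}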
I put together a tree with some fixups that are needed against the
kvm_gmem_solo base tree, along with a set of hooks to handle invalidations,
prepare the initial private state as suggested above, and a
platform-configurable mask that the x86 MMU code can use to determine
whether a fault is for private vs. shared pages:
KVM: x86: Determine shared/private faults using a configurable mask
^ for TDX we could trivially add an inverted analogue of the mask/logic
KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
KVM: x86: Add platform hooks for private memory invalidations
KVM: x86: Add platform hook for initializing private memory
*fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
*fixup (kvm_gmem_solo): KVM: selftests: update kvm_create_guest_memfd struct usage
https://github.com/mdroth/linux/commits/kvm_gmem_solo_x86
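The mask-based private/shared check itself ends up being pretty simple,
something along these lines (struct field name is illustrative, not
final):

static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
{
	/*
	 * SNP hardware reports the encrypted bit in the page-fault error
	 * code (PFERR_GUEST_ENC_MASK), so the mask is applied to the error
	 * code; TDX would instead key off the shared bit in the GPA, hence
	 * the inverted analogue of the mask/logic mentioned above.
	 */
	return err & kvm->arch.mmu_private_fault_mask;
}

which is also why the full 64-bit error code needs to be plumbed through
kvm_mmu_do_page_fault().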
I'm hoping these are similarly usable for TDX, but could use some input
from TDX folks on that aspect.
> >
> > Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
> > spinlock, because we also need to unmap pages from the directmap, which can
> > lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> > need to work around.
This concern also ends up going away: allocations that can trigger reclaim
(__GFP_RECLAIM) have the same problem when done under kvm->mmu_lock, so
kvm_gmem_get_pfn() already has to be called outside of it, and having the
hook there guarantees we won't hit these issues.
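i.e. the resulting ordering in the fault path is roughly the following
(simplified; kvm_arch_gmem_prepare() being the placeholder name from the
sketch above):

  kvm_tdp_page_fault()
    kvm_faultin_pfn()
      kvm_gmem_get_pfn()          <- sleepable: RMPUPDATE, directmap updates
        kvm_arch_gmem_prepare()
    ...
    take kvm->mmu_lock            <- atomic from here on
    kvm_tdp_mmu_map()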
-Mike
> >
> > Thanks!
> >
> > -Mike
> >
> > [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> >