linux-kernel - Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230522235838.ov3722lcusotzlvo@amd.com>
Date:   Mon, 22 May 2023 18:58:38 -0500
From:   Michael Roth <michael.roth@....com>
To:     Sean Christopherson <seanjc@...gle.com>
CC:     David Hildenbrand <david@...hat.com>,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        <dhildenb@...hat.com>, Quentin Perret <qperret@...gle.com>,
        <tabba@...gle.com>, <wei.w.wang@...el.com>,
        Mike Rapoport <rppt@...nel.org>,
        Liam Merwick <liam.merwick@...cle.com>,
        Isaku Yamahata <isaku.yamahata@...il.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Ackerley Tng <ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, Hugh Dickins <hughd@...gle.com>,
        Christian Brauner <brauner@...nel.org>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
 KVM: mm: fd-based approach for supporting KVM)

On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> On Mon, May 22, 2023, Michael Roth wrote:
> > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > On Thu, May 11, 2023, Michael Roth wrote:
> > I put together a tree with some fixups that are needed for against the
> > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > preparing the initial private state as suggested above, and a
> > platform-configurable mask that the x86 MMU code can use for determining
> > whether a fault is for private vs. shared pages.
> > 
> >   KVM: x86: Determine shared/private faults using a configurable mask
> >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> >   KVM: x86: Add platform hooks for private memory invalidations
> 
> Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> being zapped is private.

kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
it might be tricky to tie RMP updates into that path.

> 
> That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> there's a way to cleanly handle that case without another hook.  I think it's worth
> mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> code subtly relies on private memslots being immutable.

I thought the direction you sort of driving at was to completely decouple
RMP updates for physical pages from the KVM MMU map/unmap paths since the
life-cycles of those backing pages and associated RMP state are somewhat
separate from the state of the GFNs and kvm->mem_attr_array. It seems to
make sense when dealing with things like this unbind() case.

There's also cases like userspaces that opt to not discard memory after
conversions because they highly favor performance over memory usage. In
those cases it would make sense to defer marking the pages as shared in
the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
KVM MMU invalidation path after a conversion.

> 
> >   KVM: x86: Add platform hook for initializing private memory
> 
> This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

I think that approach would work fine. The way I was thinking of things
is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
page before mapping it into the guest, so moving it out into an explicit
call should work just as well. That would also drop the need for the
__kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
need to access the PFN prior to making it private.

> 
> The only reason to add another arch hook would be if we wanted to converted the
> RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> userspace to effectively prefault guest memory, e.g. to setup the RMP before
> running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> e.g. on the dest during live migration to reduce jitter.

Agreed, deferring the RMPUPDATE until it's actually needed would give us
more flexibility on optimizing for things like lazy-acceptance.

For less-common scenarios like preallocation it makes sense to make that
an opt-in sort of thing for userspace to configure explicitly.

> 
> >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> 
> There was another bug in this path.  The math for handling a non-zero offsets into
> the file was wrong.  The code now looks like:
> 
> 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> 		struct kvm_gfn_range gfn_range = {
> 			.start = slot->base_gfn + start - slot->gmem.index,

Sorry if I'm missing something here, but isn't there a risk that:

  start - slot->gmem.index

would be less than zero? E.g. starting GFN was 0, but current slot is bound
at some non-zero offset in the same gmem instance. I guess the warning below
shouldn't caught that, but it seems like a real scenario.

Since 'index' corresponds to the gmem offset of the current slot, is there any
reason not to do something like this?:

  .start = slot->base_gfn + index - slot->gmem.index,

But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
we case just simplify to this?:

  .start = slot->base_gfn,

-Mike

> 			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
> 			.slot = slot,
> 			.pte = __pte(0),
> 			.may_block = true,
> 		};
> 
> 		if (WARN_ON_ONCE(start < slot->gmem.index ||
> 				 end > slot->gmem.index + slot->npages))
> 			continue;
> 
> 		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> 
> 		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
> 	}