linux-kernel - Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZGwHFPnNK89/t7wx@google.com>
Date:   Tue, 23 May 2023 00:21:40 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Michael Roth <michael.roth@....com>
Cc:     David Hildenbrand <david@...hat.com>,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        dhildenb@...hat.com, Quentin Perret <qperret@...gle.com>,
        tabba@...gle.com, wei.w.wang@...el.com,
        Mike Rapoport <rppt@...nel.org>,
        Liam Merwick <liam.merwick@...cle.com>,
        Isaku Yamahata <isaku.yamahata@...il.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
        Christian Brauner <brauner@...nel.org>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
 KVM: mm: fd-based approach for supporting KVM)

On Mon, May 22, 2023, Michael Roth wrote:
> On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> > On Mon, May 22, 2023, Michael Roth wrote:
> > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Michael Roth wrote:
> > > I put together a tree with some fixups that are needed for against the
> > > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > > preparing the initial private state as suggested above, and a
> > > platform-configurable mask that the x86 MMU code can use for determining
> > > whether a fault is for private vs. shared pages.
> > > 
> > >   KVM: x86: Determine shared/private faults using a configurable mask
> > >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > >   KVM: x86: Add platform hooks for private memory invalidations
> > 
> > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> > a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> > kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> > being zapped is private.
> 
> kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
> it might be tricky to tie RMP updates into that path.

Gah, I caught the mmu_lock issue before the end of my email, but forgot to go back
and rethink the first half.

> > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> > if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> > there's a way to cleanly handle that case without another hook.  I think it's worth
> > mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> > code subtly relies on private memslots being immutable.
m 
> I thought the direction you sort of driving at was to completely decouple
> RMP updates for physical pages from the KVM MMU map/unmap paths since the
> life-cycles of those backing pages and associated RMP state are somewhat
> separate from the state of the GFNs and kvm->mem_attr_array. It seems to
> make sense when dealing with things like this unbind() case.
> 
> There's also cases like userspaces that opt to not discard memory after
> conversions because they highly favor performance over memory usage. In
> those cases it would make sense to defer marking the pages as shared in
> the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
> KVM MMU invalidation path after a conversion.

Hmm, right.  I got overzealous in my desire to avoid new hooks.

> > >   KVM: x86: Add platform hook for initializing private memory
> > 
> > This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> > code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> > e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().
> 
> I think that approach would work fine. The way I was thinking of things
> is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
> page before mapping it into the guest, so moving it out into an explicit
> call should work just as well. That would also drop the need for the
> __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
> need to access the PFN prior to making it private.
> 
> > 
> > The only reason to add another arch hook would be if we wanted to converted the
> > RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> > waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> > userspace to effectively prefault guest memory, e.g. to setup the RMP before
> > running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> > e.g. on the dest during live migration to reduce jitter.
> 
> Agreed, deferring the RMPUPDATE until it's actually needed would give us
> more flexibility on optimizing for things like lazy-acceptance.
> 
> For less-common scenarios like preallocation it makes sense to make that
> an opt-in sort of thing for userspace to configure explicitly.
> 
> > 
> > >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> > 
> > There was another bug in this path.  The math for handling a non-zero offsets into
> > the file was wrong.  The code now looks like:
> > 
> > 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> > 		struct kvm_gfn_range gfn_range = {
> > 			.start = slot->base_gfn + start - slot->gmem.index,
> 
> Sorry if I'm missing something here, but isn't there a risk that:
> 
>   start - slot->gmem.index
> 
> would be less than zero? E.g. starting GFN was 0, but current slot is bound
> at some non-zero offset in the same gmem instance. I guess the warning below
> shouldn't caught that, but it seems like a real scenario.

Heh, only if there's a testcase for it.  Assuming start >= the slot offset does
seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
will have index==slot->gmem.index > start.

> Since 'index' corresponds to the gmem offset of the current slot, is there any
> reason not to do something like this?:
> 
>   .start = slot->base_gfn + index - slot->gmem.index,
> 
> But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
> we case just simplify to this?:
> 
>   .start = slot->base_gfn,

No, e.g. if start is partway through a memslot, there's no need to invalidate
the entire memslot.  I'll stare at this tomorrow when my brain is hopefully a
bit more functional, I suspect there is a min() and/or max() needed somewhere.