Message-ID: <Yz71ogila0mSHxxJ@google.com>
Date: Thu, 6 Oct 2022 15:34:58 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: Jarkko Sakkinen <jarkko@...nel.org>
Cc: Chao Peng <chao.p.peng@...ux.intel.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
linux-doc@...r.kernel.org, qemu-devel@...gnu.org,
Paolo Bonzini <pbonzini@...hat.com>,
Jonathan Corbet <corbet@....net>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
Hugh Dickins <hughd@...gle.com>,
Jeff Layton <jlayton@...nel.org>,
"J . Bruce Fields" <bfields@...ldses.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
Steven Price <steven.price@....com>,
"Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
Vlastimil Babka <vbabka@...e.cz>,
Vishal Annapurve <vannapurve@...gle.com>,
Yu Zhang <yu.c.zhang@...ux.intel.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
ak@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
ddutile@...hat.com, dhildenb@...hat.com,
Quentin Perret <qperret@...gle.com>,
Michael Roth <michael.roth@....com>, mhocko@...e.com,
Muchun Song <songmuchun@...edance.com>, wei.w.wang@...el.com
Subject: Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based
private memory
On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields, private_fd/private_offset, which allow
> > > userspace to specify that guest private memory is provided from
> > > private_fd, with guest_phys_addr mapped at private_offset within
> > > private_fd and spanning a range of memory_size.
> > >
> > > The extended memslot can still have the userspace_addr (hva). When used,
> > > a single memslot can maintain both private memory through the private fd
> > > (private_fd/private_offset) and shared memory through the hva
> > > (userspace_addr). Whether the private or the shared part is visible to
> > > the guest is maintained by other KVM code.
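
(For reference, a rough C sketch of the extended userspace-facing struct that
the above describes; the field layout is inferred from the description rather
than copied from the patch, and reserved padding is omitted:)

	struct kvm_userspace_memory_region_ext {
		struct kvm_userspace_memory_region region; /* includes userspace_addr (hva) */
		__u64 private_offset;	/* offset into private_fd for this slot */
		__u32 private_fd;	/* fd providing guest private memory */
		/* ... reserved padding ... */
	};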
> >
> > What is the appeal of the private_offset field anyway, instead of having
> > just a 1:1 association between regions and files, i.e. one memfd per
> > region?
Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
E.g. if a vCPU converts a single page, it will be forced to wait until all other
vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
memory. KVM's memslot updates also hold a mutex for the entire duration of the
update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
the SRCU problem.
KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
memslot is deleted.
Taking both a private_fd and a shared userspace address allows userspace to convert
between private and shared without having to manipulate memslots.
Paolo's original idea (sent off-list):
: The problem is that KVM_SET_USER_MEMORY_REGION and memslots in general
: are designed around (S)RCU. It is way too slow (in both QEMU and KVM)
: to be called on every private<->shared conversion with 4K granularity,
: and it tends naturally to have quadratic behavior (though, at least for
: KVM, the in-progress "fast memslots" series would avoid that).
:
: Since private PTEs are persistent, and userspace cannot access the memfd
: in any other way, userspace could use fallocate() to map/unmap an
: address range as private, and KVM can treat everything that userspace
: hasn't mapped as shared.
:
: This would be a new entry in struct guest_ops, called by fallocate(),
: and the callback can take the mmu_lock for write to avoid racing with
: page faults. This doesn't add any more contention than
: KVM_SET_USER_MEMORY_REGION, since the latter takes slots_lock. If
: there's something I'm missing then the mapping operation can use a
: ioctl, while the unmapping can keep using FALLOC_FL_PUNCH_HOLE.
:
: Then:
:
: - for simplicity, mapping a private memslot fails if there are any
: mappings (similar to the handling when F_SEAL_GUEST is set).
:
: - for TDX, accessing a nonexistent private PTE will cause a userspace
: exit for a shared->private conversion request. For SNP, the guest will
: do a page state change VMGEXIT to request an RMPUPDATE, which can cause
: a userspace exit too; the consequent fallocate() on the private fd
: invokes RMPUPDATE.
:
: - trying to map a shared PTE where there's already a private PTE causes
: a userspace exit for a private->shared conversion request.
: kvm_faultin_pfn or handle_abnormal_pfn can query this in the private-fd
: inode, which is essentially a single pagecache_get_page call.
:
: - if userspace asks to map a private PTE where there's already a shared
: PTE (which it can check because it has the mmu_lock taken for write),
: KVM unmaps the shared PTE.
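
(To make the fallocate() flow concrete, here is a minimal userspace sketch of
a single-page conversion under this scheme; the helpers are illustrative, and
whether the "map" side stays a plain fallocate() or becomes a dedicated ioctl
is an open question per the above:)

	#define _GNU_SOURCE
	#include <fcntl.h>

	#define PAGE_SIZE 4096UL

	/* Backing the range in the private fd marks it private ... */
	static int convert_to_private(int private_fd, off_t offset)
	{
		return fallocate(private_fd, 0, offset, PAGE_SIZE);
	}

	/* ... and punching a hole makes it shared again. */
	static int convert_to_shared(int private_fd, off_t offset)
	{
		return fallocate(private_fd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, PAGE_SIZE);
	}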
> >
> > If this were the case, then an extended struct would not be needed in the
> > first place. A simple union inside the existing struct would do:
> >
> > 	union {
> > 		__u64 userspace_addr;
> > 		__u64 private_fd;
> > 	};
>
> Also, why is this mechanism just for fds with the MFD_INACCESSIBLE flag? I'd
> consider instead having a KVM_MEM_FD flag. For generic KVM (if the memfd does
> not have MFD_INACCESSIBLE set), KVM could just use the memory as it uses
> mapped memory today. This would simplify userspace code, as you can then use
> the same thing for both cases.
I explored this idea too[*]. Because we want to support specifying both the
private and shared backing stores in a single memslot, we need two file
descriptors so that shared memory can also use fd-based memory.
[*] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
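
(Hypothetical usage sketch, assuming the existing KVM_SET_USER_MEMORY_REGION
ioctl accepts the extended struct when KVM_MEM_PRIVATE is set; variable names
are illustrative, and the snippet needs <linux/kvm.h> and <sys/ioctl.h>:)

	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0x100000000ULL,
			.memory_size     = mem_size,
			/* shared half of the slot: ordinary hva mapping */
			.userspace_addr  = (__u64)shared_hva,
		},
		/* private half: fd-based, e.g. a memfd created with
		 * MFD_INACCESSIBLE, starting at private_offset */
		.private_fd     = private_memfd,
		.private_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);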