linux-kernel - Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b21f41e5-0322-bbfb-b9c2-db102488592d@amd.com>
Date:   Thu, 11 Aug 2022 15:32:07 +0530
From:   "Nikunj A. Dadhania" <nikunj@....com>
To:     Chao Peng <chao.p.peng@...ux.intel.com>,
        Sean Christopherson <seanjc@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
        Hugh Dickins <hughd@...gle.com>,
        Jeff Layton <jlayton@...nel.org>,
        "J . Bruce Fields" <bfields@...ldses.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
        Steven Price <steven.price@....com>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
        ak@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
        ddutile@...hat.com, dhildenb@...hat.com,
        Quentin Perret <qperret@...gle.com>,
        Michael Roth <michael.roth@....com>, mhocko@...e.com,
        Muchun Song <songmuchun@...edance.com>, bharata@....com,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
        qemu-devel@...gnu.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM
 guest private memory

On 06/07/22 13:50, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets invalidated.
>   - backing store callbacks: callbacks for KVM to call into memory
>     backing store to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
> 
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
> 
> -  Regresion test on kvm/queue (this patchset)
>    Most new code are not covered. Code also in below repo:
>    https://github.com/chao-p/linux/tree/privmem-v7
> 
> -  New Funational test on latest TDX code
>    The patch is rebased to latest TDX code and tested the new
>    funcationalities. See below repos:
>    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7

While debugging an issue with SEV+UPM, found that fallocate() returns 
an error in QEMU which is not handled (EINTR). With the below handling 
of EINTR subsequent fallocate() succeeds:


diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
index af8fb0c957..e8597ed28d 100644
--- a/backends/hostmem-memfd-private.c
+++ b/backends/hostmem-memfd-private.c
@@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     MachineState *machine = MACHINE(qdev_get_machine());
     uint32_t ram_flags;
     char *name;
-    int fd, priv_fd;
+    int fd, priv_fd, ret;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 
-    fallocate(priv_fd, 0, 0, backend->size);
+again:
+    ret = fallocate(priv_fd, 0, 0, backend->size);
+    if (ret) {
+           perror("Fallocate failed: \n");
+           if (errno == EINTR)
+                   goto again;
+           else
+                   exit(1);
+    }

However, fallocate() preallocates full guest memory before starting the guest.
With this behaviour guest memory is *not* demand pinned. Is there a way to 
prevent fallocate() from reserving full guest memory?

> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
> 

Regards,
Nikunj