linux-kernel - Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220926142330.GC2658254@chaop.bj.intel.com>
Date:   Mon, 26 Sep 2022 22:23:30 +0800
From:   Chao Peng <chao.p.peng@...ux.intel.com>
To:     Fuad Tabba <tabba@...gle.com>
Cc:     Sean Christopherson <seanjc@...gle.com>,
        David Hildenbrand <david@...hat.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
        linux-doc@...r.kernel.org, qemu-devel@...gnu.org,
        Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
        Hugh Dickins <hughd@...gle.com>,
        Jeff Layton <jlayton@...nel.org>,
        "J . Bruce Fields" <bfields@...ldses.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
        Steven Price <steven.price@....com>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
        ak@...ux.intel.com, aarcange@...hat.com, ddutile@...hat.com,
        dhildenb@...hat.com, Quentin Perret <qperret@...gle.com>,
        Michael Roth <michael.roth@....com>, mhocko@...e.com,
        Muchun Song <songmuchun@...edance.com>, wei.w.wang@...el.com,
        Will Deacon <will@...nel.org>, Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
> > tightly control usage without taking on too much complexity in the kernel, but
> > working through things, routing the behavior through the shim itself might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> >      mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> >      the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> >      one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if this
> >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> >      to do an expensive check every time.
> >
> >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> >   {
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> >         if ((vma->vm_end - vma->vm_start) != <file size>)
> >                 return -EINVAL;
> >
> >         mutex_lock(&data->lock);
> >
> >         if (data->has_mapping) {
> >                 r = -EINVAL;
> >                 goto err;
> >         }
> >         list_for_each_entry(notifier, &data->notifiers, list) {
> >                 r = notifier->ops->mmap_start(notifier, ...);
> >                 if (r)
> >                         goto abort;
> >         }
> >
> >         notifier->ops->mmap_end(notifier, ...);
> >         mutex_unlock(&data->lock);
> >         return 0;
> >
> >   abort:
> >         list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
> >                 notifier->ops->mmap_abort(notifier, ...);
> >   err:
> >         mutex_unlock(&data->lock);
> >         return r;
> >   }
> >
> >   static void memfd_restricted_close(struct vm_area_struct *vma)
> >   {
> >         mutex_lock(...);
> >
> >         /*
> >          * Destroy the memfd and disable all future accesses if there are
> >          * outstanding refcounts (or other unsatisfied restrictions?).
> >          */
> >         if (<outstanding references> || ???)
> >                 memfd_restricted_destroy(...);
> >         else
> >                 data->has_mapping = false;
> >
> >         mutex_unlock(...);
> >   }
> >
> >   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> >   {
> >         return -EINVAL;
> >   }
> >
> >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> >   {
> >         return -EINVAL;
> >   }
> >
> > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> >
> >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> >      memory into the guest (after pre-boot phase).
> >
> >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> >      and only if the entire gfn range of the associated memslot is shared.
> 
> In general I think that this would work with pKVM. However, limiting
> private<->shared conversions to the granularity of a whole memslot
> might be difficult to handle in pKVM, since the guest doesn't have the
> concept of memslots. For example, in pKVM right now, when a guest
> shares back its restricted DMA pool with the host it does so at the
> page-level. pKVM would also need a way to make an fd accessible again
> when shared back, which I think isn't possible with this patch.

But does pKVM really want to mmap/munmap a new region at the page-level,
that can cause VMA fragmentation if the conversion is frequent as I see.
Even with a KVM ioctl for mapping as mentioned below, I think there will
be the same issue.

> 
> You were initially considering a KVM ioctl for mapping, which might be
> better suited for this since KVM knows which pages are shared and
> which ones are private. So routing things through KVM might simplify
> things and allow it to enforce all the necessary restrictions (e.g.,
> private memory cannot be mapped). What do you think?
> 
> Thanks,
> /fuad