linux-kernel - Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZFwPBqGeW+d9xMEs@google.com>
Date:   Wed, 10 May 2023 14:39:18 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     Vishal Annapurve <vannapurve@...gle.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        dhildenb@...hat.com, Quentin Perret <qperret@...gle.com>,
        tabba@...gle.com, Michael Roth <michael.roth@....com>,
        wei.w.wang@...el.com, Mike Rapoport <rppt@...nel.org>,
        Liam Merwick <liam.merwick@...cle.com>,
        Isaku Yamahata <isaku.yamahata@...il.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
        Christian Brauner <brauner@...nel.org>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
 KVM: mm: fd-based approach for supporting KVM)

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
> 
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here.

Both :-)

> But I would still like to get this discussed here before we move on.
> 
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
> 
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.

Never mapping anything into the userspace page table is infeasible.  Technically
it's doable, but it'd effectively require all of the work of an fd-based approach
(and probably significantly more), _and_ it'd require touching core mm code.

VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
things, a VMA describes properties of what is mapped, e.g. hugetblfs versus
anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
RWX protections, etc.  But a VMA doesn't track the physical address, that info
is all managed through the userspace page tables.

To make it possible to allow userspace to mmap() but not access memory (without
redoing how the kernel fundamentally manages virtual=>physical mappings), the
simplest approach is to install PTEs into userspace page tables, but never mark
them Present in hardware, i.e. prevent actually accessing the backing memory.
This is is exactly what Kirill's series in link [3] below implemented.

Issues that led to us abandoning the "map with special !Present PTEs" approach:

 - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
   is inefficient and inflexible compared to software defined structures, especially
   for the expected use cases for CoCo guests.

 - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
   let alone a 1:1 pfn:gfn mapping.
 
 - Does not work for memory that isn't backed by 'struct page', e.g. if devices
   gain support for exposing encrypted memory regions to guests.

 - Poking into the VMAs to convert memory would be likely be less performant due
   to using infrastructure that is much "heavier", e.g. would require taking
   mmap_lock for write.

In short, shoehorning this into mmap() requires fighting how the kernel works at
pretty much every step, and in the end, adding e.g. fbind() is a lot easier.

> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
> 
> Caveat here is that once private pages are mapped into userspace view.
> 
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.

#3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
and then the guest converts that memory from shared to private, the backing pages
for the previously shared mapping will still be mapped by process B unless userspace
also ensures process B also unmaps on conversion.

#3 is also a limiter.  E.g. if a guest is primarly backed by 1GiB pages, keeping
the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
and possibly even if the guest converts a few MiB of memory.

> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com