linux-kernel - Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f413cc20-66fc-cf1e-47ab-b8f099c89583@redhat.com>
Date:   Wed, 1 Sep 2021 10:09:07 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>,
        Andy Lutomirski <luto@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Joerg Roedel <jroedel@...e.de>,
        Andi Kleen <ak@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Tom Lendacky <thomas.lendacky@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Varad Gautam <varad.gautam@...e.com>,
        Dario Faggioli <dfaggioli@...e.com>, x86@...nel.org,
        linux-mm@...ck.org, linux-coco@...ts.linux.dev,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private
 memory

>> Do we have to protect from that? How would KVM protect from user space
>> replacing private pages by shared pages in any of the models we discuss?
> 
> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> as both private and shared, where "shared" also incorporates any mapping from the
> host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
> before it's converted to/from private, and KVM ensuring that it honors any
> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> as proposed in this RFC.

Okay, so the fallocate(PUNCHHOLE) from user space could trigger the 
respective unmapping and freeing of backing storage.

> 
> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.

Right.

> 
> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
>      but my argument is that that behavior is not visible to userspace and thus
>      out of scope for this discussion, e.g. zapping/restoring the direct map can
>      be added/removed without impacting the userspace ABI.

Right. Removing it shouldn't also be requited IMHO. There are other ways 
to teach the kernel to not read/write some online pages (filter 
/proc/kcore, disable hibernation, strict access checks for /dev/mem ...).

> 
>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>>>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
>>>> user memory slot. When creating a "encryped" user memory slot, require that
>>>> the whole VMA is covered at creation time. You know the VMA can't change
>>>> later.
>>>
>>> This can work for the basic use cases, but even then I'd strongly prefer not to
>>> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
>>> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
>>> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
>>> to handle mprotect()/munmap()/etc...
>>
>> Right, and for the existing use cases this worked. But encrypted memory
>> breaks many assumptions we once made ...
>>
>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
>> page tables but not actually mapped into user space page tables. There is no
>> way to reach these via the rmap.
>>
>> We have something like that already via vfio. And that is fundamentally
>> broken when it comes to mmu notifiers, page pinning, page migration, ...
> 
> I'm not super familiar with VFIO internals, but the idea with the fd-based
> approach is that the backing-store would be in direct communication with KVM and
> would handle those operations through that direct channel.

Right. The problem I am seeing is that e.g., try_to_unmap() might not be 
able to actually fully unmap a page, because some non-synchronized KVM 
MMU still maps a page. It would be great to evaluate how the fd 
callbacks would fit into the whole picture, including the current rmap.

I guess I'm missing the bigger picture how it all fits together on the 
!KVM side.

> 
>>> As is, I don't think KVM would get any kind of notification if userpaces unmaps
>>> the VMA for a private memslot that does not have any entries in the host page
>>> tables.   I'm sure it's a solvable problem, e.g. by ensuring at least one page
>>> is touched by the backing store, but I don't think the end result would be any
>>> prettier than a dedicated API for KVM to consume.
>>>
>>> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
>>> to page migration or swap.  For those types of operations, KVM currently just
>>> reacts to invalidation notifications by zapping guest PTEs, and then gets the
>>> new pfn when the guest re-faults on the page.  That sequence doesn't work for
>>> TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page
>>> contents, i.e. the host needs to call into KVM for the actual migration.
>>
>> Right, but I still think this is a kernel internal. You can do such
>> handshake later in the kernel IMHO.
> 
> It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
> migration and that would invert the mmu_notifer API, e.g. instead of "telling"
> secondary MMUs to invalidate/change a mappings, the mm would be "asking"
> secondary MMus "can you move this?".  More below.

In my thinking, the the rmap via mmu notifiers would do the unmapping 
just as we know it (from primary MMU -> secondary MMU). Once 
try_to_unmap() succeeded, the fd provider could kick-off the migration 
via whatever callback.

> 
>> But I also already thought: is it really KVM that is to perform the
>> migration or is it the fd-provider that performs the migration? Who says
>> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
>> knows how to migrate such a page?
>>
>> I'd love to have some details on how that's supposed to work, and which
>> information we'd need to migrate/swap/... in addition to the EPFN and a new
>> SPFN.
> 
> KVM "needs" to do the migration.  On TDX, the migration will be a SEAMCALL,
> a post-VMXON instruction that transfers control to the TDX-Module, that at
> minimum needs a per-VM identifier, the gfn, and the page table level.  The call

The per-VM identifier and the GFN would be easy to grab. Page table 
level, not so sure -- do you mean the general page table depth? Or if 
it's mapped as 4k vs. 2M ... ? The latter could be answered by the fd 
provider already I assume.

Does the page still have to be mapped into the secondary MMU when 
performing the migration via TDX? I assume not, which would simplify 
things a lot.

> into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
> to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
> the backing-store attempting to migrate memory that KVM is unmapping due to a
> memslot change.

Something like that might be handled by fixing private memory slots 
similar to in my draft, right?

> 
> The per-VM identifier may not apply to SEV-SNP, but I believe everything else
> holds true.

Thanks!

-- 
Thanks,

David / dhildenb