Message-ID: <f413cc20-66fc-cf1e-47ab-b8f099c89583@redhat.com>
Date: Wed, 1 Sep 2021 10:09:07 +0200
From: David Hildenbrand <david@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>,
Andy Lutomirski <luto@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Joerg Roedel <jroedel@...e.de>,
Andi Kleen <ak@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Tom Lendacky <thomas.lendacky@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Varad Gautam <varad.gautam@...e.com>,
Dario Faggioli <dfaggioli@...e.com>, x86@...nel.org,
linux-mm@...ck.org, linux-coco@...ts.linux.dev,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
"Kirill A . Shutemov" <kirill@...temov.name>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@...ux.intel.com>,
Dave Hansen <dave.hansen@...el.com>,
Yu Zhang <yu.c.zhang@...ux.intel.com>
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private
memory
>> Do we have to protect from that? How would KVM protect from user space
>> replacing private pages by shared pages in any of the models we discuss?
>
> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> as both private and shared, where "shared" also incorporates any mapping from the
> host. Essentially it boils down to the kernel ensuring that a pfn is unmapped
> before it's converted to/from private, and KVM ensuring that it honors any
> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> as proposed in this RFC.
Okay, so a fallocate(PUNCH_HOLE) from user space could trigger the
respective unmapping and freeing of the backing storage.
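
For reference, this is roughly what the user space side of such a hole punch looks like today on an ordinary memfd. It is only a sketch: a plain memfd stands in for the proposed private-memory fd, which does not exist yet, and the function name and sizes are made up for illustration. Nothing here shows the KVM-side unmapping; it only shows that the backing storage of the punched range is dropped while the file size is kept.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* (needs _GNU_SOURCE) */
#include <string.h>
#include <sys/mman.h>   /* memfd_create() */
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns 0 if the punched range reads back as
 * zeroes and a neighbouring page is intact, 1 if the kernel or filesystem
 * does not support the calls, and -1 on any other failure.
 */
int punch_hole_demo(void)
{
    enum { PAGE = 4096, NPAGES = 4 };
    char buf[PAGE];
    int ret = -1;

    int fd = memfd_create("guest-mem-demo", 0);
    if (fd < 0)
        return (errno == ENOSYS) ? 1 : -1;

    if (ftruncate(fd, PAGE * NPAGES) < 0)
        goto out;

    /* Populate every page so that there is backing storage to free. */
    memset(buf, 0xAA, sizeof(buf));
    for (int i = 0; i < NPAGES; i++)
        if (pwrite(fd, buf, PAGE, (off_t)i * PAGE) != PAGE)
            goto out;

    /* Free the backing storage of page 1 only; the file size is unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  PAGE, PAGE) < 0) {
        ret = (errno == EOPNOTSUPP || errno == ENOSYS) ? 1 : -1;
        goto out;
    }

    /* The hole must read back as zeroes ... */
    if (pread(fd, buf, PAGE, PAGE) != PAGE)
        goto out;
    for (int i = 0; i < PAGE; i++)
        if (buf[i] != 0)
            goto out;

    /* ... while a neighbouring page keeps its contents. */
    if (pread(fd, buf, PAGE, 2 * PAGE) != PAGE)
        goto out;
    ret = ((unsigned char)buf[0] == 0xAA) ? 0 : -1;
out:
    close(fd);
    return ret;
}
```

In the model discussed here, the interesting part is what the fd provider does inside the kernel between "unmap from KVM" and "free the page"; the call above is only the trigger.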
>
> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
Right.
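
To make the ordering concrete, here is a toy user space model of "notify downstream MMUs, then free". All names (toy_store, toy_mmu, the invalidate callback) are hypothetical and are not taken from the RFC or from existing kernel code; the model only captures the rule that the consumer must have dropped its mapping before the backing page goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NR_PAGES = 8 };

/* Hypothetical stand-in for a secondary MMU such as KVM. */
struct toy_mmu {
    bool mapped[NR_PAGES];      /* page index -> mapped into the guest? */
};

/* Hypothetical fd-based backing store with a single registered consumer. */
struct toy_store {
    bool allocated[NR_PAGES];
    struct toy_mmu *consumer;
    void (*invalidate)(struct toy_mmu *mmu, size_t start, size_t end);
};

/* Consumer side: honor the unmap notification for pages [start, end). */
void toy_mmu_invalidate(struct toy_mmu *mmu, size_t start, size_t end)
{
    for (size_t i = start; i < end && i < NR_PAGES; i++)
        mmu->mapped[i] = false;
}

/* Store side: punch a hole over [start, end); notify first, then free. */
void toy_store_punch_hole(struct toy_store *s, size_t start, size_t end)
{
    if (s->consumer && s->invalidate)
        s->invalidate(s->consumer, start, end);

    for (size_t i = start; i < end && i < NR_PAGES; i++) {
        /* Freeing a page that is still mapped would break the rule. */
        assert(!s->consumer || !s->consumer->mapped[i]);
        s->allocated[i] = false;
    }
}

/* Invariant: the consumer never maps a page that has no backing storage. */
bool toy_invariant_holds(const struct toy_store *s)
{
    for (size_t i = 0; i < NR_PAGES; i++)
        if (s->consumer && s->consumer->mapped[i] && !s->allocated[i])
            return false;
    return true;
}
```

Destroying the whole backing store would be the same sequence over the full range, which is the "no different" part of the statement above.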
>
> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
> but my argument is that that behavior is not visible to userspace and thus
> out of scope for this discussion, e.g. zapping/restoring the direct map can
> be added/removed without impacting the userspace ABI.
Right. Removing it shouldn't be required either, IMHO. There are other ways
to teach the kernel to not read/write some online pages (filter
/proc/kcore, disable hibernation, strict access checks for /dev/mem ...).
>
>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal
>>>> user memory slot. When creating an "encrypted" user memory slot, require that
>>>> the whole VMA is covered at creation time. You know the VMA can't change
>>>> later.
>>>
>>> This can work for the basic use cases, but even then I'd strongly prefer not to
>>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind
>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
>>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and
>>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers
>>> to handle mprotect()/munmap()/etc...
>>
>> Right, and for the existing use cases this worked. But encrypted memory
>> breaks many assumptions we once made ...
>>
>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
>> page tables but not actually mapped into user space page tables. There is no
>> way to reach these via the rmap.
>>
>> We have something like that already via vfio. And that is fundamentally
>> broken when it comes to mmu notifiers, page pinning, page migration, ...
>
> I'm not super familiar with VFIO internals, but the idea with the fd-based
> approach is that the backing-store would be in direct communication with KVM and
> would handle those operations through that direct channel.
Right. The problem I am seeing is that e.g., try_to_unmap() might not be
able to actually fully unmap a page, because some non-synchronized KVM
MMU still maps a page. It would be great to evaluate how the fd
callbacks would fit into the whole picture, including the current rmap.
I guess I'm missing the bigger picture of how it all fits together on the
!KVM side.
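
A deliberately tiny model of that concern, with made-up names and no relation to the real try_to_unmap() signature: the rmap walk can zap everything it can reach, but a secondary MMU that is not hooked into the unmap path keeps its mapping, so the page is not fully unmapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page bookkeeping for the model. */
struct toy_page {
    int  user_ptes;         /* mappings reachable through the rmap */
    bool kvm_mapped;        /* mapping held by a secondary (guest) MMU */
    bool kvm_synchronized;  /* is that MMU hooked into the unmap path? */
};

/* Returns true only if no mapping of the page is left afterwards. */
bool toy_try_to_unmap(struct toy_page *p)
{
    /* The rmap walk zaps all primary MMU PTEs it can find. */
    p->user_ptes = 0;

    /* A synchronized secondary MMU reacts to the notification. */
    if (p->kvm_synchronized)
        p->kvm_mapped = false;

    return p->user_ptes == 0 && !p->kvm_mapped;
}
```

With kvm_synchronized set the call succeeds; with it cleared and kvm_mapped set, the call fails even though the rmap walk did all it could. How the fd callbacks would close that gap is exactly the open question.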
>
>>> As is, I don't think KVM would get any kind of notification if userspace unmaps
>>> the VMA for a private memslot that does not have any entries in the host page
>>> tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page
>>> is touched by the backing store, but I don't think the end result would be any
>>> prettier than a dedicated API for KVM to consume.
>>>
>>> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
>>> to page migration or swap. For those types of operations, KVM currently just
>>> reacts to invalidation notifications by zapping guest PTEs, and then gets the
>>> new pfn when the guest re-faults on the page. That sequence doesn't work for
>>> TDX or SEV-SNP because the trusted agent needs to do the memcpy() of the page
>>> contents, i.e. the host needs to call into KVM for the actual migration.
>>
>> Right, but I still think this is a kernel internal. You can do such
>> handshake later in the kernel IMHO.
>
> It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
> migration and that would invert the mmu_notifier API, e.g. instead of "telling"
> secondary MMUs to invalidate/change a mapping, the mm would be "asking"
> secondary MMUs "can you move this?". More below.
In my thinking, the rmap via mmu notifiers would do the unmapping
just as we know it (from primary MMU -> secondary MMU). Once
try_to_unmap() succeeded, the fd provider could kick off the migration
via whatever callback.
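
Sketched as a user space model, the ordering I have in mind looks roughly like this. Everything here is hypothetical (flow_page, flow_provider_ops, the migrate callback); it only illustrates "unmap first, and only then hand the page to the fd provider", not any existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page state for the model. */
struct flow_page {
    unsigned long pfn;
    bool mapped;        /* still mapped by some primary or secondary MMU? */
    bool unmappable;    /* can every mapping be reached and zapped? */
};

/* Hypothetical callback table supplied by the fd provider. */
struct flow_provider_ops {
    int (*migrate)(struct flow_page *page, unsigned long new_pfn);
};

/* Step 1: unmapping, primary MMU -> secondary MMUs via notifiers. */
static bool flow_try_to_unmap(struct flow_page *page)
{
    if (page->unmappable)
        page->mapped = false;
    return !page->mapped;
}

/* Step 2: only after a successful unmap, let the provider move the page. */
int flow_migrate(struct flow_page *page, unsigned long new_pfn,
                 const struct flow_provider_ops *ops)
{
    if (!flow_try_to_unmap(page))
        return -EBUSY;
    return ops->migrate(page, new_pfn);
}

/* Example provider callback: may assume that nobody maps the page. */
int flow_copy_and_retarget(struct flow_page *page, unsigned long new_pfn)
{
    assert(!page->mapped);
    page->pfn = new_pfn;    /* stands in for the actual content move */
    return 0;
}
```

In this shape the mm keeps "telling" secondary MMUs to unmap, and the question of who performs the copy is pushed entirely into the provider callback.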
>
>> But I also already thought: is it really KVM that is to perform the
>> migration or is it the fd-provider that performs the migration? Who says
>> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
>> knows how to migrate such a page?
>>
>> I'd love to have some details on how that's supposed to work, and which
>> information we'd need to migrate/swap/... in addition to the EPFN and a new
>> SPFN.
>
> KVM "needs" to do the migration. On TDX, the migration will be a SEAMCALL,
> a post-VMXON instruction that transfers control to the TDX-Module, that at
> minimum needs a per-VM identifier, the gfn, and the page table level. The call
The per-VM identifier and the GFN would be easy to grab. Page table
level, not so sure -- do you mean the general page table depth? Or if
it's mapped as 4k vs. 2M ... ? The latter could be answered by the fd
provider already I assume.
Does the page still have to be mapped into the secondary MMU when
performing the migration via TDX? I assume not, which would simplify
things a lot.
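
To illustrate what I mean by "the fd provider could answer the level question", here is a small sketch assuming that "level" means the mapping size (4k vs. 2M) and that each level covers 9 more GFN bits, as on x86-64. The struct, the enum values and the helper names are made up; they are not the TDX-Module or KVM interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical level encoding: 1 = 4k mapping, 2 = 2M mapping. */
enum toy_level { TOY_LEVEL_4K = 1, TOY_LEVEL_2M = 2 };

/* Hypothetical bundle of what the migration call was said to need. */
struct toy_migrate_args {
    uint64_t vm_id;         /* per-VM identifier */
    uint64_t gfn;           /* guest frame number of the mapping */
    enum toy_level level;   /* size of the mapping */
};

/* Number of 4k pages covered by one mapping at the given level. */
static uint64_t toy_pages_per_level(enum toy_level level)
{
    return UINT64_C(1) << (9 * (level - 1));
}

/* Derive the level from the mapping size the provider knows about. */
int toy_fill_migrate_args(struct toy_migrate_args *args, uint64_t vm_id,
                          uint64_t gfn, uint64_t mapping_bytes)
{
    enum toy_level level;

    if (mapping_bytes == UINT64_C(4096))
        level = TOY_LEVEL_4K;
    else if (mapping_bytes == (UINT64_C(2) << 20))
        level = TOY_LEVEL_2M;
    else
        return -EINVAL;

    /* A large mapping must start on a suitably aligned gfn. */
    if (gfn % toy_pages_per_level(level))
        return -EINVAL;

    args->vm_id = vm_id;
    args->gfn = gfn;
    args->level = level;
    return 0;
}
```

If "level" instead means the general page table depth, none of this applies and the provider would have no way to know it.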
> into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
> to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
> the backing-store attempting to migrate memory that KVM is unmapping due to a
> memslot change.
Something like that might be handled by fixing private memory slots,
similar to what I did in my draft, right?
>
> The per-VM identifier may not apply to SEV-SNP, but I believe everything else
> holds true.
Thanks!
--
Thanks,
David / dhildenb