Message-ID: <DS0PR11MB63735DAF7F168405D120A5C6DCA32@DS0PR11MB6373.namprd11.prod.outlook.com>
Date: Wed, 17 Jul 2024 15:03:04 +0000
From: "Wang, Wei W" <wei.w.wang@...el.com>
To: James Houghton <jthoughton@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, Marc Zyngier <maz@...nel.org>,
"Oliver Upton" <oliver.upton@...ux.dev>, James Morse <james.morse@....com>,
"Suzuki K Poulose" <suzuki.poulose@....com>, Zenghui Yu
<yuzenghui@...wei.com>, "Sean Christopherson" <seanjc@...gle.com>, Shuah Khan
<shuah@...nel.org>, "Axel Rasmussen" <axelrasmussen@...gle.com>, David
Matlack <dmatlack@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "kvmarm@...ts.linux.dev"
<kvmarm@...ts.linux.dev>, Peter Xu <peterx@...hat.com>
Subject: RE: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd
On Wednesday, July 17, 2024 1:10 AM, James Houghton wrote:
> You're right that, today, including support for guest-private memory
> *only* indeed simplifies things (no async userfaults). I think your strategy for
> implementing post-copy would work (so, shared->private conversion faults for
> vCPU accesses to private memory, and userfaultfd for everything else).
Yes, this works and is what we have been using for our internal tests.
>
> I'm not 100% sure what should happen in the case of a non-vCPU access to
> should-be-private memory; today it seems like KVM just provides the shared
> version of the page, so conventional use of userfaultfd shouldn't break
> anything.
This seems to be the trusted I/O usage (I'm not aware of other usages; emulated device
backends, such as vhost, work with shared pages). Migration support for trusted device
passthrough doesn't seem to be architecturally ready yet. Especially for post-copy:
AFAIK, even the legacy VM case lacks support for device passthrough (not sure if
you've implemented it internally). So it seems too early to discuss this in detail.
>
> But eventually guest_memfd itself will support "shared" memory,
OK, I've thought about this. I'm not sure how feasible it would be to extend gmem for
shared memory. I think questions like the ones below need to be investigated:
#1 What are the tangible benefits of gmem-based shared memory, compared to the
legacy shared memory that we have now?
#2 There would be some gaps in making gmem usable for shared pages. For
example, would it allow userspace to map the pages (without security concerns)?
#3 If gmem gets extended to support huge pages like hugetlb (e.g. 1GB), would it
run into the same issues as hugetlb?
The support for using gmem for shared memory isn't in place yet, and it seems
to be a dependency for the support being added here.
> and
> (IIUC) it won't use VMAs, so userfaultfd won't be usable (without changes
> anyway). For a non-confidential VM, all memory will be "shared", so shared-
> >private conversions can't help us there either.
> Starting everything as private almost works (so using private->shared
> conversions as a notification mechanism), but if the first time KVM attempts to
> use a page is not from a vCPU (and is from a place where we cannot easily
> return to userspace), the need for "async userfaults"
> comes back.
Yeah, this needs to be resolved for KVM Userfault. If gmem is used for private
pages only, this wouldn't be an issue (shared pages would be covered by userfaultfd).
>
> For this use case, it seems cleaner to have a new interface. (And, as far as I can
> tell, we would at least need some kind of "async userfault"-like mechanism.)
>
> Another reason why, today, KVM Userfault is helpful is that userfaultfd has a
> couple drawbacks. Userfaultfd migration with HugeTLB-1G is basically
> unusable, as HugeTLB pages cannot be mapped at PAGE_SIZE. Some discussion
> here[1][2].
>
> Moving the implementation of post-copy to KVM means that, throughout
> post-copy, we can avoid changes to the main mm page tables, and we only
> need to modify the second stage page tables. This saves the memory needed
> to store the extra set of shattered page tables, and we save the performance
> overhead of the page table modifications and accounting that mm does.
It would be nice to see some data comparing KVM faults and userfaultfd,
e.g. the end-to-end latency of handling a page fault by fetching the data from the
source. (I didn't find such data in the links you shared; please correct me if I missed it.)
> We don't necessarily need a way to go from no-fault -> fault for a page, that's
> right[4]. But we do need a way for KVM to be able to allow the access to
> proceed (i.e., go from fault -> no-fault). IOW, if we get a fault and come out to
> userspace, we need a way to tell KVM not to do that again.
> In the case of shared->private conversions, that mechanism is toggling the memory
> attributes for a gfn. For conventional userfaultfd, that's using
> UFFDIO_COPY/CONTINUE/POISON.
> Maybe I'm misunderstanding your question.
We can come back to this after the dependency discussion above is settled. (If gmem is only
used for private pages, the support for post-copy, including the changes required in VMMs,
would be simpler.)