Message-ID: <CADrL8HUv+RvazbOyx+NJ1oNd8FdMGd_T61Kjtia1cqJsN=WiOA@mail.gmail.com>
Date: Tue, 16 Jul 2024 10:10:27 -0700
From: James Houghton <jthoughton@...gle.com>
To: "Wang, Wei W" <wei.w.wang@...el.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Marc Zyngier <maz@...nel.org>, 
	Oliver Upton <oliver.upton@...ux.dev>, James Morse <james.morse@....com>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>, 
	Sean Christopherson <seanjc@...gle.com>, Shuah Khan <shuah@...nel.org>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, David Matlack <dmatlack@...gle.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>, 
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>, Peter Xu <peterx@...hat.com>
Subject: Re: [RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

On Mon, Jul 15, 2024 at 8:28 AM Wang, Wei W <wei.w.wang@...el.com> wrote:
>
> On Thursday, July 11, 2024 7:42 AM, James Houghton wrote:
> > This patch series implements the KVM-based demand paging system that was
> > first introduced back in November[1] by David Matlack.
> >
> > The working name for this new system is KVM Userfault, but that name
> > is very confusing, so it will not be the final name.
> >
> Hi James,
> I had implemented a similar approach for TDX post-copy migration; there
> are quite a few differences, though. I've got some questions about your
> design below.

Thanks for the feedback!!

>
> > Problem: post-copy with guest_memfd
> > ===================================
> >
> > Post-copy live migration makes it possible to migrate VMs from one
> > host to another no matter how fast they are writing to memory, while
> > keeping the VM paused for a minimal amount of time. For post-copy to
> > work, we need:
> >  1. to be able to prevent KVM from accessing particular pages of
> >     guest memory until we have populated them,
> >  2. for userspace to know when KVM is trying to access a particular
> >     page, and
> >  3. a way to allow the access to proceed.
> >
> > Traditionally, post-copy live migration is implemented using userfaultfd, which
> > hooks into the main mm fault path. KVM hits this path when it is doing HVA ->
> > PFN translations (with GUP) or when it itself attempts to access guest memory.
> > Userfaultfd sends a page fault notification to userspace, and KVM goes to sleep.
> >
> > Userfaultfd works well, as it is not specific to KVM; everyone who attempts to
> > access guest memory will block the same way.
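
For concreteness, the flow above, from the VMM's side, is roughly the
following (a minimal sketch; error handling is mostly elided, and
fetch_from_source() is a stand-in for the VMM's page-fetching logic):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register [addr, addr + len) for missing-page faults. */
static int uffd_register(unsigned long addr, unsigned long len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	return uffd;
}

/* Wait for one fault, then resolve it; the faulting thread wakes up. */
static int uffd_handle_one(int uffd, void *staging, unsigned long pgsz)
{
	struct uffd_msg msg;
	struct uffdio_copy copy;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
	    msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;

	/* fetch_from_source(staging, msg.arg.pagefault.address); */

	copy.dst = msg.arg.pagefault.address & ~(pgsz - 1);
	copy.src = (unsigned long)staging;
	copy.len = pgsz;
	copy.mode = 0;
	return ioctl(uffd, UFFDIO_COPY, &copy);
}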
> >
> > However, with guest_memfd, we do not use GUP to translate from GFN to HPA
> > (nor is there an intermediate HVA).
> >
> > So userfaultfd in its current form cannot be used to support post-copy live
> > migration with guest_memfd-backed VMs.
> >
> > Solution: hook into the gfn -> pfn translation
> > ==============================================
> >
> > The only way to implement post-copy with a non-KVM-specific userfaultfd-like
> > system would be to introduce the concept of a file-userfault[2] to intercept
> > faults on a guest_memfd.
> >
> > Instead, we take the simpler approach of adding a KVM-specific API, and we
> > hook into the GFN -> HVA or GFN -> PFN translation steps (for traditional
> > memslots and for guest_memfd respectively).
>
>
> Why take KVM_EXIT_MEMORY_FAULT exits for traditional shared pages
> (i.e. GFN -> HVA)?
> It seems simpler if we use KVM_EXIT_MEMORY_FAULT for private pages only,
> leaving shared pages to go through the existing userfaultfd mechanism:
> - The need for “asynchronous userfaults,” introduced by patch 14, could
>   be eliminated.
> - The additional support (e.g., KVM_MEMORY_EXIT_FLAG_USERFAULT) for
>   private page faults exiting to userspace for post-copy might not be
>   necessary, because all pages on the destination side are initially
>   “shared,” and the guest’s first access will always cause an exit to
>   userspace for the shared->private conversion. So the VMM can leverage
>   that exit to fetch the page data from the source (the VMM knows
>   whether a page’s data has been fetched from the source or not).

You're right that, today, supporting guest-private memory *only* does
simplify things (no async userfaults). I think your strategy for
implementing post-copy would work (i.e., shared->private conversion
faults for vCPU accesses to private memory, and userfaultfd for
everything else).
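
Concretely, I think the flow you're describing looks something like
this in the VMM's vCPU loop (a sketch against the existing
KVM_EXIT_MEMORY_FAULT / KVM_SET_MEMORY_ATTRIBUTES UAPI;
fetch_from_source() is again a stand-in for the VMM's fetching logic):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Handle one KVM_RUN exit during post-copy. */
static void handle_memory_fault(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs;

	if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
		return; /* other exit reasons elided */

	if (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) {
		/*
		 * First guest access to this range: fetch its contents
		 * from the source if we haven't already (tracked by the
		 * VMM), then do the shared->private conversion and
		 * re-enter the guest.
		 */
		/* fetch_from_source(run->memory_fault.gpa,
				     run->memory_fault.size); */

		attrs = (struct kvm_memory_attributes) {
			.address    = run->memory_fault.gpa,
			.size       = run->memory_fault.size,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};
		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}
}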

I'm not 100% sure what should happen in the case of a non-vCPU access
to should-be-private memory; today it seems like KVM just provides the
shared version of the page, so conventional use of userfaultfd
shouldn't break anything.

But eventually guest_memfd itself will support "shared" memory, and
(IIUC) it won't use VMAs, so userfaultfd won't be usable (without
changes anyway). For a non-confidential VM, all memory will be
"shared", so shared->private conversions can't help us there either.
Starting everything as private almost works (so using private->shared
conversions as a notification mechanism), but if the first time KVM
attempts to use a page is not from a vCPU (and is from a place where
we cannot easily return to userspace), the need for "async userfaults"
comes back.

For this use case, it seems cleaner to have a new interface. (And, as
far as I can tell, we would at least need some kind of "async
userfault"-like mechanism.)

Another reason why KVM Userfault is helpful today is that userfaultfd
has a couple of drawbacks. Userfaultfd-based migration with 1G HugeTLB
pages is basically unusable, as HugeTLB pages cannot be mapped at
PAGE_SIZE. There is some discussion here[1][2].

Moving the implementation of post-copy into KVM means that, throughout
post-copy, we can avoid changes to the main mm page tables and only
need to modify the second-stage page tables. This saves the memory
needed to store the extra set of shattered page tables, and we avoid
the performance overhead of the page table modifications and
accounting that mm does.

There's some more discussion about these points in David's RFC[3].

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/ZdcKwK7CXgEsm-Co@x1n/
[3]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/

> > I have intentionally added support for traditional memslots: the
> > complexity it adds is minimal, and it is useful for some VMMs, as it
> > can be used to fully implement post-copy live migration.
> >
> > Implementation Details
> > ======================
> >
> > Let's break down how KVM meets each of the three core requirements
> > for post-copy laid out above:
> >
> > --- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---
> >
> > The most straightforward way to inform KVM of userfault-enabled pages is to
> > use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.
> >
> > There is already infrastructure in place for modifying and checking memory
> > attributes. Using this interface is slightly challenging, as there is no UAPI for
> > setting/clearing particular attributes; we must set the exact attributes we want.
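
To illustrate the exact-set semantics: with a hypothetical
KVM_MEMORY_ATTRIBUTE_USERFAULT bit (the name is from this RFC, but the
value below is made up and is not UAPI), clearing userfault on a range
while keeping it private means respecifying the full attribute set:

#include <linux/kvm.h>
#include <stdbool.h>
#include <sys/ioctl.h>

#define KVM_MEMORY_ATTRIBUTE_USERFAULT	(1ULL << 4)	/* hypothetical */

static int clear_userfault(int vm_fd, __u64 gpa, __u64 size, bool private)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size    = size,
		/*
		 * There is no set/clear-one-bit UAPI: every attribute
		 * the range should keep must be restated, with USERFAULT
		 * simply left out.
		 */
		.attributes = private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}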
> >
> > The synchronization that is in place for updating memory attributes is not
> > suitable for post-copy live migration either, which will require updating
> > memory attributes (from userfault to no-userfault) very frequently.
> >
> > Another potential interface could be to use something akin to a dirty bitmap,
> > where a bitmap describes which pages within a memslot (or VM) should trigger
> > userfaults. This way, updating the userfault status of a page can be
> > made cheap.
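
By contrast, if the userfault status lived in a bitmap shared with KVM
(purely illustrative; no such UAPI exists today), resolving a page
would be a single atomic bit-clear, with no attribute-update ioctl in
the loop:

#include <stdatomic.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

/* Illustrative only: mark a gfn as no longer userfault-enabled. */
static void userfault_resolve(_Atomic unsigned long *bitmap,
			      unsigned long gfn)
{
	atomic_fetch_and_explicit(&bitmap[gfn / BITS_PER_LONG],
				  ~(1UL << (gfn % BITS_PER_LONG)),
				  memory_order_release);
}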
> >
> > When KVM Userfault is enabled, we need to be careful not to map a userfault
> > page in response to a fault on a non-userfault page. In this RFC, I've taken the
> > simplest approach: force new PTEs to be PAGE_SIZE.
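
On the KVM side, that simplest approach amounts to clamping the mapping
level in the fault path, roughly like this (a sketch in x86 kvm_mmu
terms; kvm_userfault_enabled() is a made-up predicate for whatever the
final check ends up being):

/*
 * Never install a huge mapping while KVM Userfault is enabled, so a
 * fault on a non-userfault gfn can never map in a userfault gfn that
 * happens to share the same huge page.
 */
static void kvm_userfault_adjust_mapping_level(struct kvm *kvm,
					       struct kvm_page_fault *fault)
{
	if (!kvm_userfault_enabled(kvm))
		return;

	fault->max_level = PG_LEVEL_4K;
}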
> >
> > --- Page fault notifications ---
> >
> > For page faults generated by vCPUs running in guest mode, if the page the
> > vCPU is trying to access is a userfault-enabled page, we use
>
> Why is it necessary to add the per-page control (with uAPIs for the
> VMM to set/clear)? Are there any functional issues if we just have all
> page faults exit to userspace during the post-copy period?
> - As also mentioned above, userspace can easily tell whether a page
>   still needs to be fetched from the source, so upon a fault exit to
>   userspace, the VMM can decide to block the faulting vCPU thread or
>   return to KVM immediately.
> - If an improvement is really needed to reduce the number of exits to
>   userspace (this would need profiling first), a KVM-internal status
>   (bitmap or xarray) seems sufficient. Each page only needs to exit to
>   userspace once, to fetch its data from the source during post-copy.
>   There doesn't seem to be a need for userspace to re-enable the exit
>   for the page (via a new uAPI), right?

We don't necessarily need a way to go from no-fault -> fault for a
page, that's right[4]. But we do need a way to tell KVM to allow the
access to proceed (i.e., to go from fault -> no-fault). IOW, if we get
a fault and come out to userspace, we need a way to tell KVM not to do
that again. In the case of shared->private conversions, that mechanism
is toggling the memory attributes for a gfn. For conventional
userfaultfd, it's UFFDIO_COPY/CONTINUE/POISON. Maybe I'm
misunderstanding your question.

[4]: It is helpful for poison emulation for HugeTLB-backed VMs today,
but this is not important.
