Message-ID: <Z6u-WdbiW3n7iTjp@google.com>
Date: Tue, 11 Feb 2025 13:17:13 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Nikita Kalyazin <kalyazin@...zon.com>
Cc: pbonzini@...hat.com, corbet@....net, tglx@...utronix.de, mingo@...hat.com, 
	bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, rostedt@...dmis.org, 
	mhiramat@...nel.org, mathieu.desnoyers@...icios.com, kvm@...r.kernel.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-trace-kernel@...r.kernel.org, jthoughton@...gle.com, david@...hat.com, 
	peterx@...hat.com, oleg@...hat.com, vkuznets@...hat.com, gshan@...hat.com, 
	graf@...zon.de, jgowans@...zon.com, roypat@...zon.co.uk, derekmn@...zon.com, 
	nsaenz@...zon.es, xmarcalx@...zon.com
Subject: Re: [RFC PATCH 0/6] KVM: x86: async PF user

On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> Async PF [1] allows other processes to run on a vCPU while the host
> handles a stage-2 fault caused by a process on that vCPU. When using
> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
> because KVM does not run the vCPU while a fault is being handled, so no
> other process can execute on the vCPU. This patch series extends
> VM-exit-based stage-2 fault handling with async PF support by letting
> userspace handle faults instead of the kernel, hence the "async PF user"
> name.
> 
> I discussed the idea with Paolo, Sean, David H, and James H at LPC, and
> the only concern I heard was about injecting the "page not present"
> event via #PF exception in the CoCo case, where it may not work. In my
> implementation, I reused the existing code for doing that, so the async
> PF user implementation is on par with the existing async PF
> implementation in this regard, and support for the CoCo case can be
> added separately.
> 
> Please note that this series is applied on top of the VM-exit-based
> stage-2 fault handling RFC [2].

...

> Nikita Kalyazin (6):
>   Documentation: KVM: add userfault KVM exit flag
>   Documentation: KVM: add async pf user doc
>   KVM: x86: add async ioctl support
>   KVM: trace events: add type argument to async pf
>   KVM: x86: async_pf_user: add infrastructure
>   KVM: x86: async_pf_user: hook to fault handling and add ioctl
> 
>  Documentation/virt/kvm/api.rst  |  35 ++++++
>  arch/x86/include/asm/kvm_host.h |  12 +-
>  arch/x86/kvm/Kconfig            |   7 ++
>  arch/x86/kvm/lapic.c            |   2 +
>  arch/x86/kvm/mmu/mmu.c          |  68 ++++++++++-
>  arch/x86/kvm/x86.c              | 101 +++++++++++++++-
>  arch/x86/kvm/x86.h              |   2 +
>  include/linux/kvm_host.h        |  30 +++++
>  include/linux/kvm_types.h       |   1 +
>  include/trace/events/kvm.h      |  50 +++++---
>  include/uapi/linux/kvm.h        |  12 +-
>  virt/kvm/Kconfig                |   3 +
>  virt/kvm/Makefile.kvm           |   1 +
>  virt/kvm/async_pf.c             |   2 +-
>  virt/kvm/async_pf_user.c        | 197 ++++++++++++++++++++++++++++++++
>  virt/kvm/async_pf_user.h        |  24 ++++
>  virt/kvm/kvm_main.c             |  14 +++
>  17 files changed, 535 insertions(+), 26 deletions(-)

I am supportive of the idea, but there is way too much copy+paste in this series.
And it's not just the code itself, it's all of the structures and concepts.  Off
the top of my head, I can't think of any reason there needs to be a separate
queue, separate lock(s), etc.  The only difference between kernel APF and user
APF is which chunk of code is responsible for faulting in the page.

I suspect a good place to start would be something along the lines of the below
diff, and go from there.  Given that KVM already needs to special case the fake
"wake all" items, I'm guessing it won't be terribly difficult to teach the core
flows about userspace async #PF.

I'm also not sure that injecting async #PF for all userfaults is desirable.  For
in-kernel async #PF, KVM knows that faulting in the memory would sleep.  For
userfaults, KVM has no way of knowing whether the userfault will sleep, i.e.
whether it should be handled via async #PF.  The obvious answer is to have
userspace only enable userspace async #PF when it's useful, but an "all or
nothing" approach isn't great uAPI.  On the flip side, adding uAPI for a use
case that doesn't exist doesn't make sense either :-/
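
For the record, the all-or-nothing flavor is at least trivial to express as a
plain vCPU capability.  Purely illustrative, i.e. KVM_CAP_ASYNC_PF_USER is a
made-up name and number, not proposed uAPI:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Made-up capability number, purely for illustration. */
#define KVM_CAP_ASYNC_PF_USER 999

static int enable_async_pf_user(int vcpu_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ASYNC_PF_USER,
	};

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}

Anything finer-grained than that, e.g. per-range or per-fault opt-in, needs
real uAPI design, which circles back to the missing use case.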

Exiting to userspace in vCPU context is also kludgy.  It makes sense for base
userfault, because the vCPU can't make forward progress until the fault is
resolved.  Actually, I'm not even sure it makes sense there.  I'll follow up in
James' series.  Anyways, it definitely doesn't make sense for async #PF, because
the whole point is to let the vCPU run.  Signalling userspace would definitely
add complexity, but only because of the need to communicate the token and wait
for userspace to consume said token.  I'll think more on that.
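
To make the token handshake concrete (everything below is invented for the
sake of discussion, not proposed uAPI), the minimum that has to cross the
boundary is something like:

/*
 * Hypothetical record KVM would publish (via a per-vCPU ring, a dedicated
 * exit, whatever) when it injects "page not present", and that userspace
 * would echo back once the fault is resolved so that KVM can inject "page
 * ready".  Field names are invented for illustration.
 */
struct kvm_async_pf_user_fault {
	__u64 gpa;	/* guest physical address that faulted */
	__u32 token;	/* identifies the outstanding fault */
	__u32 flags;
};

The annoying part is the lifetime of 'token', i.e. guaranteeing that userspace
can't complete a token that KVM has already recycled.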

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 0ee4816b079a..fc31b47cf9c5 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
  * success, 'false' on failure (page fault has to be handled synchronously).
  */
 bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-                       unsigned long hva, struct kvm_arch_async_pf *arch)
+                       unsigned long hva, struct kvm_arch_async_pf *arch,
+                       bool userfault)
 {
        struct kvm_async_pf *work;
 
@@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
        work->addr = hva;
        work->arch = *arch;
 
-       INIT_WORK(&work->work, async_pf_execute);
-
        list_add_tail(&work->queue, &vcpu->async_pf.queue);
        vcpu->async_pf.queued++;
        work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
 
-       schedule_work(&work->work);
+       if (userfault) {
+               work->userfault = true;
+       } else {
+               INIT_WORK(&work->work, async_pf_execute);
+               schedule_work(&work->work);
+       }
 
        return true;
 }
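
The completion side would need a counterpart, e.g. an ioctl handler along
these lines.  Untested sketch: kvm_async_pf_user_ready() is a made-up name,
and it assumes the work->userfault flag from the diff above.  The idea is to
mirror async_pf_execute() by moving the work item to the 'done' list so that
the existing kvm_check_async_pf_completion() machinery injects "page ready":

static int kvm_async_pf_user_ready(struct kvm_vcpu *vcpu, u32 token)
{
	struct kvm_async_pf *work;

	spin_lock(&vcpu->async_pf.lock);
	list_for_each_entry(work, &vcpu->async_pf.queue, queue) {
		if (work->userfault && work->arch.token == token) {
			/* Mirror async_pf_execute(): queue for completion. */
			list_add_tail(&work->link, &vcpu->async_pf.done);
			spin_unlock(&vcpu->async_pf.lock);

			kvm_arch_async_page_present_queued(vcpu);
			return 0;
		}
	}
	spin_unlock(&vcpu->async_pf.lock);

	return -ENOENT;
}

Note that arch.token is x86-specific, so a real version of the lookup would
need an arch hook or a generic token.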
