Message-ID: <a7080c07-0fc5-45ce-92f7-5f432a67bc63@amazon.com>
Date: Wed, 12 Feb 2025 18:14:09 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <corbet@....net>, <tglx@...utronix.de>,
<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
<hpa@...or.com>, <rostedt@...dmis.org>, <mhiramat@...nel.org>,
<mathieu.desnoyers@...icios.com>, <kvm@...r.kernel.org>,
<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-trace-kernel@...r.kernel.org>, <jthoughton@...gle.com>,
<david@...hat.com>, <peterx@...hat.com>, <oleg@...hat.com>,
<vkuznets@...hat.com>, <gshan@...hat.com>, <graf@...zon.de>,
<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/6] KVM: x86: async PF user
On 11/02/2025 21:17, Sean Christopherson wrote:
> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>> Async PF [1] allows other processes to run on a vCPU while the host
>> handles a stage-2 fault caused by a process on that vCPU. When using
>> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
>> because KVM does not run the vCPU while a fault is being handled, so no
>> other process can execute on the vCPU. This patch series extends
>> VM-exit-based stage-2 fault handling with async PF support by letting
>> userspace handle faults instead of the kernel, hence the "async PF user"
>> name.
>>
>> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
>> and the only concern I heard was about injecting the "page not present"
>> event via #PF exception in the CoCo case, where it may not work. In my
>> implementation, I reused the existing code for doing that, so the async
>> PF user implementation is on par with the present async PF
>> implementation in this regard, and support for the CoCo case can be
>> added separately.
>>
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> ...
>
>> Nikita Kalyazin (6):
>> Documentation: KVM: add userfault KVM exit flag
>> Documentation: KVM: add async pf user doc
>> KVM: x86: add async ioctl support
>> KVM: trace events: add type argument to async pf
>> KVM: x86: async_pf_user: add infrastructure
>> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>>
>> Documentation/virt/kvm/api.rst | 35 ++++++
>> arch/x86/include/asm/kvm_host.h | 12 +-
>> arch/x86/kvm/Kconfig | 7 ++
>> arch/x86/kvm/lapic.c | 2 +
>> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
>> arch/x86/kvm/x86.c | 101 +++++++++++++++-
>> arch/x86/kvm/x86.h | 2 +
>> include/linux/kvm_host.h | 30 +++++
>> include/linux/kvm_types.h | 1 +
>> include/trace/events/kvm.h | 50 +++++---
>> include/uapi/linux/kvm.h | 12 +-
>> virt/kvm/Kconfig | 3 +
>> virt/kvm/Makefile.kvm | 1 +
>> virt/kvm/async_pf.c | 2 +-
>> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
>> virt/kvm/async_pf_user.h | 24 ++++
>> virt/kvm/kvm_main.c | 14 +++
>> 17 files changed, 535 insertions(+), 26 deletions(-)
>
> I am supportive of the idea, but there is way too much copy+paste in this series.
Hi Sean,
Yes, as I mentioned in the cover letter, I left the new implementation
isolated on purpose to make the scope of the change clear. There is
certainly a lot of duplication that should be removed later on.
> And it's not just the code itself, it's all the structures and concepts. Off the
> top of my head, I can't think of any reason there needs to be a separate queue,
> separate lock(s), etc. The only difference between kernel APF and user APF is
> what chunk of code is responsible for faulting in the page.
There are two queues involved:
 - "queue": stores in-flight faults. APF-kernel uses it to cancel all
work items if needed. APF-user has no way to "cancel" userspace work
items, but it uses the queue to look up the struct by the token when
userspace reports a completion.
 - "ready": stores completed faults until KVM finds a chance to tell
the guest about them.
I agree that the "ready" queue can be shared between APF-kernel and
-user as it's used in the same way. As for the "queue" queue, do you
think it's ok to process its elements differently based on their type
in a single loop [1] instead of having two separate queues?
[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
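To make the single-queue idea concrete, here is a minimal user-space sketch
(not actual KVM code; the struct layout, names, and flag are all illustrative)
of keeping kernel- and userspace-handled items on one list, distinguished by a
per-item flag, with the completion path only matching userspace-type entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: a single in-flight "queue" holds both kernel-
 * and userspace-handled async PF work items, distinguished by a
 * per-item flag instead of two separate queues. */
struct apf_work {
	unsigned int token;	/* token shared with the guest/userspace */
	bool userfault;		/* true: completion reported by userspace */
	struct apf_work *next;
};

/* Completion lookup: userspace reports a token, and only
 * userspace-type entries on the shared list are candidates;
 * kernel-type entries are skipped. */
static struct apf_work *find_user_work(struct apf_work *head,
				       unsigned int token)
{
	struct apf_work *w;

	for (w = head; w; w = w->next)
		if (w->userfault && w->token == token)
			return w;
	return NULL;
}
```

The same walk could branch the other way for the flush path (cancel kernel
works, only unlink userspace ones), which is what a type-dispatching single
loop would look like in practice.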
> I suspect a good place to start would be something along the lines of the below
> diff, and go from there. Given that KVM already needs to special case the fake
> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> flows about userspace async #PF.
That sounds sensible. I can certainly approach it "bottom up",
sparingly adding APF-user-specific handling where it differs from
APF-kernel, rather than building it side by side and merging the
common parts afterwards.
> I'm also not sure that injecting async #PF for all userfaults is desirable. For
> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> be handled via async #PF. The obvious answer is to have userspace only enable
> userspace async #PF when it's useful, but "an all or nothing" approach isn't
> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> doesn't make sense either :-/
I wasn't able to locate the code in APF-kernel that checks whether
faulting in the memory would sleep. KVM spins up APF-kernel whenever
it can ([2]).
Please let me know if I'm missing something here.
[2]
https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> userfault, because the vCPU can't make forward progress until the fault is
> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
Even though we exit to userspace, in the APF-user case userspace is
supposed to re-enter the VM straight after scheduling the async job,
which then executes concurrently with the vCPU.
> James' series. Anyways, it definitely doesn't make sense for async #PF, because
> the whole point is to let the vCPU run. Signalling userspace would definitely
> add complexity, but only because of the need to communicate the token and wait
> for userspace to consume said token. I'll think more on that.
By signalling userspace, do you mean a new non-exit-to-userspace
mechanism similar to UFFD? What advantage do you see in it over
exiting to userspace (which already exists in James's series)?
Thanks,
Nikita
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 0ee4816b079a..fc31b47cf9c5 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> * success, 'false' on failure (page fault has to be handled synchronously).
> */
> bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - unsigned long hva, struct kvm_arch_async_pf *arch)
> + unsigned long hva, struct kvm_arch_async_pf *arch,
> + bool userfault)
> {
> struct kvm_async_pf *work;
>
> @@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->addr = hva;
> work->arch = *arch;
>
> - INIT_WORK(&work->work, async_pf_execute);
> -
> list_add_tail(&work->queue, &vcpu->async_pf.queue);
> vcpu->async_pf.queued++;
> work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
>
> - schedule_work(&work->work);
> + if (userfault) {
> + work->userfault = true;
> + } else {
> + INIT_WORK(&work->work, async_pf_execute);
> + schedule_work(&work->work);
> + }
>
> return true;
> }