Message-ID: <a7080c07-0fc5-45ce-92f7-5f432a67bc63@amazon.com>
Date: Wed, 12 Feb 2025 18:14:09 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <corbet@....net>, <tglx@...utronix.de>,
<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
<hpa@...or.com>, <rostedt@...dmis.org>, <mhiramat@...nel.org>,
<mathieu.desnoyers@...icios.com>, <kvm@...r.kernel.org>,
<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-trace-kernel@...r.kernel.org>, <jthoughton@...gle.com>,
<david@...hat.com>, <peterx@...hat.com>, <oleg@...hat.com>,
<vkuznets@...hat.com>, <gshan@...hat.com>, <graf@...zon.de>,
<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/6] KVM: x86: async PF user
On 11/02/2025 21:17, Sean Christopherson wrote:
> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>> Async PF [1] allows other processes to run on a vCPU while the host
>> handles a stage-2 fault caused by a process on that vCPU. When using
>> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
>> because KVM does not run the vCPU while a fault is being handled, so no
>> other process can execute on the vCPU. This patch series extends
>> VM-exit-based stage-2 fault handling with async PF support by letting
>> userspace handle faults instead of the kernel, hence the "async PF user"
>> name.
>>
>> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
>> and the only concern I heard was about injecting the "page not present"
>> event via #PF exception in the CoCo case, where it may not work. In my
>> implementation, I reused the existing code for doing that, so the async
>> PF user implementation is on par with the present async PF
>> implementation in this regard, and support for the CoCo case can be
>> added separately.
>>
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> ...
>
>> Nikita Kalyazin (6):
>> Documentation: KVM: add userfault KVM exit flag
>> Documentation: KVM: add async pf user doc
>> KVM: x86: add async ioctl support
>> KVM: trace events: add type argument to async pf
>> KVM: x86: async_pf_user: add infrastructure
>> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>>
>> Documentation/virt/kvm/api.rst | 35 ++++++
>> arch/x86/include/asm/kvm_host.h | 12 +-
>> arch/x86/kvm/Kconfig | 7 ++
>> arch/x86/kvm/lapic.c | 2 +
>> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
>> arch/x86/kvm/x86.c | 101 +++++++++++++++-
>> arch/x86/kvm/x86.h | 2 +
>> include/linux/kvm_host.h | 30 +++++
>> include/linux/kvm_types.h | 1 +
>> include/trace/events/kvm.h | 50 +++++---
>> include/uapi/linux/kvm.h | 12 +-
>> virt/kvm/Kconfig | 3 +
>> virt/kvm/Makefile.kvm | 1 +
>> virt/kvm/async_pf.c | 2 +-
>> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
>> virt/kvm/async_pf_user.h | 24 ++++
>> virt/kvm/kvm_main.c | 14 +++
>> 17 files changed, 535 insertions(+), 26 deletions(-)
>
> I am supportive of the idea, but there is way too much copy+paste in this series.
Hi Sean,
Yes, as I mentioned in the cover letter, I left the new implementation
isolated on purpose to make the scope of the change clear. There is
certainly a lot of duplication that should be removed later on.
> And it's not just the code itself, it's all the structures and concepts. Off the
> top of my head, I can't think of any reason there needs to be a separate queue,
> separate lock(s), etc. The only difference between kernel APF and user APF is
> what chunk of code is responsible for faulting in the page.
There are two queues involved:
 - "queue": stores in-flight faults. APF-kernel uses it to cancel all
work items if needed. APF-user has no way to "cancel" userspace work
items, but it uses the queue to look up the struct by the token when
userspace reports a completion.
 - "ready": stores completed faults until KVM finds a chance to tell
the guest about them.
I agree that the "ready" queue can be shared between APF-kernel and
-user as it's used in the same way. As for the "queue" queue, do you
think it's ok to process its elements differently based on their type
in a single loop [1] instead of having two separate queues?
[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
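To make the single-queue idea concrete, here is a minimal user-space sketch
(not actual KVM code; the struct layout, names, and flag are all illustrative)
of keeping kernel- and userspace-handled items on one list, distinguished by a
per-item flag, with the completion path only matching userspace-type entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: a single in-flight "queue" holds both kernel-
 * and userspace-handled async PF work items, distinguished by a
 * per-item flag instead of two separate queues. */
struct apf_work {
	unsigned int token;	/* token shared with the guest/userspace */
	bool userfault;		/* true: completion reported by userspace */
	struct apf_work *next;
};

/* Completion lookup: userspace reports a token, and only
 * userspace-type entries on the shared list are candidates;
 * kernel-type entries are skipped. */
static struct apf_work *find_user_work(struct apf_work *head,
				       unsigned int token)
{
	struct apf_work *w;

	for (w = head; w; w = w->next)
		if (w->userfault && w->token == token)
			return w;
	return NULL;
}
```

The same walk could branch the other way for the flush path (cancel kernel
works, only unlink userspace ones), which is what a type-dispatching single
loop would look like in practice.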
> I suspect a good place to start would be something along the lines of the below
> diff, and go from there. Given that KVM already needs to special case the fake
> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> flows about userspace async #PF.
That sounds sensible. I can certainly approach it "bottom up",
sparingly adding APF-user-specific handling where it differs from
APF-kernel, rather than building it side by side and merging the
common parts afterwards.
> I'm also not sure that injecting async #PF for all userfaults is desirable. For
> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> be handled via async #PF. The obvious answer is to have userspace only enable
> userspace async #PF when it's useful, but "an all or nothing" approach isn't
> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> doesn't make sense either :-/
I wasn't able to locate the code in APF-kernel that checks whether
faulting in the memory would sleep. KVM spins up APF-kernel whenever
it can ([2]).
Please let me know if I'm missing something here.
[2]
https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> userfault, because the vCPU can't make forward progress until the fault is
> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
Even though we exit to userspace, in the APF-user case userspace is
supposed to re-enter the VM straight after scheduling the async job,
which then executes concurrently with the vCPU.
> James' series. Anyways, it definitely doesn't make sense for async #PF, because
> the whole point is to let the vCPU run. Signalling userspace would definitely
> add complexity, but only because of the need to communicate the token and wait
> for userspace to consume said token. I'll think more on that.
By signalling userspace, do you mean a new non-exit-to-userspace
mechanism similar to UFFD? What advantage do you see in it over
exiting to userspace (which already exists in James's series)?
Thanks,
Nikita
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 0ee4816b079a..fc31b47cf9c5 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> * success, 'false' on failure (page fault has to be handled synchronously).
> */
> bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - unsigned long hva, struct kvm_arch_async_pf *arch)
> + unsigned long hva, struct kvm_arch_async_pf *arch,
> + bool userfault)
> {
> struct kvm_async_pf *work;
>
> @@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->addr = hva;
> work->arch = *arch;
>
> - INIT_WORK(&work->work, async_pf_execute);
> -
> list_add_tail(&work->queue, &vcpu->async_pf.queue);
> vcpu->async_pf.queued++;
> work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
>
> - schedule_work(&work->work);
> + if (userfault) {
> + work->userfault = true;
> + } else {
> + INIT_WORK(&work->work, async_pf_execute);
> + schedule_work(&work->work);
> + }
>
> return true;
> }