linux-kernel - Re: [RFC PATCH 0/6] KVM: x86: async PF user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6eddd049-7c7a-406d-b763-78fa1e7d921b@amazon.com>
Date: Thu, 20 Feb 2025 18:29:01 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <corbet@....net>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
	<hpa@...or.com>, <rostedt@...dmis.org>, <mhiramat@...nel.org>,
	<mathieu.desnoyers@...icios.com>, <kvm@...r.kernel.org>,
	<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-trace-kernel@...r.kernel.org>, <jthoughton@...gle.com>,
	<david@...hat.com>, <peterx@...hat.com>, <oleg@...hat.com>,
	<vkuznets@...hat.com>, <gshan@...hat.com>, <graf@...zon.de>,
	<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
	<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/6] KVM: x86: async PF user

On 19/02/2025 15:17, Sean Christopherson wrote:
> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>> On 11/02/2025 21:17, Sean Christopherson wrote:
>>> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>>> And it's not just the code itself, it's all the structures and concepts.  Off the
>>> top of my head, I can't think of any reason there needs to be a separate queue,
>>> separate lock(s), etc.  The only difference between kernel APF and user APF is
>>> what chunk of code is responsible for faulting in the page.
>>
>> There are two queues involved:
>>   - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
>> if needed.  APF-user does not have a way to "cancel" userspace works, but it
>> uses the queue to look up the struct by the token when userspace reports a
>> completion.
>>   - "ready": stores completed faults until KVM finds a chance to tell guest
>> about them.
>>
>> I agree that the "ready" queue can be shared between APF-kernel and -user as
>> it's used in the same way.  As for the "queue" queue, do you think it's ok
>> to process its elements differently based on the "type" of them in a single
>> loop [1] instead of having two separate queues?
> 
> Yes.
> 
>> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
>>
>>> I suspect a good place to start would be something along the lines of the below
>>> diff, and go from there.  Given that KVM already needs to special case the fake
>>> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
>>> flows about userspace async #PF.
>>
>> That sounds sensible.  I can certainly approach it in a "bottom up" way by
>> sparingly adding handling where it's different in APF-user rather than
>> adding it side by side and trying to merge common parts.
>>
>>> I'm also not sure that injecting async #PF for all userfaults is desirable.  For
>>> in-kernel async #PF, KVM knows that faulting in the memory would sleep.  For
>>> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
>>> be handled via async #PF.  The obvious answer is to have userspace only enable
>>> userspace async #PF when it's useful, but "an all or nothing" approach isn't
>>> great uAPI.  On the flip side, adding uAPI for a use case that doesn't exist
>>> doesn't make sense either :-/
>>
>> I wasn't able to locate the code that would check whether faulting would
>> sleep in APF-kernel.  KVM spins APF-kernel whenever it can ([2]). Please let
>> me know if I'm missing something here.
> 
> kvm_can_do_async_pf() will be reached if and only if faulting in the memory
> requires waiting.  If a page is swapped out, but faulting it back in doesn't
> require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
> then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.
> 
>          /*
>           * If resolving the page failed because I/O is needed to fault-in the
>           * page, then either set up an asynchronous #PF to do the I/O, or if
>           * doing an async #PF isn't possible, retry with I/O allowed.  All
>           * other failures are terminal, i.e. retrying won't help.
>           */
>          if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
>                  return RET_PF_CONTINUE;
> 
>          if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
>                  trace_kvm_try_async_get_page(fault->addr, fault->gfn);
>                  if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
>                          trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
>                          kvm_make_request(KVM_REQ_APF_HALT, vcpu);
>                          return RET_PF_RETRY;
>                  } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
>                          return RET_PF_RETRY;
>                  }
>          }
> 
> The conundrum with userspace async #PF is that if userspace is given only a single
> bit per gfn to force an exit, then KVM won't be able to differentiate between
> "faults" that will be handled synchronously by the vCPU task, and faults that
> usersepace will hand off to an I/O task.  If the fault is handled synchronously,
> KVM will needlessly inject a not-present #PF and a present IRQ.

Right, but from the guest's point of view, async PF means "it will 
probably take a while for the host to get the page, so I may consider 
doing something else in the meantime (ie schedule another process if 
available)".  If we are exiting to userspace, it isn't going to be quick 
anyway, so we can consider all such faults "long" and warranting the 
execution of the async PF protocol.  So always injecting a not-present 
#PF and page ready IRQ doesn't look too wrong in that case.

> But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
> userspace faults are either synchronous or asynchronous.

Yes, pretty much.  The user will be choosing the extreme that is more 
performant for their specific usecase.

>> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>>
>>> Exiting to userspace in vCPU context is also kludgy.  It makes sense for base
>>> userfault, because the vCPU can't make forward progress until the fault is
>>> resolved.  Actually, I'm not even sure it makes sense there.  I'll follow-up in
>>
>> Even though we exit to userspace, in case of APF-user, userspace is supposed
>> to VM enter straight after scheduling the async job, which is then executed
>> concurrently with the vCPU.
>>
>>> James' series.  Anyways, it definitely doesn't make sense for async #PF, because
>>> the whole point is to let the vCPU run.  Signalling userspace would definitely
>>> add complexity, but only because of the need to communicate the token and wait
>>> for userspace to consume said token.  I'll think more on that.
>>
>> By signalling userspace you mean a new non-exit-to-userspace mechanism
>> similar to UFFD?
> 
> Yes.
> 
>> What advantage can you see in it over exiting to userspace (which already exists
>> in James's series)?
> 
> It doesn't exit to userspace :-)
> 
> If userspace simply wakes a different task in response to the exit, then KVM
> should be able to wake said task, e.g. by signalling an eventfd, and resume the
> guest much faster than if the vCPU task needs to roundtrip to userspace.  Whether
> or not such an optimization is worth the complexity is an entirely different
> question though.

This reminds me of the discussion about VMA-less UFFD that was coming up 
several times, such as [1], but AFAIK hasn't materialised into something 
actionable.  I may be wrong, but James was looking into that and 
couldn't figure out a way to scale it sufficiently for his use case and 
had to stick with the VM-exit-based approach.  Can you see a world where 
VM-exit userfaults coexist with no-VM-exit way of handling async PFs?

[1]: https://lore.kernel.org/kvm/ZqwKuzfAs7pvdHAN@x1n/