linux-kernel - Re: [RFC PATCH 0/6] KVM: x86: async PF user

Message-ID: <f820b630-13c1-4164-baa8-f5e8231612d1@amazon.com>
Date: Fri, 21 Feb 2025 11:02:20 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <corbet@....net>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
	<hpa@...or.com>, <rostedt@...dmis.org>, <mhiramat@...nel.org>,
	<mathieu.desnoyers@...icios.com>, <kvm@...r.kernel.org>,
	<linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-trace-kernel@...r.kernel.org>, <jthoughton@...gle.com>,
	<david@...hat.com>, <peterx@...hat.com>, <oleg@...hat.com>,
	<vkuznets@...hat.com>, <gshan@...hat.com>, <graf@...zon.de>,
	<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
	<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/6] KVM: x86: async PF user

On 20/02/2025 18:49, Sean Christopherson wrote:
> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>> userspace will hand off to an I/O task.  If the fault is handled synchronously,
>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>
>> Right, but from the guest's point of view, async PF means "it will probably
>> take a while for the host to get the page, so I may consider doing something
>> else in the meantime (i.e. schedule another process if available)".
> 
> Except in this case, the guest never gets a chance to run, i.e. it can't do
> something else.  From the guest point of view, if KVM doesn't inject what is
> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> long time to execute.

Sorry, I didn't get that.  If userspace learns from
kvm_run::memory_fault::flags that the exit is due to an async PF, it
should call KVM_RUN again immediately, so that the not-present #PF is
injected and the guest is allowed to reschedule.  What do you mean by
"the guest never gets a chance to run"?
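
As a rough illustration of the userspace flow being discussed, here is a
toy Python model (not code from the series).  The flag value, the
MemoryFaultExit type and the handle_memory_fault() helper are all
hypothetical names made up for this sketch; the real flag bit is whatever
the RFC defines in kvm_run::memory_fault::flags.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable

# Hypothetical flag bit, for illustration only.
MEMORY_FAULT_FLAG_ASYNC = 1 << 0


@dataclass
class MemoryFaultExit:
    """Toy stand-in for the memory fault information of a userspace exit."""
    gpa: int
    flags: int


def handle_memory_fault(
    exit_info: MemoryFaultExit,
    io_queue: Queue,
    resolve_sync: Callable[[int], None],
) -> str:
    """Decide what the vCPU task does before it re-enters the guest."""
    if exit_info.flags & MEMORY_FAULT_FLAG_ASYNC:
        # Async case: hand the fault to a separate I/O task and re-enter
        # right away, so the guest gets a chance to schedule something else.
        io_queue.put(exit_info.gpa)
        return "handed_off"
    # Non-async case: the vCPU task resolves the fault itself and only
    # then re-enters the guest.
    resolve_sync(exit_info.gpa)
    return "resolved_inline"
```

In this model the vCPU loop would call KVM_RUN again as soon as
handle_memory_fault() returns, in both cases; the only difference is
whether the fault has been resolved by then.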

>> If we are exiting to userspace, it isn't going to be quick anyway, so we can
>> consider all such faults "long" and warranting the execution of the async PF
>> protocol.  So always injecting a not-present #PF and page ready IRQ doesn't
>> look too wrong in that case.
> 
> There is no "wrong", it's simply wasteful.  The fact that the userspace exit is
> "long" is completely irrelevant.  Decompressing zswap is also slow, but it is
> done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
> #PFs.
> 
> In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
> vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
> of that #PF.

Is this practically likely?  At least in our scenario (Firecracker
snapshot restore) and probably in live migration postcopy, if a vCPU
hits a fault, it's probably because the content of the page is somewhere
remote (e.g. on the source machine or wherever the snapshot data is
stored) and isn't going to be available quickly.  Conversely, if the
page content is available, it must have already been prepopulated into
the guest memory pagecache, in which case the bit in the bitmap is
cleared and no exit to userspace occurs.
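
The per-gfn bitmap behaviour assumed here can be sketched as a toy Python
model (again, not code from the series; the UserfaultBitmap class and its
method names are hypothetical).  A set bit means a fault on that gfn exits
to userspace; prepopulating the page clears the bit.

```python
class UserfaultBitmap:
    """Toy model: one bit per gfn, set means 'fault exits to userspace'."""

    def __init__(self, num_gfns: int) -> None:
        # Start with every gfn marked as not yet populated.
        self._bits = bytearray(b"\xff" * ((num_gfns + 7) // 8))

    def clear(self, gfn: int) -> None:
        """Mark a gfn as populated, so later faults on it do not exit."""
        self._bits[gfn // 8] &= ~(1 << (gfn % 8)) & 0xFF

    def needs_exit(self, gfn: int) -> bool:
        """Return True if a fault on this gfn would exit to userspace."""
        return bool(self._bits[gfn // 8] & (1 << (gfn % 8)))
```

Under that assumption, an exit only ever happens for pages whose content
has not been fetched yet, which is the "slow" case.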

>>>> What advantage can you see in it over exiting to userspace (which already exists
>>>> in James's series)?
>>>
>>> It doesn't exit to userspace :-)
>>>
>>> If userspace simply wakes a different task in response to the exit, then KVM
>>> should be able to wake said task, e.g. by signalling an eventfd, and resume the
>>> guest much faster than if the vCPU task needs to roundtrip to userspace.  Whether
>>> or not such an optimization is worth the complexity is an entirely different
>>> question though.
>>
>> This reminds me of the discussion about VMA-less UFFD that was coming up
>> several times, such as [1], but AFAIK hasn't materialised into something
>> actionable.  I may be wrong, but James was looking into that and couldn't
>> figure out a way to scale it sufficiently for his use case and had to stick
>> with the VM-exit-based approach.  Can you see a world where VM-exit
>> userfaults coexist with no-VM-exit way of handling async PFs?
> 
> The issue with UFFD is that it's difficult to provide a generic "point of contact",
> whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
> per-vCPU buffers/structures to aid communication.
> 
> That said, supporting "exitless" KVM userfault would most definitely be premature
> optimization without strong evidence it would benefit a real world use case.

Does that mean that the "exitless" solution for async PF is a long-term
one (if required), while the short-term one would still be "exitful" (if
we find a way to do it sensibly)?