Message-ID: <2afda7a3-3c48-44e4-b462-49e0d223208b@amazon.com>
Date: Thu, 4 Dec 2025 17:27:04 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>, Peter Xu
	<peterx@...hat.com>
CC: Mike Rapoport <rppt@...nel.org>, <linux-mm@...ck.org>, Andrea Arcangeli
	<aarcange@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>, "Axel
 Rasmussen" <axelrasmussen@...gle.com>, Baolin Wang
	<baolin.wang@...ux.alibaba.com>, Hugh Dickins <hughd@...gle.com>, "James
 Houghton" <jthoughton@...gle.com>, "Liam R. Howlett"
	<Liam.Howlett@...cle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Michal Hocko <mhocko@...e.com>, Paolo Bonzini <pbonzini@...hat.com>, "Sean
 Christopherson" <seanjc@...gle.com>, Shuah Khan <shuah@...nel.org>, "Suren
 Baghdasaryan" <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
	<linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor
 mode



On 03/12/2025 10:03, Nikita Kalyazin wrote:
> On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
>> On 12/2/25 12:50, Nikita Kalyazin wrote:
>>>
>>>
>>> On 01/12/2025 20:57, Peter Xu wrote:
>>>> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>>>>
>>>>>
>>>>> On 01/12/2025 18:35, Peter Xu wrote:
>>>>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>>>>> I believe I found the precise point where we convinced ourselves that
>>>>>>> minor support was sufficient: [1].  If at this moment we don't find that
>>>>>>> reasoning valid anymore, then indeed implementing missing is the only
>>>>>>> option.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>>>>
>>>>>> Now after I re-read the discussion, I may have made a wrong statement
>>>>>> there, sorry.  I could have got slightly confused on when the write()
>>>>>> syscall can be involved.
>>>>>>
>>>>>> I agree if you want to get an event when cache missed with the current
>>>>>> uffd definitions and when pre-population is forbidden, then MISSING trap
>>>>>> is required.  That is, with/without the need of UFFDIO_COPY being
>>>>>> available.
>>>>>>
>>>>>> Do I understand it right that UFFDIO_COPY is not allowed in your case,
>>>>>> but only write()?
>>>>>
>>>>> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
>>>>> whenever we resolve stage-2 faults as they aren't visible to UFFD.  When a
>>>>> userfault occurs at an offset that already has a page in the cache, we
>>>>> will have to keep using UFFDIO_CONTINUE so it looks like both will be
>>>>> required:
>>>>>
>>>>>    - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
>>>>>      userspace PT)
>>>>>    - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
>>>>>    - stage-2 fault -> write() (only fills the cache)
>>>>
>>>> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?
>>>
>>> Yes, that's the one ([1]).
>>>
>>> [1] https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@...gle.com
>>>
>>>>
>>>> It looks fine indeed, but it looks slightly weird then, as you'll have two
>>>> ways to populate the page cache.  Logically here atomicity is indeed not
>>>> needed when you trap both MISSING + MINOR.
>>>
>>> I reran the test based on the UFFDIO_COPY prototype I had using your
>>> series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
>>> 237 vs 202 ms (+17%).  Even though UFFDIO_COPY alone is functionally
>>> sufficient, I would prefer to have an option to use write() where
>>> possible and only falling back to UFFDIO_COPY for userspace faults to
>>> have better performance.
>>
>> Just so I understand correctly: we could even do without UFFDIO_COPY for
>> that scenario by using write() + minor faults?
> 
> We still need major fault notifications as well (which we were 
> accidentally generating until this version).  But we can resolve them 
> with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.

We had a conversation about that at the guest_memfd sync today:

Q: Is it possible from the API point of view to support MISSING 
notifications without supporting UFFDIO_COPY?

A: The manpage [1] says on UFFDIO_REGISTER_MODE_MISSING that "the page 
fault is resolved from user-space by either an UFFDIO_COPY or an 
UFFDIO_ZEROPAGE ioctl", but I don't think it's actually enforced 
anywhere in the code.

Q: UFFDIO_COPY is supposed to provide atomic semantics, while write() + 
UFFDIO_CONTINUE does not. Is it a problem?

A: It isn't a problem for the particular Firecracker use case because 1)
vCPUs can be prevented from seeing partially populated pages in the
cache via KVM userfault intercept [2] and 2) we do not use other
userspace mappings.  However, as James pointed out, in the general case,
other actors may observe partially populated pages via other userspace
mappings.

[1] https://man7.org/linux/man-pages/man2/userfaultfd.2.html
[2] https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com
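
For illustration only, the resolution scheme discussed above (write() +
UFFDIO_CONTINUE in place of UFFDIO_COPY) could be sketched roughly as
follows. The enum, the step bitmask and both function names are
hypothetical and not taken from any series. The sketch also assumes that
write() on a guest_memfd and userfaultfd minor/missing events on
guest_memfd behave as proposed in this thread, none of which is assumed
to be upstream. Only pwrite(2) and the UFFDIO_CONTINUE ioctl with
struct uffdio_continue are existing interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical classification of the three fault sources in the thread. */
enum fault_kind {
	FAULT_USER_MISSING,	/* userspace mapping, no page in the cache */
	FAULT_USER_MINOR,	/* userspace mapping, page already in the cache */
	FAULT_STAGE2,		/* KVM stage-2 fault, not visible to userfaultfd */
};

/* Hypothetical bitmask of the steps a handler has to perform. */
enum {
	STEP_WRITE    = 1 << 0,	/* fill the page cache with write() */
	STEP_CONTINUE = 1 << 1,	/* set up the userspace PT with UFFDIO_CONTINUE */
};

/* Map a fault kind to the steps needed when UFFDIO_COPY is not used. */
int steps_for(enum fault_kind kind)
{
	switch (kind) {
	case FAULT_USER_MISSING:
		return STEP_WRITE | STEP_CONTINUE;
	case FAULT_USER_MINOR:
		return STEP_CONTINUE;
	case FAULT_STAGE2:
		return STEP_WRITE;
	}
	return 0;
}

/*
 * Resolve a single page.  Assumptions: gmem_fd is a guest_memfd that accepts
 * write(), uffd is a userfaultfd registered on the userspace mapping, and
 * uaddr is the page-aligned faulting address that corresponds to offset.
 * Returns 0 on success and -1 on failure.
 */
int resolve_fault(enum fault_kind kind, int gmem_fd, int uffd,
		  const void *src, off_t offset, uint64_t uaddr,
		  size_t page_size)
{
	int steps = steps_for(kind);

	assert(page_size != 0);

	if (steps & STEP_WRITE) {
		/* Treat a short write as a failure to keep the sketch simple. */
		ssize_t n = pwrite(gmem_fd, src, page_size, offset);

		if (n != (ssize_t)page_size)
			return -1;
	}

	/*
	 * Not atomic: between the write above and the ioctl below the page
	 * is already in the cache, so other userspace mappings could see it.
	 */
	if (steps & STEP_CONTINUE) {
		struct uffdio_continue cont = {
			.range = { .start = uaddr, .len = page_size },
			.mode = 0,
		};

		/* EEXIST means the range was already mapped, e.g. by a racing fault. */
		if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1 && errno != EEXIST)
			return -1;
	}

	return 0;
}
```

A handler would call resolve_fault() with FAULT_USER_MISSING or
FAULT_USER_MINOR for events read from the userfaultfd, and with
FAULT_STAGE2 for KVM userfault exits.  The missing-fault path takes two
syscalls instead of the single UFFDIO_COPY, which is the trade-off
discussed in the quoted text below.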

> 
>>
>> But what you are saying is that there might be a performance benefit in
>> using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault
>> overhead?
> 
> UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a 
> single syscall instead of two, but the amount of userspace faults, at 
> least in our scenario, is negligible compared to the amount of stage-2 
> faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY 
> if it can be avoided.
> 
>>
>> -- 
>> Cheers
>>
>> David
> 

