linux-kernel - Re: [PATCH v2 1/4] mm: Introduce vm_uffd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a1a48a0e-62d3-48d0-b9c2-492eb190b99f@amazon.com>
Date: Thu, 18 Sep 2025 18:15:41 +0100
From: Nikita Kalyazin <kalyazin@...zon.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>, Mike Rapoport
	<rppt@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Peter Xu
	<peterx@...hat.com>, David Hildenbrand <david@...hat.com>, Suren Baghdasaryan
	<surenb@...gle.com>, <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
	Vlastimil Babka <vbabka@...e.cz>, Muchun Song <muchun.song@...ux.dev>, "Hugh
 Dickins" <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
	"James Houghton" <jthoughton@...gle.com>, Michal Hocko <mhocko@...e.com>,
	"Andrea Arcangeli" <aarcange@...hat.com>, Oscar Salvador <osalvador@...e.de>,
	"Axel Rasmussen" <axelrasmussen@...gle.com>, Ujwal Kundur
	<ujwal.kundur@...il.com>
Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API



On 18/09/2025 17:47, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@...nel.org> [250918 04:37]:
>> On Wed, Sep 17, 2025 at 12:53:05PM -0400, Liam R. Howlett wrote:
>>> * Mike Rapoport <rppt@...nel.org> [250917 05:26]:
>>>> Hi Liam,
>>>>
>>>> On Mon, Sep 08, 2025 at 12:53:37PM -0400, Liam R. Howlett wrote:
>>>>>
>>>>> Reading through the patches, I'm not entirely sure what you are
>>>>> proposing.
>>>>>
>>>>> What I was hoping to see by a generalization of the memory types is a
>>>>> much simpler shared code base until the code hit memory type specific
>>>>> areas where a function pointer could be used to keep things from getting
>>>>> complicated (or, I guess a switch statement..).
>>>>>
>>>>> What we don't want is non-mm code specifying values for the function
>>>>> pointer and doing what they want, or a function pointer that returns a
>>>>> core mm resource (in the old example this was a vma, here it is a
>>>>> folio).
>>>>>
>>>>>  From this patch set:
>>>>> +        * Return: zero if succeeded, negative for errors.
>>>>> +        */
>>>>> +       int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
>>>>> +                             struct folio **folio);
>>>>>
>>>>> This is one of the contention points in the current scenario as the
>>>>> folio would be returned.
>>>>
>>>> I don't see a problem with it. It's not any different from
>>>> vma_ops->fault(): a callback for a filesystem to get a folio that will be
>>>> mapped afterwards by the mm code.
>>>>
>>>
>>> I disagree, the filesystem vma_ops->fault() is not a config option like
>>> this one.  So we are on a path to enable uffd by default, and it really
>>> needs work beyond this series.  Setting up a list head and passing in
>>> through every call stack is far from idea.
>>
>> I don't follow you here. How addition of uffd callbacks guarded by a config
>> option to vma_ops leads to enabling uffd by by default?
> 
> Any new memory type that uses the above interface now needs uffd
> enabled, anyone planning to use guest_memfd needs it enabled, anyone
> able to get a module using this interface needs it enabled (by whoever
> gives them the kernel they use).  Kernel provides now need to enable
> UFFD - which is different than the example provided.
> 
>>
>>> I also think the filesystem model is not one we want to duplicate in mm
>>> for memory types - think of the test issues we have now and then have a
>>> look at the xfstests support of filesystems [1].
>>>
>>> So we are on a path of less test coverage, and more code that is
>>> actually about mm that is outside of mm.  So, is there another way?
>>
>> There are quite a few vma_ops outside fs/ not covered by xfstest, so the
>> test coverage argument is moot at best.
>> And anything in the kernel can grab a folio and do whatever it pleases.
> 
> Testing filesystems is nothing short of a nightmare and I don't want mm
> to march happily towards that end.  This interface is endlessly flexible
> and thus endlessly broken and working at the same time.
> 
> IOW, we have given anyone wanting to implement a new memory type
> infinite freedoms to run afoul, but they won't be looking for those
> people when things go horribly wrong - they will most likely see a
> memory issue and come here. syzbot will see a hang on some mm lock in an
> unrelated task, or whatever.
> 
> I would rather avoid the endlessly flexible interface to avoid incorrect
> uses in favour of a limited selection of choices, that could be expanded
> if necessary, but would be more visible to the mm people going in.  That
> is, people can add new memory types through adding them to mm/ instead
> of in driver/ or out of tree.
> 
> I could very much see someone looking to use this for a binder-type
> driver and that might work out really well!  But I don't want someone
> doing it and shoving the folio pointer in a custom struct because they
> *know* it's fine, so what's the big deal?  I don't mean to pick on
> binder, but this example comes to mind.
> 
>>
>> Nevertheless, let's step back for a second and instead focus on the problem
>> these patches are trying to solve, which is to allow guest_memfd implement
>> UFFD_CONTINUE (or minor fault in other terminology).
> 
> Well, this is about modularizing memory types, but the first user is
> supposed to be the guest-memfd support.
> 
>>
>> This means uffd should be able to map a folio that's already in
>> guest_memfd page cache to the faulted address. Obviously, the page table
>> update happens in uffd. But it still has to find what to map and we need
>> some way to let guest_memfd tell that to uffd.
>>
>> So we need a hook somewhere that will return a folio matching pgoff in
>> vma->file->inode.
>>
>> Do you see a way to implement it otherwise?
> 
> I must be missing something.
> 
> UFFDIO_CONTINUE currently enters through an ioctl that calls
> userfaultfd_continue() -> mfill_atomic_continue()... mfill_atomic() gets
> and uses the folio to actually do the work.  Right now, we don't hand
> out the folio, so what is different here?
> 
> I am under the impression that we don't need to return the folio, but
> may need to do work on it.  That is, we can give the mm side what it
> needs to call the related memory type functions to service the request.
> 
> For example, one could pass in the inode, pgoff, and memory type and the
> mm code could then call the fault handler for that memory type?
> 
> I didn't think Nikita had a folio returned in his first three patches
> [1], but then they built on other patches and it was difficult to follow
> along.  Is it because that interface was agreed on in a call on 23 Jan
> 2025 [2], as somewhat unclearly stated in [1]?

I believe you can safely ignore what was discussed in [2] as it is 
irrelevant to this discussion.  That was just reasoning why it was 
possible to use UserfaultFD for guest_memfd as opposed to inventing an 
alternative solution to handling faults in userspace.

Regarding returning a folio, [1] was calling vm_ops->fault() in 
UserfaultFD code.  The fault() itself gets a folio (at least in 
guest_memfd implementation [3]).  Does it look like a preferable 
solution to you?

The other patches it I was building on top were mmap support in 
guest_memfd [4], which is currently merged in kvm/next, and also part of 
[3].

[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/david/linux.git/tree/virt/kvm/guest_memfd.c?id=911634bac3107b237dcd8fdcb6ac91a22741cbe7#n347
[4] https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com

> 
> Thanks,
> Liam
> 
> [1].  https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
> [2].  https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
>