Message-ID: <c29bd0cb-0907-4c5b-8244-83bbb256d964@redhat.com>
Date: Thu, 8 May 2025 17:37:04 +0200
From: David Hildenbrand <david@...hat.com>
To: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Artem Krupotkin <artem.k@...sung.com>,
Charles Briere <c.briere@...sung.com>,
Wade Farnsworth <wade.farnsworth@...mens.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated pages
On 08.05.25 17:23, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 17:03:46 +0200
> David Hildenbrand <david@...hat.com> wrote:
>
> Hi there,
>
>> On 07.05.25 17:41, Pantelis Antoniou wrote:
>>
>> Hi,
>>
>>> Recent updates to net filesystems enabled zero copy operations,
>>> which require getting a user space page pinned.
>>>
>>> This does not work for pages that were allocated via
>>> __get_user_pages and then mapped to user-space via remap_pfn_range.
>>
>> Right. Because the struct page of a VM_PFNMAP *must not be touched*.
>> It has to be treated like it doesn't exist.
>>
>
> Well, that's not exactly the case. For pages mapped to user space via
> remap_pfn_range() the VM_PFNMAP bit is set even though the pages do
> have a struct page.
Yes. And VM_PFNMAP is the big flag saying that these pages shall not be
touched. Even if the struct page exists. Even if you say please. :)
See the comment above vm_normal_page():
"Special" mappings do not wish to be associated with a "struct
page" (either it doesn't exist, or it exists but they don't want
to touch it)
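For illustration, a heavily simplified sketch of that rule (the real
vm_normal_page() in mm/memory.c also deals with pte_special(), CoW
mappings of PFNMAP areas and more -- this is just the gist):

#include <linux/mm.h>

/*
 * Heavily simplified illustration -- not the actual mm/memory.c code.
 * The point: in a VM_PFNMAP mapping the PFN behind a PTE is treated as
 * having no usable struct page, even when pfn_valid() would be true.
 */
static struct page *vm_normal_page_sketch(struct vm_area_struct *vma,
					  unsigned long addr, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	if (vma->vm_flags & VM_PFNMAP)
		return NULL;		/* "special": pretend no struct page */
	if (is_zero_pfn(pfn))
		return NULL;		/* shared zeropage is never handed out */
	return pfn_to_page(pfn);	/* the normal, refcounted case */
}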
VM_MIXEDMAP could maybe be used for that purpose; possibly GUP also has
to be updated to make use of that. (I was hoping we could get rid of
VM_MIXEDMAP at some point.)
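Something along these lines would keep the pages "normal" and GUP-able
(a rough, untested sketch; my_bo/my_bo_mmap are made-up names for
illustration, not the existing DRM helpers):

#include <linux/mm.h>

struct my_bo {
	struct page **pages;	/* backing pages, e.g. from the page allocator */
	unsigned long npages;
};

static int my_bo_mmap(struct my_bo *bo, struct vm_area_struct *vma)
{
	unsigned long i, uaddr = vma->vm_start;
	int ret;

	if (vma_pages(vma) > bo->npages)
		return -EINVAL;

	for (i = 0; i < vma_pages(vma); i++, uaddr += PAGE_SIZE) {
		/*
		 * Inserts a refcounted struct page and marks the VMA
		 * VM_MIXEDMAP, so vm_normal_page()/GUP keep working.
		 */
		ret = vm_insert_page(vma, uaddr, bo->pages[i]);
		if (ret)
			return ret;
	}
	return 0;
}

Whether that is acceptable for the DRM side is a different question, of
course.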
>
> The details of how it happens are at the cover page of this patch but
> let me paste the relevant bits here.
>
> "In our emulation environment we have noticed failing writes when
> performing I/O from a userspace mapped DRM GEM buffer object.
> The platform does not use VRAM, all graphics memory is regular DRAM
> memory, allocated via __get_free_pages
>
> The same write was successful from a heap allocated bounce buffer.
>
> The sequence of events is as follows.
>
> 1. A BO (Buffer Object) is created, and its backing memory is allocated via
> __get_user_pages()
__get_user_pages() only allocates memory via page faults. Are you sure
you meant __get_user_pages() here?
>
> 2. Userspace mmaps a BO (Buffer Object) via a mmap call on the opened
> file handle of a DRM driver. The mapping is done via the
> drm_gem_mmap_obj() call.
>
> 3. Userspace issues a write to a file copying the contents of the BO.
>
> 3a. If the file is located on regular filesystem (like ext4), the write
> completes successfully.
>
> 3b. If the file is located on a network filesystem, like 9p the write fails.
>
> The write fails because v9fs_file_write_iter() will call
> netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked() which will
> call netfs_extract_user_iter()
>
> netfs_extract_user_iter() will in turn call iov_iter_extract_pages() which for
> a user backed iterator will call iov_iter_extract_user_pages which will call
> pin_user_pages_fast() which finally will call __gup_longterm_locked().
>
> __gup_longterm_locked() will call __get_user_pages_locked() which will fail
> because the VMA is marked with the VM_IO and VM_PFNMAP flags."
So, drm_gem_mmap_obj() ends up using remap_pfn_range()?
I can spot that drm_gem_mmap_obj() has a path where it explicitly sets
vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
Which is a clear sign to core-MM (incl. GUP) to never mess with the
mapped pages.
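Roughly, the check that fires here looks like this (a simplified sketch,
not the literal mm/gup.c code):

#include <linux/mm.h>

/*
 * Simplified sketch of the VMA-level check GUP's slow path applies: any
 * mapping carrying VM_IO or VM_PFNMAP is refused, so pin_user_pages_fast()
 * on such a mapping fails before the individual PTEs are even considered.
 */
static int gup_check_vma_sketch(struct vm_area_struct *vma)
{
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;
	return 0;
}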
>
>>>
>>> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
>>> VM_PFNMAP in particular mark the pages as not having struct_page
>>> associated with them, which is not the case for __get_user_pages()
>>>
>>> This in turn makes any attempt to lock a page fail, breaking
>>> I/O from that address range.
>>>
>>> This patch addresses it by special-casing pages in those VMAs and not
>>> calling vm_normal_page() for them.
>>>
>>> Signed-off-by: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
>>> ---
>>> mm/gup.c | 22 ++++++++++++++++++----
>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 84461d384ae2..e185c18c0c81 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page,
>>> return !userfaultfd_pte_wp(vma, pte);
>>> }
>>>
>>> +static struct page *gup_normal_page(struct vm_area_struct *vma,
>>> + unsigned long address, pte_t pte)
>>> +{
>>> + unsigned long pfn;
>>> +
>>> + if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
>>> + pfn = pte_pfn(pte);
>>> + if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
>>> + return NULL;
>>> + return pfn_to_page(pfn);
>>> + }
>>> + return vm_normal_page(vma, address, pte);
>>
>> I enjoy seeing vm_normal_page() checks in GUP code.
>>
>> I don't enjoy seeing what you added before that :)
>>
>> If vm_normal_page() tells you "this is not a normal page", then we
>> should not touch it. There is one exception: the shared zeropage.
>>
>>
>> So, unfortunately, this is wrong.
>>
>
> Well, let's talk about a proper fix then for the previously mentioned
> user-space regression.
You really have to find out the responsible commit. GUP has been
behaving like that forever, I'm afraid.
And VM_PFNMAP was already set in drm_gem_mmap_obj() at least as far back
as 2012, if I am not wrong.
--
Cheers,
David / dhildenb