linux-kernel - Re: [PATCH 1/1] Fix zero copy I/O on __get_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250508182356.45dbfd40@sarc.samsung.com>
Date: Thu, 8 May 2025 18:23:56 +0300
From: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
To: David Hildenbrand <david@...hat.com>
CC: Andrew Morton <akpm@...ux-foundation.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, Artem Krupotkin <artem.k@...sung.com>,
	Charles Briere <c.briere@...sung.com>, Wade Farnsworth
	<wade.farnsworth@...mens.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated
 pages

On Thu, 8 May 2025 17:03:46 +0200
David Hildenbrand <david@...hat.com> wrote:

Hi there,

> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, > Recent updates
> to net filesystems enabled zero copy operations, > which require
> getting a user space page pinned. > > This does not work for pages
> that were allocated via __get_user_pages 
> On 07.05.25 17:41, Pantelis Antoniou wrote:
> 
> Hi,
> 
> > Recent updates to net filesystems enabled zero copy operations,
> > which require getting a user space page pinned.
> > 
> > This does not work for pages that were allocated via
> > __get_user_pages and then mapped to user-space via remap_pfn_rage.
> 
> Right. Because the struct page of a VM_PFNMAP *must not be touched*.
> It has to be treated like it doesn't exist.
> 

Well, that's not exactly the case. For pages mapped to user space via
remap_pfn_range() the VM_PFNMAP bit is set even though the pages do
have a struct page.

The details of how it happens are at the cover page of this patch but
let me paste the relevant bits here.

"In our emulation environment we have noticed failing writes when
performing I/O from a userspace mapped DRM GEM buffer object.
The platform does not use VRAM, all graphics memory is regular DRAM
memory, allocated via __get_free_pages

The same write was successful from a heap allocated bounce buffer.

The sequence of events is as follows.

1. A BO (Buffer Object) is created, and it's backing memory is allocated via
   __get_user_pages()

2. Userspace mmaps a BO (Buffer Object) via a mmap call on the opened
   file handle of a DRM driver. The mapping is done via the
   drm_gem_mmap_obj() call.

3. Userspace issues a write to a file copying the contents of the BO.

3a. If the file is located on regular filesystem (like ext4), the write
    completes successfully.

3b. If the file is located on a network filesystem, like 9p the write fails.

The write fails because v9fs_file_write_iter() will call
netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked() which will 
call netfs_extract_user_iter() 

netfs_extract_user_iter() will in turn call iov_iter_extract_pages() which for
a user backed iterator will call iov_iter_extract_user_pages which will call
pin_user_pages_fast() which finally will call __gup_longterm_locked().

__gup_longterm_locked() will call __get_user_pages_locked() which will fail
because the VMA is marked with the VM_IO and VM_PFNMAP flags."

> > 
> > remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits.
> > VM_PFNMAP in particular mark the pages as not having struct_page
> > associated with them, which is not the case for __get_user_pages()
> > 
> > This in turn makes any attempt to lock a page fail, and breaking
> > I/O from that address range.
> > 
> > This patch address it by special casing pages in those VMAs and not
> > calling vm_normal_page() for them.
> > 
> > Signed-off-by: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
> > ---
> >   mm/gup.c | 22 ++++++++++++++++++----
> >   1 file changed, 18 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 84461d384ae2..e185c18c0c81 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t
> > pte, struct page *page, return !userfaultfd_pte_wp(vma, pte);
> >   }
> >   
> > +static struct page *gup_normal_page(struct vm_area_struct *vma,
> > +		unsigned long address, pte_t pte)
> > +{
> > +	unsigned long pfn;
> > +
> > +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> > +		pfn = pte_pfn(pte);
> > +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn >
> > highest_memmap_pfn)
> > +			return NULL;
> > +		return pfn_to_page(pfn);
> > +	}
> > +	return vm_normal_page(vma, address, pte);
> 
> I enjoy seeing vm_normal_page() checks in GUP code.
> 
> I don't enjoy seeing what you added before that :)
> 
> If vm_normal_page() tells you "this is not a normal", then we should
> not touch it. There is one exception: the shared zeropage.
> 
> 
> So, unfortunately, this is wrong.
> 

Well, lets talk about a proper fix then for the previously mentioned
user-space regression.

Regards

-- Pantelis