linux-kernel - Re: [PATCH 1/1] Fix zero copy I/O on __get_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250508184720.17bd1f62@sarc.samsung.com>
Date: Thu, 8 May 2025 18:47:20 +0300
From: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
To: David Hildenbrand <david@...hat.com>
CC: Andrew Morton <akpm@...ux-foundation.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, Artem Krupotkin <artem.k@...sung.com>,
	Charles Briere <c.briere@...sung.com>, Wade Farnsworth
	<wade.farnsworth@...mens.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH 1/1] Fix zero copy I/O on __get_user_pages allocated
 pages

On Thu, 8 May 2025 17:37:04 +0200
David Hildenbrand <david@...hat.com> wrote:

> On 08. 05. 25 17: 23, Pantelis Antoniou wrote: > On Thu, 8 May 2025
> 17: 03: 46 +0200 > David Hildenbrand <david@ redhat. com> wrote: > >
> Hi there, > >> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, >
> Recent updates 
> On 08.05.25 17:23, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 17:03:46 +0200
> > David Hildenbrand <david@...hat.com> wrote:
> > 
> > Hi there,
> > 
> >> On 07. 05. 25 17: 41, Pantelis Antoniou wrote: Hi, > Recent updates
> >> to net filesystems enabled zero copy operations, > which require
> >> getting a user space page pinned. > > This does not work for pages
> >> that were allocated via __get_user_pages
> >> On 07.05.25 17:41, Pantelis Antoniou wrote:
> >>
> >> Hi,
> >>
> >>> Recent updates to net filesystems enabled zero copy operations,
> >>> which require getting a user space page pinned.
> >>>
> >>> This does not work for pages that were allocated via
> >>> __get_user_pages and then mapped to user-space via remap_pfn_rage.
> >>
> >> Right. Because the struct page of a VM_PFNMAP *must not be
> >> touched*. It has to be treated like it doesn't exist.
> >>
> > 
> > Well, that's not exactly the case. For pages mapped to user space
> > via remap_pfn_range() the VM_PFNMAP bit is set even though the
> > pages do have a struct page.
> 
> Yes. And VM_PFNMAP is the big flag that these pages shall not be 
> touched. Even if it exists. Even if you say please. :)
> 
> See the comment above vm_normal_page():
> 
> 	"Special" mappings do not wish to be associated with a "struct
> 	page" (either it doesn't exist, or it exists but they don't
> want to touch it)
> 
> VM_MIXEDMAP could maybe be used for that purpose: possibly GUP also
> has to be updated to make use of that. (I was hoping we can get rid
> of VM_MIXEDMAP at some point)
> 
> 
> > 
> > The details of how it happens are at the cover page of this patch
> > but let me paste the relevant bits here.
> > 
> > "In our emulation environment we have noticed failing writes when
> > performing I/O from a userspace mapped DRM GEM buffer object.
> > The platform does not use VRAM, all graphics memory is regular DRAM
> > memory, allocated via __get_free_pages
> > 
> > The same write was successful from a heap allocated bounce buffer.
> > 
> > The sequence of events is as follows.
> > 
> > 1. A BO (Buffer Object) is created, and it's backing memory is
> > allocated via __get_user_pages()
> 
> __get_user_pages() only allocates memory via page faults. Are you
> sure you meant __get_user_pages() here?
> 

Oops, yeah, __get_free_pages(). Apologies for the confusion.

> > 
> > 2. Userspace mmaps a BO (Buffer Object) via a mmap call on the
> > opened file handle of a DRM driver. The mapping is done via the
> >     drm_gem_mmap_obj() call.
> > 
> > 3. Userspace issues a write to a file copying the contents of the
> > BO.
> > 
> > 3a. If the file is located on regular filesystem (like ext4), the
> > write completes successfully.
> > 
> > 3b. If the file is located on a network filesystem, like 9p the
> > write fails.
> > 
> > The write fails because v9fs_file_write_iter() will call
> > netfs_unbuffered_write_iter(), netfs_unbuffered_write_iter_locked()
> > which will call netfs_extract_user_iter()
> > 
> > netfs_extract_user_iter() will in turn call
> > iov_iter_extract_pages() which for a user backed iterator will call
> > iov_iter_extract_user_pages which will call pin_user_pages_fast()
> > which finally will call __gup_longterm_locked().
> > 
> > __gup_longterm_locked() will call __get_user_pages_locked() which
> > will fail because the VMA is marked with the VM_IO and VM_PFNMAP
> > flags."
> 
> So, drm_gem_mmap_obj() ends up using remap_pfn_rage()?
> 

Yes.

> I can spot that drm_gem_mmap_obj() has a path where it explicitly sets
> 
> 	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND |
> 		     VM_DONTDUMP);
> 
> Which is a clear sign to core-MM (incl. GUP) to never mess with the 
> mapped pages.
> 

Well, let just say this not quite right for pages that are normally
allocated via __get_free_pages().

DRM has to handle both VRAM and regular system memory maps so maybe it's
playing it safe here.

> > 
> >>>
> >>> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma
> >>> bits. VM_PFNMAP in particular mark the pages as not having
> >>> struct_page associated with them, which is not the case for
> >>> __get_user_pages()
> >>>
> >>> This in turn makes any attempt to lock a page fail, and breaking
> >>> I/O from that address range.
> >>>
> >>> This patch address it by special casing pages in those VMAs and
> >>> not calling vm_normal_page() for them.
> >>>
> >>> Signed-off-by: Pantelis Antoniou <p.antoniou@...tner.samsung.com>
> >>> ---
> >>>    mm/gup.c | 22 ++++++++++++++++++----
> >>>    1 file changed, 18 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 84461d384ae2..e185c18c0c81 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(pte_t
> >>> pte, struct page *page, return !userfaultfd_pte_wp(vma, pte);
> >>>    }
> >>>    
> >>> +static struct page *gup_normal_page(struct vm_area_struct *vma,
> >>> +		unsigned long address, pte_t pte)
> >>> +{
> >>> +	unsigned long pfn;
> >>> +
> >>> +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> >>> +		pfn = pte_pfn(pte);
> >>> +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn >
> >>> highest_memmap_pfn)
> >>> +			return NULL;
> >>> +		return pfn_to_page(pfn);
> >>> +	}
> >>> +	return vm_normal_page(vma, address, pte);
> >>
> >> I enjoy seeing vm_normal_page() checks in GUP code.
> >>
> >> I don't enjoy seeing what you added before that :)
> >>
> >> If vm_normal_page() tells you "this is not a normal", then we
> >> should not touch it. There is one exception: the shared zeropage.
> >>
> >>
> >> So, unfortunately, this is wrong.
> >>
> > 
> > Well, lets talk about a proper fix then for the previously mentioned
> > user-space regression.
> 
> You really have to find out the responsible commit. GUP has been 
> behaving like that forever I'm afraid.
> 
> And even the VM_PFNMAP was in drm_gem_mmap_obj() already at least in 
> 2012 if I am not wrong.
> 

There is no single responsible commit, it was broken forever.
It is just that no-one has ever tried to pin the pages to perform I/O
before now.

Regards

-- Pantelis