linux-ext4 - Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")


Message-ID: <0fec500c-52ea-473d-b276-826c0f4dd76f@huaweicloud.com>
Date: Wed, 22 Oct 2025 10:46:45 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Karol Wachowski <karol.wachowski@...ux.intel.com>
Cc: tytso@....edu, adilger.kernel@...ger.ca, linux-mm@...ck.org,
 linux-ext4@...r.kernel.org
Subject: Re: Possible regression in pin_user_pages_fast() behavior after
 commit 7ac67301e82f ("ext4: enable large folio for regular file")

[add mm list to CC]

On 10/20/2025 4:47 PM, Karol Wachowski wrote:
> Hi,
> 
> I can reproduce this on Intel's x86 (Meteor Lake and Lunar Lake Intel CPUs
> but I believe it's not platform dependent). It reproduces on stable.
> I have bisected this to the mentioned commit: 7ac67301e82f02b77a5c8e7377a1f414ef108b84
> and it reproduces every time if that commit is present. I have attached a patch at the
> end of this message that provides a very simple driver that creates character device
> which calls pin_user_pages_fast() on user provided user pointer and simple test application
> that creates 2 MB file on a filesystem (you have to ensure it's location is on ext4) and
> does IOCTL with pointer obtained through mmap of that file with specific flags to reproduce
> the issue.
> 
> When it reproduces user application hangs indefinitely and has to be interrupted.
> 
> I have also noticed that if we don't write to the file prior to mmap or the write size is less than
> 2 MB issue does not reproduce.
> 
> Patch with reproductor is attached at the end of this message, please let me know if that helps or
> if there's anything else I can provide to help to determine if it's a real issue.
> 
> -
> Karol
> 
Thank you for the reproducer. I can reproduce this issue on my x86 virtual
machine. After debugging and analyzing it, I found that this is not a
filesystem issue: it can be reproduced on any filesystem that supports
large folios, such as XFS. Still, IIUC, I think it's a real issue.

The root cause of this issue is that calling pin_user_pages_fast() triggers
an infinite loop in __get_user_pages() when a PMD-sized (2MB on x86),
COW-mmapped large folio is passed in to be pinned. On x86, the issue is
triggered by the following sequence:

1. Call mmap with a 2MB size in MAP_PRIVATE mode for a file that has a 2MB
   folio installed in the page cache.

   addr = mmap(NULL, 2 * 1024 * 1024, PROT_READ, MAP_PRIVATE, file_fd, 0);

2. The kernel driver passes this mapped address to pin_user_pages_fast() in
   FOLL_LONGTERM mode.

   pin_user_pages_fast(addr, nr_pages, FOLL_LONGTERM, pages);

  ->  pin_user_pages_fast()
  |   gup_fast_fallback()
  |    __gup_longterm_locked()
  |     __get_user_pages_locked()
  |      __get_user_pages()
  |       follow_page_mask()
  |        follow_p4d_mask()
  |         follow_pud_mask()
  |          follow_pmd_mask() //pmd_leaf(pmdval) is true since a leaf pmd
  |                            //is installed. This is normal in the first
  |                            //round, but it shouldn't happen in the
  |                            //second round.
  |           follow_huge_pmd() //gup_must_unshare() is always true
  |            return -EMLINK
  |   faultin_page()
  |    handle_mm_fault()
  |     wp_huge_pmd() //split pmd and fault back to PTE
  |     handle_pte_fault()
  |      do_pte_missing()
  |       do_fault()
  |        do_read_fault() //FAULT_FLAG_WRITE is not set
  |         finish_fault()
  |          do_set_pmd() //install leaf pmd again, I think this is wrong!!!
  |      do_wp_page() //copy private anon pages
  <-    goto retry
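
The user-space half of step 1 can be sketched roughly as follows. This is
only a minimal sketch, not Karol's test application: the helper name
map_private_readonly(), the REPRO_LEN constant and the fill byte are made
up for illustration, and the ioctl into the test driver is left out because
its interface is not shown in this message. Whether the page cache actually
ends up holding a single 2MB folio depends on the filesystem and kernel
configuration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REPRO_LEN (2UL * 1024 * 1024) /* PMD size on x86-64 (hypothetical name) */

/*
 * Hypothetical helper: write REPRO_LEN bytes to 'path', then map the file
 * with PROT_READ and MAP_PRIVATE, matching the mmap() call in step 1.
 * Returns MAP_FAILED on any error.
 */
static void *map_private_readonly(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return MAP_FAILED;

	char *buf = malloc(REPRO_LEN);
	if (!buf) {
		close(fd);
		return MAP_FAILED;
	}
	memset(buf, 0xab, REPRO_LEN);

	/* Populate the page cache with the full 2MB; retry on short writes. */
	size_t done = 0;
	while (done < REPRO_LEN) {
		ssize_t n = write(fd, buf + done, REPRO_LEN - done);
		if (n <= 0) {
			free(buf);
			close(fd);
			return MAP_FAILED;
		}
		done += (size_t)n;
	}
	free(buf);
	assert(done == REPRO_LEN);

	void *addr = mmap(NULL, REPRO_LEN, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping stays valid after the descriptor is closed */
	return addr;
}
```

The returned address is what the driver would then hand to
pin_user_pages_fast() in step 2.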

Because do_read_fault() incorrectly installs a leaf PMD again,
follow_pmd_mask() always returns -EMLINK, causing an infinite loop. Under
normal circumstances, I suppose it should fall back to do_wp_page(), which
installs the anonymous page into the PTE. This is also why mappings smaller
than 2MB do not trigger this issue. In addition, if you add FOLL_WRITE when
calling pin_user_pages_fast(), the issue is not triggered either, because
do_fault() calls do_cow_fault() to create anonymous pages.
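
To make the loop easier to see, here is a toy user-space model of the retry
logic. It is only a sketch under simplifying assumptions: every name in it
(struct model, model_follow(), model_fault(), model_pin(), guard_enabled)
is hypothetical and none of it is kernel code. It only mirrors the analysis
above: the lookup keeps returning -EMLINK for as long as a read-only leaf
PMD is mapped, and the fault path either installs the leaf PMD again or
leaves the private copy at the PTE level.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy state of the 2MB range in the page tables (hypothetical model). */
enum model_map { MODEL_UNMAPPED, MODEL_LEAF_PMD_RO, MODEL_PTE_ANON };

struct model {
	enum model_map map;
	bool guard_enabled; /* models a check that refuses the leaf PMD */
};

/* Lookup step: a read-only leaf PMD must be unshared before pinning. */
static int model_follow(const struct model *m)
{
	return m->map == MODEL_LEAF_PMD_RO ? -EMLINK : 0;
}

/* Fault step taken after the lookup returned -EMLINK. */
static void model_fault(struct model *m)
{
	/* Split the leaf PMD and fall back to the PTE level. */
	m->map = MODEL_UNMAPPED;

	if (!m->guard_enabled) {
		/* The read-fault path maps the folio with a leaf PMD again. */
		m->map = MODEL_LEAF_PMD_RO;
		return;
	}

	/* Leaf PMD refused: the private anonymous copy lands in the PTEs. */
	m->map = MODEL_PTE_ANON;
}

/*
 * Retry loop. Returns the number of fault rounds that were needed, or -1
 * if max_rounds was reached without the lookup ever succeeding.
 */
static int model_pin(struct model *m, int max_rounds)
{
	int rounds = 0;

	assert(max_rounds > 0);
	while (model_follow(m) == -EMLINK) {
		if (rounds == max_rounds)
			return -1;
		model_fault(m);
		rounds++;
	}
	return rounds;
}
```

Starting from MODEL_LEAF_PMD_RO, the model without the guard never leaves
the loop and only stops at the round cap, while the model with the guard
finishes after a single fault round.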

The above is my analysis. I tried the following fix, which solves the
issue (I haven't done a full test yet). However, I am not an expert in the
MM field and might have missed something, so this needs to be reviewed by
MM experts.

Best regards,
Yi.

diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..64846a030a5b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5342,6 +5342,10 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;

+	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) &&
+	    !pmd_write(*vmf->pmd))
+		return ret;
+
 	if (folio_order(folio) != HPAGE_PMD_ORDER)
 		return ret;
 	page = &folio->page;