Message-ID: <aFE9YTNcCHAGBtKi@localhost.localdomain>
Date: Tue, 17 Jun 2025 12:03:13 +0200
From: Oscar Salvador <osalvador@...e.de>
To: David Hildenbrand <david@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Muchun Song <muchun.song@...ux.dev>,
James Houghton <jthoughton@...gle.com>,
Peter Xu <peterx@...hat.com>, Gavin Guo <gavinguo@...lia.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in
the faulting path
On Mon, Jun 16, 2025 at 04:41:20PM +0200, David Hildenbrand wrote:
> On 16.06.25 16:10, Oscar Salvador wrote:
> > What do you mean by stable?
>
> The same "stable" you used in the doc, that I complained about ;)
Touché :-D
> > In the generic faulting path, we're not worried about the page going away
> > because we hold a reference, so I guess the lock must be to keep content stable?
>
> What you want to avoid, IIRC, is someone doing a truncation/reclaim on the
> folio while you are mapping it.
Ok, I see. I thought it was more about holding off writes, but this makes sense.
> Take a look at truncate_inode_pages_range() where we do a folio_lock()
> around truncate_inode_folio().
>
> In other words, while you hold the folio lock (and verified that the folio
> was not truncated yet: for example, that folio->mapping is still set), you
> know that it cannot get truncated concurrently -- without holding other
> expensive locks.
>
> Observe how truncate_cleanup_folio() calls
>
> if (folio_mapped(folio))
> unmap_mapping_folio(folio);
>
> to remove all page table mappings.
>
> So while holding the folio lock, new page table mappings are not expected to
> appear (IIRC).
Ah ok, so it's more that we don't end up mapping something that's not there
anymore (or something completely different).
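Just to check I got the pattern right, it would be something like this
(a simplified sketch of the filemap-style check, not the exact kernel code;
the retry label is only illustrative):

	folio_lock(folio);
	if (unlikely(!folio->mapping)) {
		/*
		 * We raced with truncation: the folio is gone from the
		 * pagecache, so we must not map it.
		 */
		folio_unlock(folio);
		goto retry;	/* illustrative: re-do the pagecache lookup */
	}
	/*
	 * From here until folio_unlock(), truncation of this folio is
	 * blocked, so it is safe to map it into the pagetables.
	 */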
> > I mean, yes, after we have mapped the page privately into the pagetables,
> > we no longer have any business with content integrity, so given this rule, yes,
> > I guess hugetlb_wp() wouldn't need the lock (for !anonymous) because we have
> > already mapped it privately at that point.
>
> That's my understanding. And while holding the PTL it cannot get unmapped.
> Whenever you temporarily drop the PTL, you have to do a pte_same() check to
> make sure concurrent truncation didn't happen.
Yap, hugetlb_wp() drops the locks temporarily when it needs to unmap the private
page from other processes, but then does the pte_same() check.
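For reference, the recheck pattern is roughly this (a sketch, not the exact
hugetlb_wp() code; the huge_ptep_get() signature varies between kernel
versions, and mm/vma/h/vmf are as in hugetlb_wp()):

	spin_unlock(vmf->ptl);
	/* ... sleeping work, e.g. unmapping the page from other processes ... */
	spin_lock(vmf->ptl);
	vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
	if (vmf->pte &&
	    pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), vmf->orig_pte)) {
		/* PTE unchanged under us: no concurrent truncation/unmap. */
	}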
> That's at least my understanding of the common filemap code.
>
> >
> > But there's something I don't fully understand and it makes me feel uneasy.
> > If the lock in the generic fault-in path is to keep the content stable until we
> > have mapped it privately, wouldn't it be more correct to also hold it
> > during the copy in hugetlb_wp(), to kind of emulate that?
> As long as there is a page table mapping, it cannot get truncated. So if you
> find a PTE under the PTL that maps that folio, truncation cannot have
> happened.
I see, this makes a lot of sense, thanks for walking me through this, David!
Alright then, with all this clear, we should:
- Not take the folio lock in hugetlb_fault() before calling hugetlb_wp();
  hugetlb_wp() will take it itself if the folio is anonymous (re-use check)
- Drop the lock in hugetlb_no_page() as soon as we have mapped the page in
  the pagetables
- Have hugetlb_wp() take the lock IFF the folio is anonymous
This will lead to something like the following:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfa09fc3b2c6..4d48cda8a56d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
* in scenarios that used to work. As a side effect, there can still
* be leaks between processes, for example, with FOLL_GET users.
*/
+ if (folio_test_anon(old_folio))
+ folio_lock(old_folio);
if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
if (!PageAnonExclusive(&old_folio->page)) {
folio_move_anon_rmap(old_folio, vma);
@@ -6212,6 +6214,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
}
VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
PageAnonExclusive(&old_folio->page), &old_folio->page);
+ if (folio_test_anon(old_folio))
+ folio_unlock(old_folio);
/*
* If the process that created a MAP_PRIVATE mapping is about to perform
@@ -6537,11 +6541,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
}
new_pagecache_folio = true;
} else {
- /*
- * hugetlb_wp() expects the folio to be locked in order to
- * check whether we can re-use this page exclusively for us.
- */
- folio_lock(folio);
anon_rmap = 1;
}
} else {
@@ -6558,7 +6557,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
/* Check for page in userfault range. */
if (userfaultfd_minor(vma)) {
- folio_unlock(folio);
+ if (!anon_rmap)
+ folio_unlock(folio);
folio_put(folio);
/* See comment in userfaultfd_missing() block above */
if (!hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) {
@@ -6604,6 +6604,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
new_pte = huge_pte_mkuffd_wp(new_pte);
set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h));
+ /*
+ * This folio cannot have been truncated since we were holding the lock,
+ * and we just mapped it into the pagetables. Drop the lock now.
+ */
+ if (!anon_rmap)
+ folio_unlock(folio);
+
hugetlb_count_add(pages_per_huge_page(h), mm);
if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
@@ -6619,8 +6626,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
*/
if (new_folio)
folio_set_hugetlb_migratable(folio);
-
- folio_unlock(folio);
out:
hugetlb_vma_unlock_read(vma);
@@ -6639,8 +6644,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
backout_unlocked:
if (new_folio && !new_pagecache_folio)
restore_reserve_on_error(h, vma, vmf->address, folio);
-
- folio_unlock(folio);
+ if (!anon_rmap)
+ folio_unlock(folio);
folio_put(folio);
goto out;
}
@@ -6805,21 +6810,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Fallthrough to CoW */
}
- /*
- * We need to lock the folio before calling hugetlb_wp().
- * Either the folio is in the pagecache and we need to copy it over
- * to another file, so it must remain stable throughout the operation,
- * or the folio is anonymous and we need to lock it in order to check
- * whether we can re-use it and mark it exclusive for this process.
- * The timespan for the lock differs depending on the type, since
- * anonymous folios only need to hold the lock while checking whether we
- * can re-use it, while we need to hold it throughout the copy in case
- * we are dealing with a folio from a pagecache.
- * Representing this difference would be tricky with the current code,
- * so just hold the lock for the duration of hugetlb_wp().
- */
folio = page_folio(pte_page(vmf.orig_pte));
- folio_lock(folio);
folio_get(folio);
if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
@@ -6835,7 +6826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, vmf.address, vmf.pte);
out_put_page:
- folio_unlock(folio);
folio_put(folio);
out_ptl:
spin_unlock(vmf.ptl);
This should be patch #2, with something like "Sorting out locking" as the
title, and maybe explain a bit more why hugetlb_wp() needs the lock for
anonymous folios.
What do you think?
--
Oscar Salvador
SUSE Labs