linux-kernel - Re: [PATCH v3] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex

Message-ID: <aDeBUXCRLRZobHq0@localhost.localdomain>
Date: Wed, 28 May 2025 23:34:09 +0200
From: Oscar Salvador <osalvador@...e.de>
To: David Hildenbrand <david@...hat.com>
Cc: Peter Xu <peterx@...hat.com>, Gavin Guo <gavinguo@...lia.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	muchun.song@...ux.dev, akpm@...ux-foundation.org,
	mike.kravetz@...cle.com, kernel-dev@...lia.com,
	stable@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
	Florent Revest <revest@...gle.com>, Gavin Shan <gshan@...hat.com>
Subject: Re: [PATCH v3] mm/hugetlb: fix a deadlock with pagecache_folio and
 hugetlb_fault_mutex_table

On Wed, May 28, 2025 at 10:26:04PM +0200, David Hildenbrand wrote:
> Digging a bit:
> 
> commit 56c9cfb13c9b6516017eea4e8cbe22ea02e07ee6
> Author: Naoya Horiguchi <nao.horiguchi@...il.com>
> Date:   Fri Sep 10 13:23:04 2010 +0900
> 
>     hugetlb, rmap: fix confusing page locking in hugetlb_cow()
>     The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
>     looks confusing and is buggy.  Originally this trylock_page() was
>     intended to make sure that old_page is locked even when old_page !=
>     pagecache_page, because then only pagecache_page is locked.
> 
> Added the comment
> 
> +       /*
> +        * hugetlb_cow() requires page locks of pte_page(entry) and
> +        * pagecache_page, so here we need take the former one
> +        * when page != pagecache_page or !pagecache_page.
> +        * Note that locking order is always pagecache_page -> page,
> +        * so no worry about deadlock.
> +        */
> 
> 
> And
> 
> commit 0fe6e20b9c4c53b3e97096ee73a0857f60aad43f
> Author: Naoya Horiguchi <nao.horiguchi@...il.com>
> Date:   Fri May 28 09:29:16 2010 +0900
> 
>     hugetlb, rmap: add reverse mapping for hugepage
>     This patch adds reverse mapping feature for hugepage by introducing
>     mapcount for shared/private-mapped hugepage and anon_vma for
>     private-mapped hugepage.
>     While hugepage is not currently swappable, reverse mapping can be useful
>     for memory error handler.
>     Without this patch, memory error handler cannot identify processes
>     using the bad hugepage nor unmap it from them. That is:
>     - for shared hugepage:
>       we can collect processes using a hugepage through pagecache,
>       but can not unmap the hugepage because of the lack of mapcount.
>     - for privately mapped hugepage:
>       we can neither collect processes nor unmap the hugepage.
>     This patch solves these problems.
>     This patch include the bug fix given by commit 23be7468e8, so reverts it.
> 
> Added the real locking magic.

Yes, I have been checking "hugetlb, rmap: add reverse mapping for
hugepage", which added locking the now-so-called 'old_folio' in case
hugetlbfs_pagecache_page() didn't return anything.

Because in hugetlb_cow() (what is now hugetlb_wp()), this was added:

 @@ -2286,8 +2299,11 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
  retry_avoidcopy:
         /* If no-one else is actually using this page, avoid the copy
          * and just make the page writable */
 -       avoidcopy = (page_count(old_page) == 1);
 +       avoidcopy = (page_mapcount(old_page) == 1);
         if (avoidcopy) {
 +               if (!trylock_page(old_page))
 +                       if (PageAnon(old_page))
 +                               page_move_anon_rmap(old_page, vma, address);

So, as you mentioned, it was done to keep the rmap stable, as I guess the
rmap code tests the page lock.


> Not that much changed regarding locking until COW support was added in
> 
> commit 1e8f889b10d8d2223105719e36ce45688fedbd59
> Author: David Gibson <david@...son.dropbear.id.au>
> Date:   Fri Jan 6 00:10:44 2006 -0800
> 
>     [PATCH] Hugetlb: Copy on Write support
>     Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
>     supported.  This helps us to safely use hugetlb pages in many more
>     applications.  The patch makes the following changes.  If needed, I also have
>     it broken out according to the following paragraphs.
> 
> 
> Confusing.
> 
> Locking the *old_folio* when calling hugetlb_wp() makes sense when it is
> an anon folio because we might want to call folio_move_anon_rmap() to adjust the rmap root.

Yes, this is clear.

> Locking the pagecache folio when calling hugetlb_wp() if old_folio is an anon folio ...
> does not make sense to me.

I think this one is also clear.

> Locking the pagecache folio when calling hugetlb_wp if old_folio is a pagecache folio ...
> also doesn't quite make sense for me.
> Again, we don't take the lock for ordinary pages, so what's special about hugetlb for the last
> case (reservations, I assume?).

So, this case is when pagecache_folio == old_folio.

I guess we are talking about resv_maps? But I think we cannot interfere there.
For the reserves to be modified the page has to go away.

Now, I have been checking this one too:

 commit 04f2cbe35699d22dbf428373682ead85ca1240f5
 Author: Mel Gorman <mel@....ul.ie>
 Date:   Wed Jul 23 21:27:25 2008 -0700
 
     hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed

And I think it is interesting.
That one added this chunk in hugetlb_fault():

 @@ -1126,8 +1283,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
         spin_lock(&mm->page_table_lock);
         /* Check for a racing update before calling hugetlb_cow */
         if (likely(pte_same(entry, huge_ptep_get(ptep))))
 -               if (write_access && !pte_write(entry))
 -                       ret = hugetlb_cow(mm, vma, address, ptep, entry);
 +               if (write_access && !pte_write(entry)) {
 +                       struct page *page;
 +                       page = hugetlbfs_pagecache_page(vma, address);
 +                       ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
 +                       if (page) {
 +                               unlock_page(page);
 +                               put_page(page);
 +                       }
 +               }

So, it finds and locks the page in the pagecache, and calls hugetlb_cow.
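
To make the pairing in that hunk explicit, here is a minimal userspace
sketch. It is a toy model and not kernel code: toy_page,
toy_find_lock_page() and toy_fault() are hypothetical names that only
mirror the shape of the hunk (the lookup returns the page locked and with
an extra reference, and the caller undoes both only if the lookup found
something).

```c
/* Userspace toy model, NOT kernel code: every toy_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	int refcount;	/* stands in for the page refcount */
	bool locked;	/* stands in for the page lock */
};

/* Models the lookup: returns the page locked with an extra reference, or NULL. */
static struct toy_page *toy_find_lock_page(struct toy_page *cached)
{
	if (!cached)
		return NULL;		/* nothing in the "pagecache" at this offset */
	cached->refcount++;
	assert(!cached->locked);	/* single-threaded model: the lock is always free */
	cached->locked = true;
	return cached;
}

/* Models the fault path: lookup, COW step, then undo the lookup if it succeeded. */
static int toy_fault(struct toy_page *cached)
{
	struct toy_page *page = toy_find_lock_page(cached);
	int ret = 0;	/* the hugetlb_cow()-like step would set this */

	if (page) {
		page->locked = false;	/* mirrors unlock_page() */
		page->refcount--;	/* mirrors put_page() */
	}
	return ret;
}
```

In this model, a call with a cached page leaves its refcount and lock state
exactly as they were before, and a call with NULL touches nothing.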

hugetlb_fault() takes hugetlb_instantiation_mutex, and there is a
comment saying:

        /*
         * Serialize hugepage allocation and instantiation, so that we don't
         * get spurious allocation failures if two CPUs race to instantiate
         * the same page in the page cache.
         */
        mutex_lock(&hugetlb_instantiation_mutex);

But it does not say anything about truncation.
Actually, checking the truncation code from back then, neither
truncate_hugepages() nor any of its callers took hugetlb_instantiation_mutex,
unlike today (e.g. the current remove_inode_hugepages() code).

Back then, truncate_hugepages() relied only on lock_page():

 static void truncate_hugepages(struct inode *inode, loff_t lstart)
 {
  ...
  ...
  lock_page(page);
  truncate_huge_page(page);
  unlock_page(page);
 }

Today, by contrast, remove_inode_hugepages() takes the mutex and also the
folio lock, and then zaps the page and does its thing with resv_maps.

So I think that we should not even need the lock for hugetlb_wp
when pagecache_folio == old_folio (pagecache), because the mutex
already protects us from the page going away (e.g. being truncated), right?
Besides, we hold a reference on that page, since
filemap_lock_hugetlb_folio() locks the page and increases its refcount.
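
As a rough illustration of that argument, here is a userspace toy model
using pthreads. It is only a sketch under the assumption stated above, that
truncation and the fault path serialize on the same mutex. It is not kernel
code, and toy_folio, toy_fault_mutex, toy_truncate() and toy_write_fault()
are hypothetical names. The fault side takes no per-folio lock and still
sees stable pagecache membership while it holds the mutex.

```c
/* Userspace toy model, NOT kernel code: every toy_* name is hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct toy_folio {
	int refcount;		/* stands in for the folio refcount */
	bool in_pagecache;	/* true until "truncation" removes the folio */
};

/* Stands in for one slot of hugetlb_fault_mutex_table. */
static pthread_mutex_t toy_fault_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Models the truncation side: it takes the fault mutex before removing. */
static void toy_truncate(struct toy_folio *folio)
{
	pthread_mutex_lock(&toy_fault_mutex);
	if (folio->in_pagecache) {
		folio->in_pagecache = false;
		folio->refcount--;	/* drop the pagecache reference */
	}
	pthread_mutex_unlock(&toy_fault_mutex);
}

/*
 * Models the write-fault side for pagecache_folio == old_folio.
 * Returns true if the folio stayed in the pagecache for the whole
 * write-protect step, false if truncation had already removed it.
 */
static bool toy_write_fault(struct toy_folio *folio)
{
	bool stable;

	pthread_mutex_lock(&toy_fault_mutex);
	if (!folio->in_pagecache) {
		pthread_mutex_unlock(&toy_fault_mutex);
		return false;
	}
	folio->refcount++;		/* the lookup takes a reference */

	/* The hugetlb_wp()-like work would happen here, without a folio lock. */
	stable = folio->in_pagecache;	/* cannot change: truncation is blocked */

	folio->refcount--;		/* drop the lookup reference */
	pthread_mutex_unlock(&toy_fault_mutex);
	return stable;
}
```

The model says nothing about what else the folio lock might be protecting
in the real code; it only shows the truncation part of the reasoning.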

All in all, I am leaning towards the lock not being needed, but it's
getting late here...


-- 
Oscar Salvador
SUSE Labs