Message-ID: <aDcl2YM5wX-MwzbM@x1.local>
Date: Wed, 28 May 2025 11:03:53 -0400
From: Peter Xu <peterx@...hat.com>
To: Oscar Salvador <osalvador@...e.de>
Cc: Gavin Guo <gavinguo@...lia.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, muchun.song@...ux.dev,
	akpm@...ux-foundation.org, mike.kravetz@...cle.com,
	kernel-dev@...lia.com, stable@...r.kernel.org,
	Hugh Dickins <hughd@...gle.com>, Florent Revest <revest@...gle.com>,
	Gavin Shan <gshan@...hat.com>, David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH v3] mm/hugetlb: fix a deadlock with pagecache_folio and
 hugetlb_fault_mutex_table

On Wed, May 28, 2025 at 11:27:46AM +0200, Oscar Salvador wrote:
> On Wed, May 28, 2025 at 10:33:26AM +0800, Gavin Guo wrote:
> > There is an ABBA deadlock between hugetlb_fault() and hugetlb_wp() on
> > the pagecache folio's lock and the hugetlb global mutex, which is
> > reproducible with syzkaller [1]. As the stack traces below reveal,
> > process-1 tries to take the hugetlb global mutex (A3) while holding the
> > pagecache folio's lock. Process-2 holds the hugetlb global mutex and
> > tries to take the pagecache folio's lock.
> > 
> > Process-1                               Process-2
> > =========                               =========
> > hugetlb_fault
> >    mutex_lock                  (A1)
> >    filemap_lock_hugetlb_folio  (B1)
> >    hugetlb_wp
> >      alloc_hugetlb_folio       #error
> >        mutex_unlock            (A2)
> >                                         hugetlb_fault
> >                                           mutex_lock                  (A4)
> >                                           filemap_lock_hugetlb_folio  (B4)
> >        unmap_ref_private
> >        mutex_lock              (A3)
> > 
> > Fix it by releasing the pagecache folio's lock at (A2) in process-1 so
> > that it is available to process-2 at (B4), avoiding the deadlock. In
> > process-1, a new variable tracks whether the pagecache folio's lock has
> > been released by the callee hugetlb_wp(), to avoid a double unlock in
> > hugetlb_fault(). Similar changes are applied to hugetlb_no_page().
> > 
> > Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
> > Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
> > Cc: <stable@...r.kernel.org>
> > Cc: Hugh Dickins <hughd@...gle.com>
> > Cc: Florent Revest <revest@...gle.com>
> > Reviewed-by: Gavin Shan <gshan@...hat.com>
> > Signed-off-by: Gavin Guo <gavinguo@...lia.com>
> ... 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6a3cf7935c14..560b9b35262a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6137,7 +6137,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> >   */
> >  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > -		       struct vm_fault *vmf)
> > +		       struct vm_fault *vmf,
> > +		       bool *pagecache_folio_locked)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct mm_struct *mm = vma->vm_mm;
> > @@ -6234,6 +6235,18 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> >  			u32 hash;
> >  
> >  			folio_put(old_folio);
> > +			/*
> > +			 * The pagecache_folio has to be unlocked to avoid
> > +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
> > +			 * pagecache_folio could be truncated after being
> > +			 * unlocked. So its state should not be reliable
> > +			 * subsequently.
> > +			 */
> > +			if (pagecache_folio) {
> > +				folio_unlock(pagecache_folio);
> > +				if (pagecache_folio_locked)
> > +					*pagecache_folio_locked = false;
> > +			}
> 
> I have a problem with this patch, as I think it keeps carrying on an
> assumption that is not true.
> 
> I was discussing this matter yesterday with Peter Xu (CCed now), who also
> has some experience in this field.
> 
> What exactly does pagecache_folio's lock protect us against when
> pagecache_folio != old_folio?
> 
> There are two cases here:
> 
> 1) pagecache_folio = old_folio  (original page in the pagecache)
> 2) pagecache_folio != old_folio (original page has already been mapped
>                                  privately and CoWed, old_folio contains
> 				 the new folio)
> 
> For case 1), we need to hold the lock because we are copying old_folio
> to the new one in hugetlb_wp(). That is clear.

So I'm not 100% sure we need the folio lock even for the copy; IIUC a
refcount would be enough?
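
To sketch the idea (a minimal userspace analogue, not the hugetlb code;
buf, buf_get() and buf_put() are made-up names): a reference is enough to
pin the memory while copying from it, and the lock would only matter if
someone could rewrite the contents under us.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdlib.h>
        #include <string.h>

        struct buf {
                atomic_int refcount;    /* stands in for the folio refcount */
                pthread_mutex_t lock;   /* stands in for the folio lock, deliberately not taken */
                size_t len;
                char data[];
        };

        static struct buf *buf_get(struct buf *b)
        {
                atomic_fetch_add(&b->refcount, 1);      /* like folio_get() */
                return b;
        }

        static void buf_put(struct buf *b)
        {
                if (atomic_fetch_sub(&b->refcount, 1) == 1)     /* like folio_put() */
                        free(b);
        }

        /* Copy without taking b->lock: the reference alone keeps b alive. */
        static void copy_from(struct buf *src, char *dst)
        {
                buf_get(src);
                memcpy(dst, src->data, src->len);
                buf_put(src);
        }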

> 
> But for case 2), unless I am missing something, we do not really need the
> pagecache_folio's lock at all, do we? (only old_folio's lock)
> The only reason pagecache_folio gets looked up in the pagecache is to check
> whether the current task has mapped and faulted in the file privately, which
> means that a reservation has been consumed (a new folio was allocated).
> That is what the whole dance about "old_folio != pagecache_folio &&
> HPAGE_RESV_OWNER" in hugetlb_wp() is about.
> 
> And the original mapping cannot really go away from under us either, as
> remove_inode_hugepages() needs to take the mutex in order to evict it,
> and that is the only path through which counters like resv_huge_pages
> (adjusted in remove_inode_hugepages()->hugetlb_unreserve_pages()) could
> interfere with alloc_hugetlb_folio() from hugetlb_wp().
> 
> So, again, unless I am missing something there is no need for the
> pagecache_folio lock when pagecache_folio != old_folio, let alone the
> need to hold it throughout hugetlb_wp().
> I think we could just look up the cache, and unlock it right away.
> 
> So, the current situation (prior to this patch) is already misleading
> for case 2).
> 
> And comments like:
> 
>  /*
>   * The pagecache_folio has to be unlocked to avoid
>   * deadlock and we won't re-lock it in hugetlb_wp(). The
>   * pagecache_folio could be truncated after being
>   * unlocked. So its state should not be reliable
>   * subsequently.
>   */
> 
> keep carrying on the assumption that we need the lock.
> 
> Now, if the above is true, I would much rather see this reworked (I have
> some ideas I discussed with Peter yesterday) than kept as is.

Yes, just to reply in public: I am also not aware of why the folio lock is
needed, considering hugetlb has the fault mutex.  I'm not sure whether we
should rely more on the fault mutex, but that doesn't sound like an
immediate concern.

It may depend on whether my understanding above is correct.. and only if
so, maybe we could avoid locking the folio completely.
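
For reference, the ABBA ordering in the changelog reduces to the classic
two-lock pattern below. This is a userspace sketch with pthread mutexes,
not the kernel code: A stands in for the hugetlb fault mutex, B for the
pagecache folio's lock, and the two threads mirror process-1/process-2
from the report. With the right (easily reached) scheduling the two
threads wedge exactly as in the trace; dropping B at the marked point
before re-taking A, which is what the patch effectively does, breaks the
cycle.

        #include <pthread.h>

        static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;  /* ~ hugetlb fault mutex */
        static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;  /* ~ pagecache folio lock */

        static void *process1(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&A);         /* (A1) */
                pthread_mutex_lock(&B);         /* (B1) */
                pthread_mutex_unlock(&A);       /* (A2) error path drops the mutex... */
                /* ...but still holds B; the fix is to drop B here as well. */
                pthread_mutex_lock(&A);         /* (A3) waits for process2 */
                pthread_mutex_unlock(&A);
                pthread_mutex_unlock(&B);
                return NULL;
        }

        static void *process2(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&A);         /* (A4) */
                pthread_mutex_lock(&B);         /* (B4) waits for process1 */
                pthread_mutex_unlock(&B);
                pthread_mutex_unlock(&A);
                return NULL;
        }

        int main(void)
        {
                pthread_t t1, t2;

                pthread_create(&t1, NULL, process1, NULL);
                pthread_create(&t2, NULL, process2, NULL);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);
                return 0;
        }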

Thanks,

> 
> Let me also CC David who tends to have a good overview in this.
> 
> -- 
> Oscar Salvador
> SUSE Labs
> 

-- 
Peter Xu

