linux-kernel - Re: [PATCH v5 07/12] khugepaged: add mTHP support

Message-ID: <CAA1CXcDVe9jhjN4FAJq5Sh=kKA4t7OPcuJUqpbVSdo5qtZp1hA@mail.gmail.com>
Date: Thu, 1 May 2025 16:30:02 -0600
From: Nico Pache <npache@...hat.com>
To: Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>
Cc: linux-mm@...ck.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	akpm@...ux-foundation.org, corbet@....net, rostedt@...dmis.org, 
	mhiramat@...nel.org, mathieu.desnoyers@...icios.com, david@...hat.com, 
	baohua@...nel.org, baolin.wang@...ux.alibaba.com, ryan.roberts@....com, 
	willy@...radead.org, peterx@...hat.com, ziy@...dia.com, 
	wangkefeng.wang@...wei.com, usamaarif642@...il.com, sunnanyong@...wei.com, 
	vishal.moola@...il.com, thomas.hellstrom@...ux.intel.com, 
	yang@...amperecomputing.com, kirill.shutemov@...ux.intel.com, 
	aarcange@...hat.com, raquini@...hat.com, dev.jain@....com, 
	anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de, 
	will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org, 
	jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com, hannes@...xchg.org, 
	rientjes@...gle.com, mhocko@...e.com, rdunlap@...radead.org, 
	lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support

On Thu, May 1, 2025 at 6:59 AM Hugh Dickins <hughd@...gle.com> wrote:
>
> On Mon, 28 Apr 2025, Nico Pache wrote:
>
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we dont bailout early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@...hat.com>
>
> There are locking errors in this patch.  Let me comment inline below,
> then at the end append the fix patch I'm using, to keep mm-new usable
> for me. But that's more of an emergency rescue than a recommended fixup:
> I don't much like your approach here, and hope it will change in v6.
Hi Hugh,

Thank you for testing, and providing such a detailed report + fix!

>
> > ---
> >  mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 88 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 6e67db86409a..3a846cd70c66 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte, mthp_pte;
>
> I didn't wait to see the problem, just noticed that in the v4->v5
> update, pte gets used at out_up_write, but there are gotos before
> pte has been initialized. Declare pte = NULL here.

Ah, thanks. I missed that when I moved the PTE unmapping to out_up_write.

>
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     unsigned long _address = address + offset * PAGE_SIZE;
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >               *mmap_locked = false;
> >       }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                             referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > +                             referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> I spent a long time trying to work out why the include/linux/swapops.h:511
> BUG is soon hit - the BUG which tells there's a migration entry left behind
> after its folio has been unlocked.

Can you please share how you reproduce the bug? I haven't been able to
reproduce any BUGs with most of the memory debuggers enabled. This has
also been through a number of mm test suites and some internal
testing, so I'm curious how to hit this issue!
>
> In the patch at the end you'll see that I've inserted a check here, to
> abort if the VMA following the revalidated "vma" is sharing the page table
> (and so also affected by clearing *pmd).  That turned out not to be the
> problem (WARN_ONs inserted never fired in my limited testing), but it still
> looks to me as if some such check is needed.  Or I may be wrong, and
> "revalidate" (a better place for the check) does actually check that, but
> it wasn't obvious, and I haven't spent more time looking at it (but it did
> appear to rule out the case of a VMA before "vma" sharing the page table

Hmm, if I understand this correctly, the check you are adding is
already handled in the PMD case through thp_vma_suitable_order(s)
(by checking that a PMD can fit in the VMA @address). However, I
think you are correct: we must check that the VMA still spans the
whole PMD region when we attempt an mTHP collapse within it. If not,
it's possible that an mremap+mmap might have occurred, allowing a
different mapping to belong to the same PMD. This also makes
supporting mTHP collapse in mappings smaller than a PMD difficult, and
is the reason we haven't pursued it yet: the locking requirements
explode. The simple solution is, as you noted, to only support
<PMD sized collapses with a single VMA in the PMD range; that is a
future change.

I actually realized my revalidation is rather weak... mostly because
I'm passing the PMD address, not the mTHP starting address (see
address vs _address in collapse_huge_page). I think your check would
be handled if I pass it _address. If that was the case the following
check would be performed in thp_vma_suitable_order:

if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
return false;
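
To illustrate the difference between passing address and _address, here
is a minimal userspace sketch of that bounds check. It is not the kernel
code: struct vma_range, range_fits_vma(), mthp_start() and the 4K page /
order-9 PMD constants are assumptions made up for the example; only the
comparison itself is taken from the check quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry for illustration only: 4K pages, 2M PMD (order 9). */
#define EX_PAGE_SIZE       4096UL
#define EX_HPAGE_PMD_ORDER 9

/* Hypothetical stand-in for the two vm_area_struct fields the check reads. */
struct vma_range {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Model of the bounds check: align addr down to the (m)THP size for the
 * given order, then require the whole aligned range to lie inside the VMA.
 */
static bool range_fits_vma(const struct vma_range *vma,
			   unsigned long addr, unsigned int order)
{
	unsigned long hpage_size = EX_PAGE_SIZE << order;
	unsigned long haddr = addr & ~(hpage_size - 1);

	assert(vma->vm_start <= vma->vm_end); /* sanity check on the input */

	if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
		return false;
	return true;
}

/* Same arithmetic as "_address = address + offset * PAGE_SIZE" in the patch. */
static unsigned long mthp_start(unsigned long address, unsigned long offset)
{
	return address + offset * EX_PAGE_SIZE;
}
```

With a VMA covering 0x200000-0x300000, a PMD-aligned address of 0x200000,
order 4 and offset 320, the check passes for address (it only looks at
0x200000-0x210000) but fails for _address = 0x340000, which lies past
vm_end. At order 9 it also fails, because the VMA does not span the full
PMD range.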


>
> >       vma_start_write(vma);
> >       anon_vma_lock_write(vma->anon_vma);
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > +                             _address + (PAGE_SIZE << order));
>
> mmu_notifiers tend to be rather a mystery to me, so I've made no change
> below, but it's not obvious whether it's correct clear the *pmd but only
> notify of clearing a subset of that range: what's outside the range soon
> gets replaced as it was, but is that good enough?  I don't know.
Sadly I don't have a good answer for you, as mmu_notifier is mostly a
mystery to me too. I assume the anon_vma_lock_write would prevent any
others from touching the pages we ARE NOT invalidating here, and that
may be enough. Honestly, I think the anon_vma_lock_write allows us to
remove a lot of the locking (but as I stated, I'm trying to keep the
locking the *same* in this series).
>
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> >       /*
> >        * This removes any huge TLB entry from the CPU so we won't allow
> >        * huge and small TLB entries for the same virtual address to
>
> The line I want to comment on does not appear in this diff context:
>         _pmd = pmdp_collapse_flush(vma, address, pmd);
>
> That is appropriate for a THP occupying the whole range of the page table,
> but is a surprising way to handle an "mTHP" of just some of its ptes:
> I would expect you to be invalidating and replacing just those.

The reason for leaving it in the mTHP case was to prevent the
aforementioned GUP-fast race.
* Parallel GUP-fast is fine since GUP-fast will back off when
* it detects PMD is changed.

I tried to keep the same locking principles as the PMD case: remove
the PMD, hold all the same locks, collapse to (m)THP, reinstall the
PMD. This seemed the easiest way to get it right and allowed me to
focus on adding the feature. The codebase being scattered with
mentions of potential races, and taking locks that might not be
needed, also made me wary.

I know Dev was looking into optimizations for the locking, and I have
some ideas of ways to improve it, or at least clean up some of the
locking.
>
> And that is the cause of the swapops:511 BUGs: "uninvolved" ptes are
> being temporarily hidden, so not found when remove_migration_ptes()
> goes looking for them.
>
> This reliance on pmdp_collapse_flush() can be rescued, with stricter
> locking (comment below); but I don't like it, and notice Jann has
> picked up on it too.  I hope v6 switches to handling ptes by ptes.
I believe the current approach with your anon_vma_unlock_write fix would
suffice (if the GUP race is a valid reason to keep the pmd flush), and
then we can follow up with locking improvements.
>
> > @@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                     &compound_pagelist, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > +                                     &compound_pagelist, order);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
> >               spin_lock(pmd_ptl);
> >               BUG_ON(!pmd_none(*pmd));
> >               /*
> > @@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       anon_vma_unlock_write(vma->anon_vma);
>
> Phew, it's just visible there in the context.  The anon_vma lock is what
> keeps out racing lookups; so, that anon_vma_unlock_write() (and its
> "All pages are isolated and locked" comment) is appropriate in the
> HPAGE_PMD_SIZEd THP case, but has to be left until later for mTHP ptes.
Makes sense!
>
> But the anon_vma lock may well span a much larger range than the pte
> lock, and the pmd lock certainly spans a much larger range than the
> pte lock; so we really prefer to release anon_vma lock and pmd lock
> as soon as is safe, and use pte lock in preference where possible.
>
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        &compound_pagelist, HPAGE_PMD_ORDER);
> > -     pte_unmap(pte);
> > +                                        vma, _address, pte_ptl,
> > +                                        &compound_pagelist, order);
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > -
> > -     _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > -     _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > -     folio_add_lru_vma(folio, vma);
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     set_pmd_at(mm, address, pmd, _pmd);
> > -     update_mmu_cache_pmd(vma, address, pmd);
> > -     deferred_split_folio(folio, false);
> > -     spin_unlock(pmd_ptl);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> > +             _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > +             _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > +             spin_lock(pmd_ptl);
> > +             BUG_ON(!pmd_none(*pmd));
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             set_pmd_at(mm, address, pmd, _pmd);
> > +             update_mmu_cache_pmd(vma, address, pmd);
> > +             deferred_split_folio(folio, false);
> > +             spin_unlock(pmd_ptl);
> > +     } else { /* mTHP collapse */
> > +             mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > +             spin_lock(pmd_ptl);
>
> I haven't changed that, but it is odd: yes, pmd_ptl will be required
> when doing the pmd_populate(), but it serves no purpose here when
> fiddling around with ptes in a disconnected page table.
Yes, I believe a lot of the locking can be improved and some is unnecessary.
>
> > +             folio_ref_add(folio, (1 << order) - 1);
> > +             folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > +             update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +
> > +             smp_wmb(); /* make pte visible before pmd */
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > +             spin_unlock(pmd_ptl);
> > +     }
> >
> >       folio = NULL;
> >
> >       result = SCAN_SUCCEED;
>
> Somewhere around here it becomes safe for mTHP to anon_vma_unlock_write().
>
> >  out_up_write:
> > +     if (pte)
> > +             pte_unmap(pte);
> >       mmap_write_unlock(mm);
> >  out_nolock:
> >       *mmap_locked = false;
> > @@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  {
> >       pmd_t *pmd;
> >       pte_t *pte, *_pte;
> > +     int i;
> >       int result = SCAN_FAIL, referenced = 0;
> >       int none_or_zero = 0, shared = 0;
> >       struct page *page = NULL;
> >       struct folio *folio = NULL;
> >       unsigned long _address;
> > +     unsigned long enabled_orders;
> >       spinlock_t *ptl;
> >       int node = NUMA_NO_NODE, unmapped = 0;
> > +     bool is_pmd_only;
> >       bool writable = false;
> > -
> > +     int chunk_none_count = 0;
> > +     int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> > +     unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
> > +     bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +     bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >       nodes_clear(cc->alloc_nmask);
> > +
> > +     enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +             tva_flags, THP_ORDERS_ALL_ANON);
> > +
> > +     is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> >       if (!pte) {
> >               result = SCAN_PMD_NULL;
> >               goto out;
> >       }
> >
> > -     for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -          _pte++, _address += PAGE_SIZE) {
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             /*
> > +              * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> > +              * there are pages in this chunk keep track of it in the bitmap
> > +              * for mTHP collapsing.
> > +              */
> > +             if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> > +                     if (chunk_none_count <= scaled_none)
> > +                             bitmap_set(cc->mthp_bitmap,
> > +                                        i / KHUGEPAGED_MIN_MTHP_NR, 1);
> > +
> > +                     chunk_none_count = 0;
> > +             }
> > +
> > +             _pte = pte + i;
> > +             _address = address + i * PAGE_SIZE;
> >               pte_t pteval = ptep_get(_pte);
> >               if (is_swap_pte(pteval)) {
> >                       ++unmapped;
> > @@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                       }
> >               }
> >               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                     ++chunk_none_count;
> >                       ++none_or_zero;
> >                       if (!userfaultfd_armed(vma) &&
> > -                         (!cc->is_khugepaged ||
> > -                          none_or_zero <= khugepaged_max_ptes_none)) {
> > +                         (!cc->is_khugepaged || !is_pmd_only ||
> > +                             none_or_zero <= khugepaged_max_ptes_none)) {
> >                               continue;
> >                       } else {
> >                               result = SCAN_EXCEED_NONE_PTE;
> > @@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                                                                    address)))
> >                       referenced++;
> >       }
> > +
> >       if (!writable) {
> >               result = SCAN_PAGE_RO;
> >       } else if (cc->is_khugepaged &&
> > @@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> > -             result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > +             result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > +                            mmap_locked, enabled_orders);
> > +             if (result > 0)
> > +                     result = SCAN_SUCCEED;
> > +             else
> > +                     result = SCAN_FAIL;
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > --
> > 2.48.1
>
> Fixes to 07/12 "khugepaged: add mTHP support".
> But I see now that the first hunk is actually not to this 07/12, but to
> 05/12 "khugepaged: generalize __collapse_huge_page_* for mTHP support":
> the mTHP check added in __collapse_huge_page_swapin() forgets to unmap
> and unlock before returning, causing RCU imbalance warnings and lockups.
> I won't separate it out here, let me leave that to you.
Thanks, I'll get both of these issues fixed!
>
> And I had other fixes to v4, which you've fixed differently in v5,
> I haven't looked up which patch: where khugepaged_collapse_single_pmd()
> does mmap_read_(un)lock() around collapse_pte_mapped_thp().  I dislike
> your special use of result SCAN_ANY_PROCESS there, because mmap_locked

This is what the revalidate func is doing with
khugepaged_test_exit_or_disable, so I tried to keep it consistent.
Sadly this was the cleanest solution I could come up with. If you don't
mind sharing what your solution to this was, I'd be happy to take a
look!

> is precisely the tool for that job, so just lock and unlock without
> setting *mmap_locked true (but I'd agree that mmap_locked is confusing,
> and offhand wouldn't want to assert exactly what it means - does it
> mean that mmap lock was *never* dropped, so "vma" is safe without
> revalidation? depends on where it's used perhaps).
I agree its use is confusing, and I like to think of it as ONLY the
indicator of it being locked/unlocked. I'm not sure if there were other
intended uses.
>
> Hugh
>
> ---
>  mm/khugepaged.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c1c637dbcb81..2c814c239d65 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1054,6 +1054,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>
>                 /* Dont swapin for mTHP collapse */
>                 if (order != HPAGE_PMD_ORDER) {
> +                       pte_unmap(pte);
> +                       mmap_read_unlock(mm);
>                         result = SCAN_EXCEED_SWAP_PTE;
>                         goto out;
>                 }
> @@ -1136,7 +1138,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
> -       pte_t *pte, mthp_pte;
> +       pte_t *pte = NULL, mthp_pte;
>         pgtable_t pgtable;
>         struct folio *folio;
>         spinlock_t *pmd_ptl, *pte_ptl;
> @@ -1208,6 +1210,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>
> +       if (vma->vm_end < address + HPAGE_PMD_SIZE) {
> +               struct vm_area_struct *next_vma = find_vma(mm, vma->vm_end);
> +               /*
> +                * We must not clear *pmd if it is used by the following VMA.
> +                * Well, perhaps we could if it, and all following VMAs using
> +                * this same page table, share the same anon_vma, and so are
> +                * locked out together: but keep it simple for now (and this
> +                * code might better belong in hugepage_vma_revalidate()).
> +                */
> +               if (next_vma && next_vma->vm_start < address + HPAGE_PMD_SIZE) {
> +                       result = SCAN_ADDRESS_RANGE;
> +                       goto out_up_write;
> +               }
> +       }
> +
>         vma_start_write(vma);
>         anon_vma_lock_write(vma->anon_vma);
>
> @@ -1255,15 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
>         /*
>          * All pages are isolated and locked so anon_vma rmap
> -        * can't run anymore.
> -        */
> -       anon_vma_unlock_write(vma->anon_vma);
> +        * can't run anymore - IF the entire extent has been isolated.
> +        * anon_vma lock may cover a large area: better unlock a.s.a.p.
> +        */
> +       if (order == HPAGE_PMD_ORDER)
> +               anon_vma_unlock_write(vma->anon_vma);
>
>         result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>                                            vma, _address, pte_ptl,
>                                            &compound_pagelist, order);
>         if (unlikely(result != SCAN_SUCCEED))
> -               goto out_up_write;
> +               goto out_unlock_anon_vma;
>
>         /*
>          * The smp_wmb() inside __folio_mark_uptodate() ensures the
> @@ -1304,6 +1323,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         folio = NULL;
>
>         result = SCAN_SUCCEED;
> +out_unlock_anon_vma:
> +       if (order != HPAGE_PMD_ORDER)
> +               anon_vma_unlock_write(vma->anon_vma);
>  out_up_write:
>         if (pte)
>                 pte_unmap(pte);
> --
> 2.43.0
>