Message-ID: <eac3e28c-974d-4bd5-a8f5-ec703295f624@lucifer.local>
Date: Wed, 4 Feb 2026 12:55:24 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: ziy@...dia.com, Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...nel.org>, linux-mm@...ck.org,
        hannes@...xchg.org, riel@...riel.com, shakeel.butt@...ux.dev,
        kas@...nel.org, baohua@...nel.org, dev.jain@....com,
        baolin.wang@...ux.alibaba.com, npache@...hat.com,
        Liam.Howlett@...cle.com, ryan.roberts@....com, vbabka@...e.cz,
        lance.yang@...ux.dev, linux-kernel@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support

On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>
>
> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
> > I think I'm going to have to do several passes on this, so this is just a
> > first one :)
> >
>
> Thanks! Really appreciate the reviews!

No worries!

>
> One thing over here is the higher level design decision when it comes to migration
> of 1G pages. As Zi said in [1]:
> "I also wonder what the purpose of PUD THP migration can be.
> It does not create memory fragmentation, since it is the largest folio size
> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>
> > On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> >> For page table management, PUD THPs need to pre-deposit page tables
> >> that will be used when the huge page is later split. When a PUD THP
> >> is allocated, we cannot know in advance when or why it might need to
> >> be split (COW, partial unmap, reclaim), but we need page tables ready
> >> for that eventuality. Similar to how PMD THPs deposit a single PTE
> >> table, PUD THPs deposit a PMD table which itself contains deposited
> >> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> >> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> >> deposited PMD.
> >
> > This feels like you're hacking this support in, honestly. The list_head
> > abuse only adds to that feeling.
> >
>
> Yeah so I hope turning it into something like [2] is the way forward.

Right, that's one option, though David suggested avoiding this altogether by
only pre-allocating PTEs?

>
> > And are we now not required to store rather a lot of memory to keep all of
> > this coherent?
>
> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
> doesn't fail.
>
> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
> time so that split doesn't fail due to there not being enough memory.
> It's not great, but it's not that bad either.
> The alternative is to allocate this at split time and so we are not
> pre-reserving them. Now there is a chance that allocation and therefore split
> fails, so the tradeoff is some memory vs reliability. This patch favours
> reliability.

That's a significant amount of unmovable, unreclaimable memory though. Going
from 4K to 2M is a pretty huge uptick.
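
(To spell out the arithmetic, assuming 4K base pages: each 1G PUD THP pre-deposits
1 PMD table plus 512 PTE tables, i.e. 4K + 512 * 4K ~= 2M + 4K of pinned page-table
memory per THP, versus a single 4K PTE table per PMD THP.)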

>
> Let's say a user gets 100x1G THPs. They would end up using ~200M for it.
> I think that is ok-ish. If the user has 100G, 200M might not be an issue
> for them :)

But there's more than one user on boxes big enough for this, so this makes me
think we want this to be somehow opt-in right?

And that means we're incurring an unmovable memory penalty, the kind which we're
trying to avoid in general elsewhere in the kernel.

>
> >
> >>
> >> The deposited PMD tables are stored as a singly-linked stack using only
> >> page->lru.next as the link pointer. A doubly-linked list using the
> >> standard list_head mechanism would cause memory corruption: list_del()
> >> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> >> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> >> tables have their own deposited PTE tables stored in pmd_huge_pte,
> >> poisoning lru.prev would corrupt the PTE table list and cause crashes
> >> when withdrawing PTE tables during split. PMD THPs don't have this
> >> problem because their deposited PTE tables don't have sub-deposits.
> >> Using only lru.next avoids the overlap entirely.
> >
> > Yeah this is horrendous and a hack, I don't consider this at all
> > upstreamable.
> >
> > You need to completely rework this.
>
> Hopefully [2] is the path forward!

Ack

> >
> >>
> >> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> >> have. The page_vma_mapped_walk() function is extended to recognize and
> >> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> >> flag tells the unmap path to split PUD THPs before proceeding, since
> >> there is no PUD-level migration entry format - the split converts the
> >> single PUD mapping into individual PTE mappings that can be migrated
> >> or swapped normally.
> >
> > Individual PTE... mappings? You need to be a lot clearer here, page tables
> > are naturally confusing with entries vs. tables.
> >
> > Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
> > tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
> >
>
> Yes that is correct, Thanks! I will change the commit message in the next revision
> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
> 1 PUD entry and 262,144 PTE entries.

Yeah :) my concerns remain :)
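
(For the record: with 4K base pages that's 512 PMD entries covering 2M each, and
512 * 512 = 262,144 PTE entries covering 4K each, for the 1G region once fully
split.)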

>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> >
> > How does this change interact with existing DAX/VFIO code, which now it
> > seems will be subject to the mechanisms you introduce here?
>
> I think what you mean here is the change in try_to_migrate_one?
>
>
> So one

Unfinished sentence? :P

No, I mean: currently we support 1G THP for DAX/VFIO, right? So how does this
interplay with how that currently works? Does that change how DAX/VFIO works?
Will it impact existing users?

Or are we extending the existing mechanism?

>
> >
> > Right now DAX/VFIO is only obtainable via a specially THP-aligned
> > get_unmapped_area() + then can only be obtained at fault time.
> >
> > Is that the intent here also?
> >
>
> Ah thanks for pointing this out. This is something the series is missing.
>
> What I did in the selftest and benchmark was fault on an address that was already aligned.
> i.e. basically call the below function before faulting in.
>
> static inline void *pud_align(void *addr)
> {
> 	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
> }

Right yeah :)
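
(Side note: that round-up is the usual power-of-two alignment - the same thing the
kernel's ALIGN() macro does with PUD_SIZE.)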

>
>
> What I think you are suggesting this series is missing is the below diff? (it's untested).
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87b2c21df4a49..461158a0840db 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>         unsigned long ret;
>         loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>
> +       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
> +               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
> +               if (ret)
> +                       return ret;
> +       }

No, not that - that's going to cause issues; see commit d4148aeab4 for details as
to why this can go wrong.

I mean in __get_unmapped_area(), where the current 'if the length is PMD-size
aligned then align the area' logic lives - something like that.
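
Very roughly, and untested (the exact shape of that branch may differ in your
tree), something like:

        /*
         * In __get_unmapped_area(), next to the existing "length is
         * PMD_SIZE-aligned, so try a THP-aligned area" branch:
         */
        if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
            && !file && !addr /* no hint */
            && IS_ALIGNED(len, PUD_SIZE))
                /* with the THP helper taught to pick PUD_SIZE for such lengths */
                addr = thp_get_unmapped_area_vmflags(file, addr, len, pgoff,
                                                     flags, vm_flags);

i.e. gate on the length up there, the same way the PMD case does, rather than
unconditionally inside thp_get_unmapped_area_vmflags().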

> +
>
>
> > What is your intent - that khugepaged do this, or on alloc? How does it
> > interact with MADV_COLLAPSE?
> >
>
> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THP
> page faults at the start. If there is data supporting that khugepaged will work
> then we do it, but until then we keep it disabled.

Yes, I think khugepaged is probably never going to be all that good an idea with
this.

>
> > I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
> > alters __thp_vma_allowable_orders() behaviour, that change belongs here...
> >
> >
>
> Thanks for this! I only tried to split this code into logical commits
> after the whole thing was working. Some things are tightly coupled
> and I would need to move them to the right commit.

Yes, there's a bunch of things that need tweaking here - to reiterate, let's try
to pay down technical debt here and avoid copy/pasting :>)

>
> >> ---
> >>  include/linux/huge_mm.h  |  5 +++
> >>  include/linux/mm.h       | 19 ++++++++
> >>  include/linux/mm_types.h |  5 ++-
> >>  include/linux/pgtable.h  |  8 ++++
> >>  include/linux/rmap.h     |  7 ++-
> >>  mm/huge_memory.c         |  8 ++++
> >>  mm/internal.h            |  3 ++
> >>  mm/page_vma_mapped.c     | 35 +++++++++++++++
> >>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
> >>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
> >>  10 files changed, 260 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index a4d9f964dfdea..e672e45bb9cc7 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address);
> >>
> >>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address);
> >>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >>  		    unsigned long cp_flags);
> >>  #else
> >> +static inline void
> >> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +		      unsigned long address) {}
> >>  static inline int
> >>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index ab2e7e30aef96..a15e18df0f771 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
> >>   * considered ready to switch to split PUD locks yet; there may be places
> >>   * which need to be converted from page_table_lock.
> >>   */
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +static inline struct page *pud_pgtable_page(pud_t *pud)
> >> +{
> >> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> >> +
> >> +	return virt_to_page((void *)((unsigned long)pud & mask));
> >> +}
> >> +
> >> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> >> +{
> >> +	return page_ptdesc(pud_pgtable_page(pud));
> >> +}
> >> +
> >> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> >> +#endif
> >> +
> >>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
> >>  {
> >>  	return &mm->page_table_lock;
> >> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
> >>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
> >>  {
> >>  	__pagetable_ctor(ptdesc);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +	ptdesc->pud_huge_pmd = NULL;
> >> +#endif
> >>  }
> >>
> >>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 78950eb8926dc..26a38490ae2e1 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -577,7 +577,10 @@ struct ptdesc {
> >>  		struct list_head pt_list;
> >>  		struct {
> >>  			unsigned long _pt_pad_1;
> >> -			pgtable_t pmd_huge_pte;
> >> +			union {
> >> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> >> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> >> +			};
> >>  		};
> >>  	};
> >>  	unsigned long __page_mapping;
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> >>  #define arch_needs_pgtable_deposit() (false)
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +					   pmd_t *pmd_table);
> >> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> >> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> >> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> >
> > These are useless externs.
> >
>
>
> ack
>
> These are coming from the existing functions from the file:
> extern void pgtable_trans_huge_deposit
> extern pgtable_t pgtable_trans_huge_withdraw
>
> I think the externs can be removed from these as well? We can
> fix those in a separate patch.

Generally the approach is to drop the externs as part of adding/changing stuff
anyway - otherwise we get completely useless churn and annoying git history
changes.
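
(i.e. the new declarations can simply be, minus the keyword:

        void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
                                            pmd_t *pmd_table);
        pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);

same prototypes as in the patch, just without 'extern'.)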

>
>
> >> +#endif
> >> +
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  /*
> >>   * This is an implementation of pmdp_establish() that is only suitable for an
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index daa92a58585d9..08cd0a0eb8763 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -101,6 +101,7 @@ enum ttu_flags {
> >>  					 * do a final flush if necessary */
> >>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> >>  					 * caller holds it */
> >> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
> >>  };
> >>
> >>  #ifdef CONFIG_MMU
> >> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
> >>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
> >>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> >> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> >>  		unsigned long address, rmap_t flags);
> >>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
> >>  	pgoff_t pgoff;
> >>  	struct vm_area_struct *vma;
> >>  	unsigned long address;
> >> +	pud_t *pud;
> >>  	pmd_t *pmd;
> >>  	pte_t *pte;
> >>  	spinlock_t *ptl;
> >> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> >>  static inline void
> >>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  {
> >> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> >> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
> >>
> >>  	if (likely(pvmw->ptl))
> >>  		spin_unlock(pvmw->ptl);
> >> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  		WARN_ON_ONCE(1);
> >>
> >>  	pvmw->ptl = NULL;
> >> +	pvmw->pud = NULL;
> >>  	pvmw->pmd = NULL;
> >>  	pvmw->pte = NULL;
> >>  }
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 40cf59301c21a..3128b3beedb0a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  	spin_unlock(ptl);
> >>  	mmu_notifier_invalidate_range_end(&range);
> >>  }
> >> +
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address)
> >> +{
> >> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> >> +	if (pud_trans_huge(*pud))
> >> +		__split_huge_pud_locked(vma, pud, address);
> >> +}
> >>  #else
> >>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address)
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 9ee336aa03656..21d5c00f638dc 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
> >>   * in mm/rmap.c:
> >>   */
> >>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> >> +#endif
> >>
> >>  /*
> >>   * in mm/page_alloc.c
> >> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >> index b38a1d00c971b..d31eafba38041 100644
> >> --- a/mm/page_vma_mapped.c
> >> +++ b/mm/page_vma_mapped.c
> >> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >>  	return true;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> >> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >> +{
> >> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> >> +		return false;
> >> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> >> +		return false;
> >> +	return true;
> >> +}
> >> +#endif
> >> +
> >>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
> >>  {
> >>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> >> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  	pud_t *pud;
> >>  	pmd_t pmde;
> >>
> >> +	/* The only possible pud mapping has been handled on last iteration */
> >> +	if (pvmw->pud && !pvmw->pmd)
> >> +		return not_found(pvmw);
> >> +
> >>  	/* The only possible pmd mapping has been handled on last iteration */
> >>  	if (pvmw->pmd && !pvmw->pte)
> >>  		return not_found(pvmw);
> >> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  			continue;
> >>  		}
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Said it elsewhere, but it's really weird to treat an arch having the
> > ability to do something as a go ahead for doing it.
> >
> >> +		/* Check for PUD-mapped THP */
> >> +		if (pud_trans_huge(*pud)) {
> >> +			pvmw->pud = pud;
> >> +			pvmw->ptl = pud_lock(mm, pud);
> >> +			if (likely(pud_trans_huge(*pud))) {
> >> +				if (pvmw->flags & PVMW_MIGRATION)
> >> +					return not_found(pvmw);
> >> +				if (!check_pud(pud_pfn(*pud), pvmw))
> >> +					return not_found(pvmw);
> >> +				return true;
> >> +			}
> >> +			/* PUD was split under us, retry at PMD level */
> >> +			spin_unlock(pvmw->ptl);
> >> +			pvmw->ptl = NULL;
> >> +			pvmw->pud = NULL;
> >> +		}
> >> +#endif
> >> +
> >
> > Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
> > modifications :)
> >
>
> Yeah there is repeated code in multiple places, where all I did was replace
> the PMD handling with its PUD equivalent. In a lot of places it's actually
> difficult to not repeat the code (unless we want function macros, which is
> much worse IMO).

Not if we actually refactor the existing code :)

When I wanted to make functional changes to mremap I took a lot of time to
refactor the code into something sane before even starting that.

Because I _could_ have added the features to the code as it was, but it would
have been hellish to do so and would have added more confusion etc.

So yeah, I think a similar mentality is needed for this change.

>
> >
> >>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
> >>  		/*
> >>  		 * Make sure the pmd value isn't cached in a register by the
> >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> >> index d3aec7a9926ad..2047558ddcd79 100644
> >> --- a/mm/pgtable-generic.c
> >> +++ b/mm/pgtable-generic.c
> >> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> >>  }
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Deposit page tables for PUD THP.
> >> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> >> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> >> + *
> >> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> >> + * list_head. This is because lru.prev (offset 16) overlaps with
> >> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> >> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
> >
> > This is horrible and feels like a hack? Treating a doubly-linked list as a
> > singly-linked one like this is not upstreamable.
> >
> >> + *
> >> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> >> + */
> >> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +				    pmd_t *pmd_table)
> >
> > This is horrid - you're depositing the PMD using the... questionable
> > list_head abuse, but then also have pud_deposit_pte()... But here we're
> > depositing a PMD shouldn't the name reflect that?
> >
> >> +{
> >> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	/* Push onto stack using only lru.next as the link */
> >> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> >
> > Yikes...
> >
> >> +	pud_huge_pmd(pudp) = pmd_page;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw the deposited PMD table for PUD THP split or zap.
> >> + * Called with PUD lock held.
> >> + * Returns NULL if no more PMD tables are deposited.
> >> + */
> >> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> >> +{
> >> +	pgtable_t pmd_page;
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	pmd_page = pud_huge_pmd(pudp);
> >> +	if (!pmd_page)
> >> +		return NULL;
> >> +
> >> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> >> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> >
> > Where's the popping? You're just assigning here.
>
>
> Ack on all of the above. Hopefully [1] is better.

Thanks!

> >
> >> +
> >> +	return page_address(pmd_page);
> >> +}
> >> +
> >> +/*
> >> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> >> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> >> + * No lock assertion since the PMD isn't visible yet.
> >> + */
> >> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +
> >> +	/* FIFO - add to front of list */
> >> +	if (!ptdesc->pmd_huge_pte)
> >> +		INIT_LIST_HEAD(&pgtable->lru);
> >> +	else
> >> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> >> +	ptdesc->pmd_huge_pte = pgtable;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw a PTE table from a standalone PMD table.
> >> + * Returns NULL if no more PTE tables are deposited.
> >> + */
> >> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +	pgtable_t pgtable;
> >> +
> >> +	pgtable = ptdesc->pmd_huge_pte;
> >> +	if (!pgtable)
> >> +		return NULL;
> >> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> >> +							struct page, lru);
> >> +	if (ptdesc->pmd_huge_pte)
> >> +		list_del(&pgtable->lru);
> >> +	return pgtable;
> >> +}
> >> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >> +
> >>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
> >>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> >>  		     pmd_t *pmdp)
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 7b9879ef442d9..69acabd763da4 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >>  	return pmd;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> >> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> >> + * represents.
> >> + */
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
> >
> > This series seems to be full of copy/paste.
> >
> > It's just not acceptable given the state of THP code as I said in reply to
> > the cover letter - you need to _refactor_ the code.
> >
> > The code is bug-prone and difficult to maintain as-is, your series has to
> > improve the technical debt, not add to it.
> >
>
> In some cases we might not be able to avoid the copy, but this is definitely
> a place where we don't need to. I will change it here. Thanks!

I disagree, see above :) But thanks on this one
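
e.g. (rough, untested sketch) mm_find_pmd() could be layered on top of the new
mm_find_pud() instead of duplicating the pgd/p4d walk:

pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
{
        pgd_t *pgd = pgd_offset(mm, address);
        p4d_t *p4d;

        if (!pgd_present(*pgd))
                return NULL;

        p4d = p4d_offset(pgd, address);
        if (!p4d_present(*p4d))
                return NULL;

        /* No guarantees on what the pud_t* represents, as with mm_find_pmd(). */
        return pud_offset(p4d, address);
}

pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
        pud_t *pud = mm_find_pud(mm, address);

        if (!pud || !pud_present(*pud))
                return NULL;

        return pmd_offset(pud, address);
}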

>
> >> +{
> >> +	pgd_t *pgd;
> >> +	p4d_t *p4d;
> >> +	pud_t *pud = NULL;
> >> +
> >> +	pgd = pgd_offset(mm, address);
> >> +	if (!pgd_present(*pgd))
> >> +		goto out;
> >> +
> >> +	p4d = p4d_offset(pgd, address);
> >> +	if (!p4d_present(*p4d))
> >> +		goto out;
> >> +
> >> +	pud = pud_offset(p4d, address);
> >> +out:
> >> +	return pud;
> >> +}
> >> +#endif
> >> +
> >>  struct folio_referenced_arg {
> >>  	int mapcount;
> >>  	int referenced;
> >> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >>  			SetPageAnonExclusive(page);
> >>  			break;
> >>  		case PGTABLE_LEVEL_PUD:
> >> -			/*
> >> -			 * Keep the compiler happy, we don't support anonymous
> >> -			 * PUD mappings.
> >> -			 */
> >> -			WARN_ON_ONCE(1);
> >> +			SetPageAnonExclusive(page);
> >>  			break;
> >>  		default:
> >>  			BUILD_BUG();
> >> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
> >>  #endif
> >>  }
> >>
> >> +/**
> >> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> >> + * @folio:	The folio to add the mapping to
> >> + * @page:	The first page to add
> >> + * @vma:	The vm area in which the mapping is added
> >> + * @address:	The user virtual address of the first page to map
> >> + * @flags:	The rmap flags
> >> + *
> >> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> >> + *
> >> + * The caller needs to hold the page table lock, and the page must be locked in
> >> + * the anon_vma case: to serialize mapping,index checking after setting.
> >> + */
> >> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> >> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> >> +{
> >> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> >> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> >> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> >> +			      PGTABLE_LEVEL_PUD);
> >> +#else
> >> +	WARN_ON_ONCE(true);
> >> +#endif
> >> +}
> >
> > More copy/paste... Maybe unavoidable in this case, but be good to try.
> >
> >> +
> >>  /**
> >>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
> >>   * @folio:	The folio to add the mapping to.
> >> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>  		}
> >>
> >>  		if (!pvmw.pte) {
> >> +			/*
> >> +			 * Check for PUD-mapped THP first.
> >> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> >> +			 * split the PUD to PMD level and restart the walk.
> >> +			 */
> >
> > This is literally describing the code below, it's not useful.
>
> Ack, Will remove this comment, Thanks!

Thanks

> >
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> >> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> >> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> >> +					page_vma_mapped_walk_restart(&pvmw);
> >> +					continue;
> >> +				}
> >> +			}
> >> +
> >>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> >>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> >>  					goto walk_done;
> >> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> >>  	mmu_notifier_invalidate_range_start(&range);
> >>
> >>  	while (page_vma_mapped_walk(&pvmw)) {
> >> +		/* Handle PUD-mapped THP first */
> >
> > How did/will this interact with DAX, VFIO PUD THP?
>
> It won't interact with DAX. try_to_migrate() does the below and just returns:
>
> 	if (folio_is_zone_device(folio) &&
> 	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
> 		return;
>
> so DAX would never reach here.

Hmm, does folio_is_zone_device() always return true for DAX?

Also, that's just one rmap call, right?

>
> I think vfio pages are pinned and therefore can't be migrated? (I have
> not looked at vfio code, I will try to get a better understanding tomorrow,
> but please let me know if that sounds wrong.)

OK, I've not dug into this either - please do check, and it would be really good
to test this code against actual DAX/VFIO scenarios if you can find a way to do
that, thanks!

>
>
> >
> >> +		if (!pvmw.pte && !pvmw.pmd) {
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Won't pud_trans_huge() imply this...
> >
>
> Agreed, I think it should cover it.

Thanks!

>
> >> +			/*
> >> +			 * PUD-mapped THP: skip migration to preserve the huge
> >> +			 * page. Splitting would defeat the purpose of PUD THPs.
> >> +			 * Return false to indicate migration failure, which
> >> +			 * will cause alloc_contig_range() to try a different
> >> +			 * memory region.
> >> +			 */
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				page_vma_mapped_walk_done(&pvmw);
> >> +				ret = false;
> >> +				break;
> >> +			}
> >> +#endif
> >> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> >> +			page_vma_mapped_walk_done(&pvmw);
> >> +			break;
> >> +		}
> >> +
> >>  		/* PMD-mapped THP migration entry */
> >>  		if (!pvmw.pte) {
> >>  			__maybe_unused unsigned long pfn;
> >> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
> >>
> >>  	/*
> >>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> >> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >>  	 */
> >>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> >> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> >> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
> >>  		return;
> >>
> >>  	if (folio_is_zone_device(folio) &&
> >> --
> >> 2.47.3
> >>
> >
> > This isn't a final review, I'll have to look more thoroughly through here
> > over time and you're going to have to be patient in general :)
> >
> > Cheers, Lorenzo
>
>
> Thanks for the review, this is awesome!

Ack, will do more when I have time, and obviously you're getting a lot of input
from others too.

Would be good to get a summary at the next THP cabal ;)

>
>
> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>
>
>

cheers, Lorenzo
