Message-ID: <a95971cb-bd90-48de-8b8f-2d6d1f13085a@gmail.com>
Date: Wed, 4 Feb 2026 22:40:11 -0800
From: Usama Arif <usamaarif642@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: ziy@...dia.com, Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>, linux-mm@...ck.org,
hannes@...xchg.org, riel@...riel.com, shakeel.butt@...ux.dev,
kas@...nel.org, baohua@...nel.org, dev.jain@....com,
baolin.wang@...ux.alibaba.com, npache@...hat.com, Liam.Howlett@...cle.com,
ryan.roberts@....com, vbabka@...e.cz, lance.yang@...ux.dev,
linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
On 04/02/2026 04:55, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>>
>>
>> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
>>> I think I'm going to have to do several passes on this, so this is just a
>>> first one :)
>>>
>>
>> Thanks! Really appreciate the reviews!
>
> No worries!
>
>>
>> One thing over here is the higher level design decision when it comes to migration
>> of 1G pages. As Zi said in [1]:
>> "I also wonder what the purpose of PUD THP migration can be.
>> It does not create memory fragmentation, since it is the largest folio size
>> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>>
>>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>>> For page table management, PUD THPs need to pre-deposit page tables
>>>> that will be used when the huge page is later split. When a PUD THP
>>>> is allocated, we cannot know in advance when or why it might need to
>>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>>> deposited PMD.
>>>
>>> This feels like you're hacking this support in, honestly. The list_head
>>> abuse only adds to that feeling.
>>>
>>
>> Yeah so I hope turning it to something like [2] is the way forward.
>
> Right, that's one option, though David suggested avoiding this altogether by
> only pre-allocating PTEs?
Maybe I don't understand it properly, but that won't work, right?
You need 1 PMD table and 512 PTE tables to split a PUD; you can't get by with just PTE tables, right?
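Just to make the arithmetic concrete (rough sketch assuming 4K pages on x86-64;
the macro names are only for illustration, they are not in the series):

	/*
	 * Splitting one 1G PUD THP all the way down to PTEs needs:
	 *   - 1 PMD table:    512 pmd_t entries covering the 1G range
	 *   - 512 PTE tables: one per PMD entry, 512 pte_t entries each
	 * i.e. 513 page-table pages spanning 262,144 PTE entries.
	 */
	#define NR_PTE_TABLES_PER_PUD	(PUD_SIZE / PMD_SIZE)			/* 512 */
	#define NR_PTE_ENTRIES_PER_PUD	(NR_PTE_TABLES_PER_PUD * PTRS_PER_PTE)	/* 262144 */

With only PTE tables pre-allocated there would be no PMD table to install them
into once the PUD entry is torn down.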
>
>>
>>> And are we now not required to store rather a lot of memory to keep all of
>>> this coherent?
>>
>> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
>> doesnt fail.
>>
>> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
>> time so that split doesnt fail due to there not being enough memory.
>> Its not great, but its not bad as well.
>> The alternative is to allocate this at split time and so we are not
>> pre-reserving them. Now there is a chance that allocation and therefore split
>> fails, so the tradeoff is some memory vs reliability. This patch favours
>> reliability.
>
> That's a significant amount of unmovable, unreclaimable memory though. Going
> from 4K to 2M is a pretty huge uptick.
>
Yeah, I don't like it either, but it's the only way to make sure the split doesn't fail.
I think there will always be a tradeoff between reliability and memory.
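For reference, the back-of-the-envelope cost per 1G PUD THP (4K pages) is:

	512 PTE tables * 4K = 2M
	+  1 PMD table * 4K = 4K
	--------------------------
	~2M of pre-deposited, unmovable page tables per 1G THP (~0.2% overhead)

which is where the ~200M for 100x1G THPs below comes from.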
>>
>> Lets say a user gets 100x1G THPs. They would end up using ~200M for it.
>> I think that is okish. If the user has 100G, 200M might not be an issue
>> for them :)
>
> But there's more than one user on boxes big enough for this, so this makes me
> think we want this to be somehow opt-in right?
>
Do you mean madvise?
Another idea (though probably a very bad one :)) is a dedicated MADV_HUGEPAGE_1G.
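To make the opt-in question concrete, a userspace sketch (MADV_HUGEPAGE_1G is
purely hypothetical, it is just the idea above):

	/* today: per-range opt-in that 1G THPs could piggyback on */
	madvise(addr, len, MADV_HUGEPAGE);

	/* vs. a dedicated 1G-only knob (does not exist today) */
	madvise(addr, len, MADV_HUGEPAGE_1G);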
> And that means we're incurring an unmovable memory penalty, the kind which we're
> trying to avoid in general elsewhere in the kernel.
>
ack.
>>
>>>
>>>>
>>>> The deposited PMD tables are stored as a singly-linked stack using only
>>>> page->lru.next as the link pointer. A doubly-linked list using the
>>>> standard list_head mechanism would cause memory corruption: list_del()
>>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>>> problem because their deposited PTE tables don't have sub-deposits.
>>>> Using only lru.next avoids the overlap entirely.
>>>
>>> Yeah this is horrendous and a hack, I don't consider this at all
>>> upstreamable.
>>>
>>> You need to completely rework this.
>>
>> Hopefully [2] is the path forward!
>
> Ack
>
>>>
>>>>
>>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>>> there is no PUD-level migration entry format - the split converts the
>>>> single PUD mapping into individual PTE mappings that can be migrated
>>>> or swapped normally.
>>>
>>> Individual PTE... mappings? You need to be a lot clearer here, page tables
>>> are naturally confusing with entries vs. tables.
>>>
>>> Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
>>> tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
>>>
>>
>> Yes that is correct, Thanks! I will change the commit message in the next revision
>> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
>> 1 PUD entry and 262,144 PTE entries.
>
> Yeah :) my concerns remain :)
>
>>
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>>
>>> How does this change interact with existing DAX/VFIO code, which now it
>>> seems will be subject to the mechanisms you introduce here?
>>
>> I think what you mean here is the change in try_to_migrate_one?
>>
>>
>> So one
>
> Unfinished sentence? :P
>
> No I mean currently we support 1G THP for DAX/VFIO right? So how does this
> interplay with how that currently works? Does that change how DAX/VFIO works?
> Will that impact existing users?
>
> Or are we extending the existing mechanism?
>
A lot of the code (copy_huge_pud, zap_huge_pud, __split_huge_pud_locked,
create_huge_pud, wp_huge_pud, ...) is gated by a vma_is_anonymous() check. I will
try to do a more thorough audit of DAX and VFIO.
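Roughly the shape of the gating I mean, paraphrased from memory (the anon helper
name below is illustrative, it stands in for whatever the series wires up there):

	static vm_fault_t create_huge_pud(struct vm_fault *vmf)
	{
		/* new in this series: anonymous VMAs take the anon PUD THP path */
		if (vma_is_anonymous(vmf->vma))
			return do_huge_pud_anonymous_page(vmf);	/* illustrative name */
		/* existing behaviour: DAX/VFIO style mappings keep using ->huge_fault() */
		if (vmf->vma->vm_ops && vmf->vma->vm_ops->huge_fault)
			return vmf->vma->vm_ops->huge_fault(vmf, PUD_ORDER);
		return VM_FAULT_FALLBACK;
	}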
>>
>>>
>>> Right now DAX/VFIO is only obtainable via a specially THP-aligned
>>> get_unmapped_area() + then can only be obtained at fault time.
>>> Is that the intent here also?
>>>
>>
>> Ah thanks for pointing this out. This is something the series is missing.
>>
>> What I did in the selftest and benchmark was fault on an address that was already aligned.
>> i.e. basically call the below function before faulting in.
>>
>> static inline void *pud_align(void *addr)
>> {
>> return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
>> }
>
> Right yeah :)
>
>>
>>
>> What I think you are suggesting this series is missing is the below diff? (its untested).
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 87b2c21df4a49..461158a0840db 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>> unsigned long ret;
>> loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>>
>> + if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
>> + ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
>> + if (ret)
>> + return ret;
>> + }
>
> No not that, that's going to cause issues, see commit d4148aeab4 for details as
> to why this can go wrong.
>
> In __get_unmapped_area() where the current 'if PMD size aligned then align area'
> logic, like that.
Ack, thanks for pointing to this. I will also add another selftest to check that we
actually get the alignment from __get_unmapped_area() when we don't use the pud_align()
trick I currently have in the selftests.
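Sketch of what that selftest could check (illustrative only, assuming a 1G PUD_SIZE):

	#include <stdio.h>
	#include <sys/mman.h>

	#define PUD_SIZE (1UL << 30)

	int main(void)
	{
		/* no address hint, no MAP_FIXED: rely purely on get_unmapped_area() */
		size_t len = 2 * PUD_SIZE;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* expect a PUD-aligned address if large anon mappings get the hint */
		printf("%saligned\n", ((unsigned long)p & (PUD_SIZE - 1)) ? "not " : "");
		munmap(p, len);
		return 0;
	}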
>
>> +
>>
>>
>>> What is your intent - that khugepaged do this, or on alloc? How does it
>>> interact with MADV_COLLAPSE?
>>>
>>
>> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THPs
>> page faults at the start. If there is data supporting that khugepaged will work
>> than we do it, but we keep it disabled.
>
> Yes I think khugepaged is probably never going to be all that a good idea with
> this.
>
>>
>>> I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
>>> alters __thp_vma_allowable_orders() behaviour, that change belongs here...
>>>
>>>
>>
>> Thanks for this! I only tried to split this code into logical commits
>> after the whole thing was working. Some things are tightly coupled
>> and I would need to move them to the right commit.
>
> Yes there's a bunch of things that need tweaking here, to reiterate let's try to
> pay down technical debt here and avoid copy/pasting :>)
Yes, will spend a lot more time thinking about how to avoid copy/paste.
>
>>
>>>> ---
>>>> include/linux/huge_mm.h | 5 +++
>>>> include/linux/mm.h | 19 ++++++++
>>>> include/linux/mm_types.h | 5 ++-
>>>> include/linux/pgtable.h | 8 ++++
>>>> include/linux/rmap.h | 7 ++-
>>>> mm/huge_memory.c | 8 ++++
>>>> mm/internal.h | 3 ++
>>>> mm/page_vma_mapped.c | 35 +++++++++++++++
>>>> mm/pgtable-generic.c | 83 ++++++++++++++++++++++++++++++++++
>>>> mm/rmap.c | 96 +++++++++++++++++++++++++++++++++++++---
>>>> 10 files changed, 260 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index a4d9f964dfdea..e672e45bb9cc7 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>> unsigned long address);
>>>>
>>>> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> + unsigned long address);
>>>> int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>> pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>> unsigned long cp_flags);
>>>> #else
>>>> +static inline void
>>>> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> + unsigned long address) {}
>>>> static inline int
>>>> change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>> pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index ab2e7e30aef96..a15e18df0f771 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>>>> * considered ready to switch to split PUD locks yet; there may be places
>>>> * which need to be converted from page_table_lock.
>>>> */
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +static inline struct page *pud_pgtable_page(pud_t *pud)
>>>> +{
>>>> + unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
>>>> +
>>>> + return virt_to_page((void *)((unsigned long)pud & mask));
>>>> +}
>>>> +
>>>> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
>>>> +{
>>>> + return page_ptdesc(pud_pgtable_page(pud));
>>>> +}
>>>> +
>>>> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
>>>> +#endif
>>>> +
>>>> static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>>>> {
>>>> return &mm->page_table_lock;
>>>> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>>>> static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>>>> {
>>>> __pagetable_ctor(ptdesc);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> + ptdesc->pud_huge_pmd = NULL;
>>>> +#endif
>>>> }
>>>>
>>>> static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 78950eb8926dc..26a38490ae2e1 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -577,7 +577,10 @@ struct ptdesc {
>>>> struct list_head pt_list;
>>>> struct {
>>>> unsigned long _pt_pad_1;
>>>> - pgtable_t pmd_huge_pte;
>>>> + union {
>>>> + pgtable_t pmd_huge_pte; /* For PMD tables: deposited PTE */
>>>> + pgtable_t pud_huge_pmd; /* For PUD tables: deposited PMD list */
>>>> + };
>>>> };
>>>> };
>>>> unsigned long __page_mapping;
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>>>> #define arch_needs_pgtable_deposit() (false)
>>>> #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> + pmd_t *pmd_table);
>>>> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
>>>> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
>>>> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
>>>
>>> These are useless extern's.
>>>
>>
>>
>> ack
>>
>> These are coming from the existing functions from the file:
>> extern void pgtable_trans_huge_deposit
>> extern pgtable_t pgtable_trans_huge_withdraw
>>
>> I think the externs can be removed from these as well? We can
>> fix those in a separate patch.
>
> Generally the approach is to remove externs when adding/changing new stuff as
> otherwise we get completely useless churn on that and annoying git history
> changes.
Ack
>>
>>
>>>> +#endif
>>>> +
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> /*
>>>> * This is an implementation of pmdp_establish() that is only suitable for an
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index daa92a58585d9..08cd0a0eb8763 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -101,6 +101,7 @@ enum ttu_flags {
>>>> * do a final flush if necessary */
>>>> TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
>>>> * caller holds it */
>>>> + TTU_SPLIT_HUGE_PUD = 0x100, /* split huge PUD if any */
>>>> };
>>>>
>>>> #ifdef CONFIG_MMU
>>>> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>> folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>>>> void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>>>> struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
>>>> + struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>> unsigned long address, rmap_t flags);
>>>> void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>>>> pgoff_t pgoff;
>>>> struct vm_area_struct *vma;
>>>> unsigned long address;
>>>> + pud_t *pud;
>>>> pmd_t *pmd;
>>>> pte_t *pte;
>>>> spinlock_t *ptl;
>>>> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>>> static inline void
>>>> page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>> {
>>>> - WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
>>>> + WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>>>>
>>>> if (likely(pvmw->ptl))
>>>> spin_unlock(pvmw->ptl);
>>>> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>> WARN_ON_ONCE(1);
>>>>
>>>> pvmw->ptl = NULL;
>>>> + pvmw->pud = NULL;
>>>> pvmw->pmd = NULL;
>>>> pvmw->pte = NULL;
>>>> }
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 40cf59301c21a..3128b3beedb0a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>> spin_unlock(ptl);
>>>> mmu_notifier_invalidate_range_end(&range);
>>>> }
>>>> +
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> + unsigned long address)
>>>> +{
>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
>>>> + if (pud_trans_huge(*pud))
>>>> + __split_huge_pud_locked(vma, pud, address);
>>>> +}
>>>> #else
>>>> void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>> unsigned long address)
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 9ee336aa03656..21d5c00f638dc 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>>>> * in mm/rmap.c:
>>>> */
>>>> pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
>>>> +#endif
>>>>
>>>> /*
>>>> * in mm/page_alloc.c
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index b38a1d00c971b..d31eafba38041 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>> return true;
>>>> }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/* Returns true if the two ranges overlap. Careful to not overflow. */
>>>> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>> +{
>>>> + if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
>>>> + return false;
>>>> + if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
>>>> + return false;
>>>> + return true;
>>>> +}
>>>> +#endif
>>>> +
>>>> static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>>>> {
>>>> pvmw->address = (pvmw->address + size) & ~(size - 1);
>>>> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>> pud_t *pud;
>>>> pmd_t pmde;
>>>>
>>>> + /* The only possible pud mapping has been handled on last iteration */
>>>> + if (pvmw->pud && !pvmw->pmd)
>>>> + return not_found(pvmw);
>>>> +
>>>> /* The only possible pmd mapping has been handled on last iteration */
>>>> if (pvmw->pmd && !pvmw->pte)
>>>> return not_found(pvmw);
>>>> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>> continue;
>>>> }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Said it elsewhere, but it's really weird to treat an arch having the
>>> ability to do something as a go ahead for doing it.
>>>
>>>> + /* Check for PUD-mapped THP */
>>>> + if (pud_trans_huge(*pud)) {
>>>> + pvmw->pud = pud;
>>>> + pvmw->ptl = pud_lock(mm, pud);
>>>> + if (likely(pud_trans_huge(*pud))) {
>>>> + if (pvmw->flags & PVMW_MIGRATION)
>>>> + return not_found(pvmw);
>>>> + if (!check_pud(pud_pfn(*pud), pvmw))
>>>> + return not_found(pvmw);
>>>> + return true;
>>>> + }
>>>> + /* PUD was split under us, retry at PMD level */
>>>> + spin_unlock(pvmw->ptl);
>>>> + pvmw->ptl = NULL;
>>>> + pvmw->pud = NULL;
>>>> + }
>>>> +#endif
>>>> +
>>>
>>> Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
>>> modifications :)
>>>
>>
>> Yeah there is repeated code in multiple places, where all I did was replace
>> what was done from PMD into PUD. In a lot of places, its actually difficult
>> to not repeat the code (unless we want function macros, which is much worse
>> IMO).
>
> Not if we actually refactor the existing code :)
>
> When I wanted to make functional changes to mremap I took a lot of time to
> refactor the code into something sane before even starting that.
>
> Because I _could_ have added the features there as-is, but it would have been
> hellish to do so as-is and added more confusion etc.
>
> So yeah, I think a similar mentality has to be had with this change.
Ack, I will spend a lot more time thinking about the refactoring.
>
>>
>>>
>>>> pvmw->pmd = pmd_offset(pud, pvmw->address);
>>>> /*
>>>> * Make sure the pmd value isn't cached in a register by the
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index d3aec7a9926ad..2047558ddcd79 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>> }
>>>> #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Deposit page tables for PUD THP.
>>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>>> + *
>>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>>
>>> This is horrible and feels like a hack? Treating a doubly-linked list as a
>>> singly-linked one like this is not upstreamable.
>>>
>>>> + *
>>>> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
>>>> + */
>>>> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> + pmd_t *pmd_table)
>>>
>>> This is a horrid, you're depositing the PMD using the... questionable
>>> list_head abuse, but then also have pud_deposit_pte()... But here we're
>>> depositing a PMD shouldn't the name reflect that?
>>>
>>>> +{
>>>> + pgtable_t pmd_page = virt_to_page(pmd_table);
>>>> +
>>>> + assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> + /* Push onto stack using only lru.next as the link */
>>>> + pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
>>>
>>> Yikes...
>>>
>>>> + pud_huge_pmd(pudp) = pmd_page;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw the deposited PMD table for PUD THP split or zap.
>>>> + * Called with PUD lock held.
>>>> + * Returns NULL if no more PMD tables are deposited.
>>>> + */
>>>> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
>>>> +{
>>>> + pgtable_t pmd_page;
>>>> +
>>>> + assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> + pmd_page = pud_huge_pmd(pudp);
>>>> + if (!pmd_page)
>>>> + return NULL;
>>>> +
>>>> + /* Pop from stack - lru.next points to next PMD page (or NULL) */
>>>> + pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
>>>
>>> Where's the popping? You're just assigning here.
>>
>>
>> Ack on all of the above. Hopefully [1] is better.
>
> Thanks!
>
>>>
>>>> +
>>>> + return page_address(pmd_page);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
>>>> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
>>>> + * No lock assertion since the PMD isn't visible yet.
>>>> + */
>>>> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
>>>> +{
>>>> + struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> +
>>>> + /* FIFO - add to front of list */
>>>> + if (!ptdesc->pmd_huge_pte)
>>>> + INIT_LIST_HEAD(&pgtable->lru);
>>>> + else
>>>> + list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
>>>> + ptdesc->pmd_huge_pte = pgtable;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw a PTE table from a standalone PMD table.
>>>> + * Returns NULL if no more PTE tables are deposited.
>>>> + */
>>>> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
>>>> +{
>>>> + struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> + pgtable_t pgtable;
>>>> +
>>>> + pgtable = ptdesc->pmd_huge_pte;
>>>> + if (!pgtable)
>>>> + return NULL;
>>>> + ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
>>>> + struct page, lru);
>>>> + if (ptdesc->pmd_huge_pte)
>>>> + list_del(&pgtable->lru);
>>>> + return pgtable;
>>>> +}
>>>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>>> +
>>>> #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>>>> pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>> pmd_t *pmdp)
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 7b9879ef442d9..69acabd763da4 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>>>> return pmd;
>>>> }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
>>>> + * NULL if it doesn't exist. No guarantees / checks on what the pud_t*
>>>> + * represents.
>>>> + */
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
>>>
>>> This series seems to be full of copy/paste.
>>>
>>> It's just not acceptable given the state of THP code as I said in reply to
>>> the cover letter - you need to _refactor_ the code.
>>>
>>> The code is bug-prone and difficult to maintain as-is, your series has to
>>> improve the technical debt, not add to it.
>>>
>>
>> In some cases we might not be able to avoid the copy, but this is definitely
>> a place where we dont need to. I will change here. Thanks!
>
> I disagree, see above :) But thanks on this one
>
>>
>>>> +{
>>>> + pgd_t *pgd;
>>>> + p4d_t *p4d;
>>>> + pud_t *pud = NULL;
>>>> +
>>>> + pgd = pgd_offset(mm, address);
>>>> + if (!pgd_present(*pgd))
>>>> + goto out;
>>>> +
>>>> + p4d = p4d_offset(pgd, address);
>>>> + if (!p4d_present(*p4d))
>>>> + goto out;
>>>> +
>>>> + pud = pud_offset(p4d, address);
>>>> +out:
>>>> + return pud;
>>>> +}
>>>> +#endif
>>>> +
>>>> struct folio_referenced_arg {
>>>> int mapcount;
>>>> int referenced;
>>>> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>>> SetPageAnonExclusive(page);
>>>> break;
>>>> case PGTABLE_LEVEL_PUD:
>>>> - /*
>>>> - * Keep the compiler happy, we don't support anonymous
>>>> - * PUD mappings.
>>>> - */
>>>> - WARN_ON_ONCE(1);
>>>> + SetPageAnonExclusive(page);
>>>> break;
>>>> default:
>>>> BUILD_BUG();
>>>> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>>>> #endif
>>>> }
>>>>
>>>> +/**
>>>> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
>>>> + * @folio: The folio to add the mapping to
>>>> + * @page: The first page to add
>>>> + * @vma: The vm area in which the mapping is added
>>>> + * @address: The user virtual address of the first page to map
>>>> + * @flags: The rmap flags
>>>> + *
>>>> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
>>>> + *
>>>> + * The caller needs to hold the page table lock, and the page must be locked in
>>>> + * the anon_vma case: to serialize mapping,index checking after setting.
>>>> + */
>>>> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
>>>> + struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>>>> +{
>>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
>>>> + defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>>>> + __folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
>>>> + PGTABLE_LEVEL_PUD);
>>>> +#else
>>>> + WARN_ON_ONCE(true);
>>>> +#endif
>>>> +}
>>>
>>> More copy/paste... Maybe unavoidable in this case, but be good to try.
>>>
>>>> +
>>>> /**
>>>> * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>>>> * @folio: The folio to add the mapping to.
>>>> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>> }
>>>>
>>>> if (!pvmw.pte) {
>>>> + /*
>>>> + * Check for PUD-mapped THP first.
>>>> + * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
>>>> + * split the PUD to PMD level and restart the walk.
>>>> + */
>>>
>>> This is literally describing the code below, it's not useful.
>>
>> Ack, Will remove this comment, Thanks!
>
> Thanks
>
>>>
>>>> + if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> + if (flags & TTU_SPLIT_HUGE_PUD) {
>>>> + split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
>>>> + flags &= ~TTU_SPLIT_HUGE_PUD;
>>>> + page_vma_mapped_walk_restart(&pvmw);
>>>> + continue;
>>>> + }
>>>> + }
>>>> +
>>>> if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>>>> if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>>>> goto walk_done;
>>>> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>> mmu_notifier_invalidate_range_start(&range);
>>>>
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> + /* Handle PUD-mapped THP first */
>>>
>>> How did/will this interact with DAX, VFIO PUD THP?
>>
>> It wont interact with DAX. try_to_migrate does the below and just returns:
>>
>> if (folio_is_zone_device(folio) &&
>> (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
>> return;
>>
>> so DAX would never reach here.
>
> Hmm folio_is_zone_device() always returns true for DAX?
>
Yes, that is my understanding. Both fsdax and devdax call into
devm_memremap_pages() -> memremap_pages() in mm/memremap.c, which
unconditionally places all pages in ZONE_DEVICE.
> Also that's just one rmap call right?
>
Yes, just the one rmap call.
>>
>> I think vfio pages are pinned and therefore cant be migrated? (I have
>> not looked at vfio code, I will try to get a better understanding tomorrow,
>> but please let me know if that sounds wrong.)
>
> OK I've not dug into this either please do check, and be good really to test
> this code vs. actual DAX/VFIO scenarios if you can find a way to test that, thanks!
I think DAX is okay; I will look more into VFIO. I will also CC the people who added
DAX and VFIO PUD support on the next RFC.
>
>>
>>
>>>
>>>> + if (!pvmw.pte && !pvmw.pmd) {
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Won't pud_trans_huge() imply this...
>>>
>>
>> Agreed, I think it should cover it.
> Thanks!
>
>>
>>>> + /*
>>>> + * PUD-mapped THP: skip migration to preserve the huge
>>>> + * page. Splitting would defeat the purpose of PUD THPs.
>>>> + * Return false to indicate migration failure, which
>>>> + * will cause alloc_contig_range() to try a different
>>>> + * memory region.
>>>> + */
>>>> + if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> + page_vma_mapped_walk_done(&pvmw);
>>>> + ret = false;
>>>> + break;
>>>> + }
>>>> +#endif
>>>> + /* Unexpected state: !pte && !pmd but not a PUD THP */
>>>> + page_vma_mapped_walk_done(&pvmw);
>>>> + break;
>>>> + }
>>>> +
>>>> /* PMD-mapped THP migration entry */
>>>> if (!pvmw.pte) {
>>>> __maybe_unused unsigned long pfn;
>>>> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>>>>
>>>> /*
>>>> * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
>>>> - * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>> + * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>> */
>>>> if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>>>> - TTU_SYNC | TTU_BATCH_FLUSH)))
>>>> + TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>>>> return;
>>>>
>>>> if (folio_is_zone_device(folio) &&
>>>> --
>>>> 2.47.3
>>>>
>>>
>>> This isn't a final review, I'll have to look more thoroughly through here
>>> over time and you're going to have to be patient in general :)
>>>
>>> Cheers, Lorenzo
>>
>>
>> Thanks for the review, this is awesome!
>
> Ack, will do more when I have time, and obviously you're getting a lot of input
> from others too.
>
> Be good to get a summary at next THP cabal ;)
>
>>
>>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
>> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
>> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>>
>>
>>
>
> cheers, Lorenzo