Message-ID: <05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com>
Date: Tue, 3 Feb 2026 14:07:25 -0800
From: Usama Arif <usamaarif642@...il.com>
To: Zi Yan <ziy@...dia.com>, Kiryl Shutsemau <kas@...nel.org>,
 lorenzo.stoakes@...cle.com
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...nel.org>, linux-mm@...ck.org,
 hannes@...xchg.org, riel@...riel.com, shakeel.butt@...ux.dev,
 baohua@...nel.org, dev.jain@....com, baolin.wang@...ux.alibaba.com,
 npache@...hat.com, Liam.Howlett@...cle.com, ryan.roberts@....com,
 vbabka@...e.cz, lance.yang@...ux.dev, linux-kernel@...r.kernel.org,
 kernel-team@...a.com
Subject: Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support



On 02/02/2026 08:01, Zi Yan wrote:
> On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:
> 
>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>> For page table management, PUD THPs need to pre-deposit page tables
>>> that will be used when the huge page is later split. When a PUD THP
>>> is allocated, we cannot know in advance when or why it might need to
>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>> deposited PMD.
>>>
>>> The deposited PMD tables are stored as a singly-linked stack using only
>>> page->lru.next as the link pointer. A doubly-linked list using the
>>> standard list_head mechanism would cause memory corruption: list_del()
>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>> problem because their deposited PTE tables don't have sub-deposits.
>>> Using only lru.next avoids the overlap entirely.
>>>
>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>> there is no PUD-level migration entry format - the split converts the
>>> single PUD mapping into individual PTE mappings that can be migrated
>>> or swapped normally.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>> ---
>>>  include/linux/huge_mm.h  |  5 +++
>>>  include/linux/mm.h       | 19 ++++++++
>>>  include/linux/mm_types.h |  5 ++-
>>>  include/linux/pgtable.h  |  8 ++++
>>>  include/linux/rmap.h     |  7 ++-
>>>  mm/huge_memory.c         |  8 ++++
>>>  mm/internal.h            |  3 ++
>>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>>
> 
> <snip>
> 
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index d3aec7a9926ad..2047558ddcd79 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>  }
>>>  #endif
>>>
>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +/*
>>> + * Deposit page tables for PUD THP.
>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>> + *
>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>
>> This is ugly.
>>
>> Sounds like you want to use llist_node/head instead of list_head for this.
>>
>> You might be able to avoid taking the lock in some cases. Note that
>> pud_lockptr() is mm->page_table_lock as of now.
> 
> I agree. I used llist_node/head in my implementation[1] and it works.
> I have an illustration at[2] to show the concept. Feel free to reuse the code.
> 
> 
> [1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
> [2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism
> 
> Best Regards,
> Yan, Zi



Ah, I should have looked at your patches more closely! I started out using just lru
with list_add()/list_del(), which was of course corrupting the list, and it took me
way more time than I would like to admit to debug what was going on! The diagrams
in your second link are really useful; I ended up drawing the same thing by hand to
debug the corruption issue. I will point to that link in the next series :)
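
For anyone following along, the overlap behind that corruption can be modelled in
userspace. This is only a sketch under stated assumptions: struct model_ptdesc, the
MODEL_POISON values and both helper functions are hypothetical names made up for
illustration, and the layout is a simplified stand-in for the real struct page/ptdesc.
It models just the two poisoning stores that list_del() finishes with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical, simplified layout: one word of flags, then a union in
 * which lru.prev and pmd_huge_pte share the same storage. This is an
 * assumption made for illustration, not the real struct page/ptdesc.
 */
struct model_ptdesc {
	unsigned long flags;
	union {
		struct list_head lru;
		struct {
			void *unused;		/* overlays lru.next */
			void *pmd_huge_pte;	/* overlays lru.prev */
		};
	};
};

/* Placeholder poison values; the kernel uses LIST_POISON1/LIST_POISON2. */
#define MODEL_POISON1 ((struct list_head *)(uintptr_t)0x100)
#define MODEL_POISON2 ((struct list_head *)(uintptr_t)0x122)

/* Models only the two poisoning stores that list_del() ends with. */
static void model_list_del_poison(struct model_ptdesc *pt)
{
	pt->lru.next = MODEL_POISON1;
	pt->lru.prev = MODEL_POISON2;
}

/* Returns 1 if the deposited PTE pointer was clobbered by the poison. */
static int model_deposit_is_clobbered(void)
{
	static int pte_table;	/* stands in for a deposited PTE table */
	struct model_ptdesc pt = { 0 };

	/* The whole problem: both members live at the same offset. */
	assert(offsetof(struct model_ptdesc, lru.prev) ==
	       offsetof(struct model_ptdesc, pmd_huge_pte));

	pt.pmd_huge_pte = &pte_table;	/* deposit a PTE table */
	model_list_del_poison(&pt);	/* unlink the PMD page */

	return pt.pmd_huge_pte == (void *)MODEL_POISON2;
}
```

In this model model_deposit_is_clobbered() reports that the deposit pointer now holds
the poison value, which is the state a later PTE table withdraw would trip over.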

How about something like the diff below, on top of this patch? (It does not include the
comment changes that I will make everywhere.)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26a38490ae2e1..3653e24ce97d7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,9 @@ struct page {
                                struct list_head buddy_list;
                                struct list_head pcp_list;
                                struct llist_node pcp_llist;
+
+                               /* PMD pagetable deposit head */
+                               struct llist_node pgtable_deposit_head;
                        };
                        struct address_space *mapping;
                        union {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 2047558ddcd79..764f14d0afcbb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -215,9 +215,7 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       /* Push onto stack using only lru.next as the link */
-       pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
-       pud_huge_pmd(pudp) = pmd_page;
+       llist_add(&pmd_page->pgtable_deposit_head, (struct llist_head *)&pud_huge_pmd(pudp));
 }
 
 /*
@@ -227,16 +225,16 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
  */
 pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
 {
+       struct llist_node *node;
        pgtable_t pmd_page;
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       pmd_page = pud_huge_pmd(pudp);
-       if (!pmd_page)
+       node = llist_del_first((struct llist_head *)&pud_huge_pmd(pudp));
+       if (!node)
                return NULL;
 
-       /* Pop from stack - lru.next points to next PMD page (or NULL) */
-       pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+       pmd_page = llist_entry(node, struct page, pgtable_deposit_head);
 
        return page_address(pmd_page);
 }
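
To show the intent of the diff above outside the kernel, here is a minimal userspace
sketch of a single-pointer stack. Everything prefixed with model_ is a hypothetical
name for illustration. The push and pop are plain stores that assume the caller
already serialises access (as the PUD lock does here), whereas the kernel's
llist_add() and llist_del_first() are built on atomic operations.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's llist_node/llist_head. */
struct model_llist_node {
	struct model_llist_node *next;
};

struct model_llist_head {
	struct model_llist_node *first;
};

/* Hypothetical PMD page descriptor: one link word plus the PTE deposit. */
struct model_pmd_page {
	struct model_llist_node deposit;	/* the only link pointer */
	void *pmd_huge_pte;			/* must survive push and pop */
};

/* Push: same effect as llist_add() when the caller already holds a lock. */
static void model_push(struct model_llist_node *node,
		       struct model_llist_head *head)
{
	node->next = head->first;
	head->first = node;
}

/* Pop: same effect as llist_del_first(); returns NULL when empty. */
static struct model_llist_node *model_pop(struct model_llist_head *head)
{
	struct model_llist_node *node = head->first;

	if (node)
		head->first = node->next;
	return node;
}

/* Recover the containing descriptor, like llist_entry()/container_of(). */
static struct model_pmd_page *model_entry(struct model_llist_node *node)
{
	assert(node != NULL);
	return (struct model_pmd_page *)((char *)node -
			offsetof(struct model_pmd_page, deposit));
}

/* Returns 1 if two deposits come back in LIFO order with PTE deposits intact. */
static int model_deposit_roundtrip(void)
{
	static int pte_a, pte_b;	/* stand in for deposited PTE tables */
	struct model_llist_head head = { NULL };
	struct model_pmd_page a = { { NULL }, &pte_a };
	struct model_pmd_page b = { { NULL }, &pte_b };
	struct model_pmd_page *first, *second;

	model_push(&a.deposit, &head);
	model_push(&b.deposit, &head);

	first = model_entry(model_pop(&head));
	second = model_entry(model_pop(&head));

	return first == &b && second == &a &&
	       first->pmd_huge_pte == &pte_b &&
	       second->pmd_huge_pte == &pte_a &&
	       model_pop(&head) == NULL;
}
```

Because only one word per node is written, nothing in the push or pop path touches
the neighbouring pmd_huge_pte field in this model.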

Also, Zi, is it OK if I add your Co-developed-by tag to this patch in future revisions?
I didn't want to do that without your explicit approval.

