Message-ID: <20251015064926.1887643-1-qiuxu.zhuo@intel.com>
Date: Wed, 15 Oct 2025 14:49:26 +0800
From: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
To: akpm@...ux-foundation.org,
david@...hat.com,
lorenzo.stoakes@...cle.com,
linmiaohe@...wei.com,
tony.luck@...el.com
Cc: qiuxu.zhuo@...el.com,
ziy@...dia.com,
baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com,
npache@...hat.com,
ryan.roberts@....com,
dev.jain@....com,
baohua@...nel.org,
nao.horiguchi@...il.com,
farrah.chen@...el.com,
jiaqiyan@...gle.com,
lance.yang@...ux.dev,
richard.weiyang@...il.com,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH v4 1/1] mm: prevent poison consumption when splitting THP
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
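For readers decoding the log by hand, the sketch below checks the flag bits of the Bank 3 status value bd80000000070134. The bit positions follow the architectural IA32_MCi_STATUS layout described in the Intel SDM; the macro and helper names are illustrative assumptions and are not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS flag bits (names are illustrative). */
#define MODEL_MCI_STATUS_VAL   (1ULL << 63) /* register contents are valid */
#define MODEL_MCI_STATUS_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define MODEL_MCI_STATUS_UC    (1ULL << 61) /* uncorrected error */
#define MODEL_MCI_STATUS_EN    (1ULL << 60) /* error reporting enabled */
#define MODEL_MCI_STATUS_MISCV (1ULL << 59) /* MISC register valid */
#define MODEL_MCI_STATUS_ADDRV (1ULL << 58) /* ADDR register valid */
#define MODEL_MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt */
#define MODEL_MCI_STATUS_S     (1ULL << 56) /* signaled via machine check */
#define MODEL_MCI_STATUS_AR    (1ULL << 55) /* software action required */

/* Low 16 bits carry the MCA error code. */
static uint16_t model_mci_mca_code(uint64_t status)
{
	return (uint16_t)(status & 0xffff);
}

/*
 * True when the status describes an uncorrected, signaled,
 * action-required error whose processor context is NOT corrupt.
 */
static bool model_mci_is_action_required(uint64_t status)
{
	const uint64_t need = MODEL_MCI_STATUS_VAL | MODEL_MCI_STATUS_UC |
			      MODEL_MCI_STATUS_EN | MODEL_MCI_STATUS_S |
			      MODEL_MCI_STATUS_AR;

	assert(need != 0);
	return (status & need) == need && !(status & MODEL_MCI_STATUS_PCC);
}
```

For the logged value, VAL, UC, EN, MISCV, ADDRV, S and AR are set while OVER and PCC are clear. That is consistent with the log's "Data load in unrecoverable area of kernel" line: the error class itself is recoverable, but the load was issued from memchr_inv() in kernel context.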
The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the pages in the THP to identify zero-filled pages. However, reading
the pages in the THP results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. The call trace below shows where the two #MCs occur.
First Machine Check occurs // [1]
  memory_failure() // [2]
    try_to_split_thp_page()
      split_huge_page()
        split_huge_page_to_list_to_order()
          __folio_split() // [3]
            remap_page()
              remove_migration_ptes()
                remove_migration_pte()
                  try_to_map_unused_to_zeropage() // [4]
                    memchr_inv() // [5]
                      Second Machine Check occurs // [6]
                        Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused THP to zeropage.
[5] Re-access pages in the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a kernel panic.
In Step[2], memory_failure() sets the poisoned flag on the page in the
THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing the
poisoned page in the THP during zeropage identification, while continuing
to scan unaffected pages in the THP for possible zeropage mapping. This
prevents a second in-kernel #MC that would cause a kernel panic in Step[4].
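The ordering described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_page, its hwpoison flag (standing in for PageHWPoison()) and the model_* helpers are hypothetical, and the reads counter exists solely to show that a poisoned page's contents are never touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64 /* tiny page size, for the model only */

struct model_page {
	bool hwpoison;       /* stands in for PageHWPoison() */
	unsigned long reads; /* how many times the contents were read */
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Reads the page contents, like the memchr_inv() scan in the trace. */
static bool model_page_is_zero(struct model_page *p)
{
	size_t i;

	p->reads++;
	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (p->data[i])
			return false;
	return true;
}

/* Check the poison flag first; only then read the data. */
static bool model_can_map_zeropage(struct model_page *p)
{
	if (p->hwpoison)
		return false;
	return model_page_is_zero(p);
}

/* Unaffected pages are still scanned; the poisoned one is skipped. */
static size_t model_count_zeropage_candidates(struct model_page *pages,
					      size_t n)
{
	size_t i, count = 0;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (model_can_map_zeropage(&pages[i]))
			count++;
	return count;
}
```

In the model, a poisoned page is rejected before any byte of it is read, while its zero-filled neighbours remain eligible for zeropage mapping.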
Thanks to Andrew Zaborowski for his initial work on fixing this issue.
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Reported-by: Farrah Chen <farrah.chen@...el.com>
Suggested-by: David Hildenbrand <david@...hat.com>
Tested-by: Farrah Chen <farrah.chen@...el.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
Acked-by: Lance Yang <lance.yang@...ux.dev>
Reviewed-by: Wei Yang <richard.weiyang@...il.com>
Acked-by: Zi Yan <ziy@...dia.com>
Reviewed-by: Miaohe Lin <linmiaohe@...wei.com>
Acked-by: David Hildenbrand <david@...hat.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
---
v3 -> v4:
- No code changes.
- s/sub-page of the THP/page in the THP/ in the commit message.
- s/sub-pages of the THP/pages in the THP/ in the commit message.
- Simplify the credits in the commit message.
- Collect David Hildenbrand's "Acked-by:" tag.
v2 -> v3:
- No code changes.
- Rebased on top of v6.18-rc1 and retested.
- Add two "Fixes:" tags.
- Collect Lance Yang's "Acked-by:" tag.
- Collect Wei Yang's "Reviewed-by:" tag.
- Collect Zi Yan's "Acked-by:" tag.
- Collect Miaohe's "Reviewed-by:" tag.
v1 -> v2:
- Apply David Hildenbrand's fix suggestion.
- Update the commit message to reflect the new fix.
- Add David Hildenbrand's "Suggested-by:" tag.
- Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
[ I cannot reach him to get his SoB for the completely rewritten commit
message and new fix approach. ]
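The mm/huge_memory.c hunk in this patch takes a folio-wide approach in thp_underused(): if the folio contains any poisoned page, the zero-page scan is skipped entirely. A simplified user-space sketch of that control flow follows. The struct, the has_hwpoisoned field (standing in for folio_contain_hwpoisoned_page()), the max_none parameter (standing in for khugepaged_max_ptes_none) and the pages_scanned counter are all hypothetical, and the real function's additional early exits are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 32 /* tiny page size, for the model only */

struct model_folio {
	bool has_hwpoisoned;       /* folio_contain_hwpoisoned_page() stand-in */
	size_t nr_pages;
	const unsigned char *data; /* nr_pages * MODEL_PAGE_SIZE bytes */
	size_t pages_scanned;      /* instrumentation for the model */
};

static bool model_page_zero(const unsigned char *p)
{
	size_t i;

	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (p[i])
			return false;
	return true;
}

/* Returns true once more than max_none pages are found zero-filled. */
static bool model_thp_underused(struct model_folio *folio, size_t max_none)
{
	size_t i, num_zero = 0;

	assert(folio != NULL && folio->data != NULL);

	/* Bail out before reading any page of a poisoned folio. */
	if (folio->has_hwpoisoned)
		return false;

	for (i = 0; i < folio->nr_pages; i++) {
		folio->pages_scanned++;
		if (model_page_zero(folio->data + i * MODEL_PAGE_SIZE) &&
		    ++num_zero > max_none)
			return true;
	}
	return false;
}
```

A poisoned folio is reported as not underused without any of its pages being read, so it is never queued for the kind of scan that triggered the second #MC.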
mm/huge_memory.c | 3 +++
mm/migrate.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..1d1b74950332 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4109,6 +4109,9 @@ static bool thp_underused(struct folio *folio)
 	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
 		return false;
 
+	if (folio_contain_hwpoisoned_page(folio))
+		return false;
+
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
 			if (++num_zero_pages > khugepaged_max_ptes_none)
diff --git a/mm/migrate.c b/mm/migrate.c
index e3065c9edb55..c0e9f15be2a2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,8 +301,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
 
-	if (PageCompound(page))
+	if (PageCompound(page) || PageHWPoison(page))
 		return false;
+
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(pte_present(old_pte), page);
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
--
2.43.0