Message-ID: <20260203192352.2674184-1-jiaqiyan@google.com>
Date: Tue, 3 Feb 2026 19:23:49 +0000
From: Jiaqi Yan <jiaqiyan@...gle.com>
To: linmiaohe@...wei.com, william.roche@...cle.com, harry.yoo@...cle.com,
jane.chu@...cle.com
Cc: nao.horiguchi@...il.com, tony.luck@...el.com, wangkefeng.wang@...wei.com,
willy@...radead.org, akpm@...ux-foundation.org, osalvador@...e.de,
rientjes@...gle.com, duenwen@...gle.com, jthoughton@...gle.com,
jgg@...dia.com, ankita@...dia.com, peterx@...hat.com,
sidhartha.kumar@...cle.com, ziy@...dia.com, david@...hat.com,
dave.hansen@...ux.intel.com, muchun.song@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
Jiaqi Yan <jiaqiyan@...gle.com>
Subject: [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB
Problem
=======
This patchset is a follow-up for the userspace memory failure
recovery (MFR) policy proposed in [1] and [2], but focused on
a smaller scope: HugeTLB.
To recap the problem for HugeTLB discussed in [1] and [2]:
Cloud providers like Google and Oracle usually serve capacity-
and performance-critical guest memory with 1G HugeTLB
hugepages, as this significantly reduces the overhead
associated with managing page tables and TLB misses. However,
the kernel's current MFR behavior for HugeTLB is not ideal.
Once a byte of memory in a hugepage is hardware corrupted, the
kernel discards the whole hugepage, including the healthy
portion, from the HugeTLB system. A customer workload running in
the VM can hardly recover from losing that much memory.
[1] and [2] proposed the idea that the decision to keep or
discard a large chunk of contiguous memory exclusively owned
by a userspace process due to a recoverable uncorrected
memory error (UE) should be controlled by userspace. What this
means in the Cloud case is that, since a virtual machine
monitor (VMM) has taken host memory to exclusively back the
guest memory for a VM, the VMM can keep holding the memory
even after memory errors occur.
MFD_MF_KEEP_UE_MAPPED for HugeTLB
=================================
[2] proposed a solution centered around the memfd associated
with the memory exclusively owned by userspace.
A userspace process must opt into the MFD_MF_KEEP_UE_MAPPED
policy when it creates a new HugeTLB-backed memfd:
    #define MFD_MF_KEEP_UE_MAPPED 0x0020U
    int memfd_create(const char *name, unsigned int flags);
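From the userspace side, opting in could look roughly like the sketch
below. It is only an illustration under assumptions: the value of
MFD_MF_KEEP_UE_MAPPED is copied from this cover letter and is not in
released uapi headers, the helper names mfr_memfd_flags() and
create_guest_memfd() are hypothetical, and the fallback assumes that a
kernel without this series rejects the unknown flag with EINVAL (EINVAL
can also have other causes).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>

/* Value taken from this cover letter; not in released uapi headers. */
#ifndef MFD_MF_KEEP_UE_MAPPED
#define MFD_MF_KEEP_UE_MAPPED 0x0020U
#endif
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MFD_HUGE_1GB
#define MFD_HUGE_1GB (30U << 26)	/* log2(1G) << MFD_HUGE_SHIFT */
#endif

/* Hypothetical helper: memfd_create() flags for a 1G HugeTLB memfd. */
unsigned int mfr_memfd_flags(int keep_ue_mapped)
{
	unsigned int flags = MFD_CLOEXEC | MFD_HUGETLB | MFD_HUGE_1GB;

	if (keep_ue_mapped)
		flags |= MFD_MF_KEEP_UE_MAPPED;
	return flags;
}

/*
 * Hypothetical helper: try to opt into the policy and fall back to the
 * default MFR behavior if the kernel rejects the flag with EINVAL.
 * Returns the fd (>= 0) or -1 with errno set; *policy_enabled reports
 * whether the opt-in succeeded.
 */
int create_guest_memfd(const char *name, int *policy_enabled)
{
	int fd = memfd_create(name, mfr_memfd_flags(1));

	*policy_enabled = (fd >= 0);
	if (fd < 0 && errno == EINVAL)
		fd = memfd_create(name, mfr_memfd_flags(0));
	return fd;
}
```

A VMM would then size the file with ftruncate() and mmap() it MAP_SHARED
as guest memory, as it does for any HugeTLB-backed memfd.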
Whenever a hugepage associated with an MFD_MF_KEEP_UE_MAPPED
enabled memfd runs into a UE, MFR does not hard offline the
HWPoison huge folio. In other words, the HWPoison memory remains
accessible via the returned memfd or any memory mapping created
with that memfd. MFR still sends SIGBUS to the userspace process
as required, and still maintains HWPoison metadata on the
hugepage having the UE.
A HWPoison hugepage will be immediately isolated and
prevented from future allocation once userspace truncates it
via the memfd, or the owning memfd is closed.
By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard
offlines hugepages having UEs.
Implementation
==============
The implementation is relatively straightforward and has two major parts.
Part 1: When hugepages owned by an MFD_MF_KEEP_UE_MAPPED
enabled memfd run into a UE:
* MFR defers hard offline operations, i.e., unmapping and
dissolving. MFR still sets HWPoison flags and holds a refcount
for every raw HWPoison page. MFR still sends SIGBUS to the
consuming thread, but si_addr_lsb will be reduced to PAGE_SHIFT.
* If the memory has not been faulted in yet, the fault handler also
needs to let the fault on the HWPoison folio proceed.
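Because si_addr_lsb shrinks to PAGE_SHIFT under this policy, a SIGBUS
consumer can narrow the poisoned range down to a single base page
instead of the whole hugepage. The sketch below shows only that
arithmetic; struct ue_range and ue_range_from_sigbus() are hypothetical
names, and the two arguments are meant to be the si_addr and si_addr_lsb
fields of the siginfo_t delivered with the signal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the range reported by one SIGBUS. */
struct ue_range {
	uintptr_t start;	/* si_addr aligned down to 1 << si_addr_lsb */
	size_t len;		/* 1 << si_addr_lsb bytes */
};

/*
 * Derive the affected range from the siginfo_t fields. si_addr_lsb is
 * the least significant bit of the reported address that is valid, so
 * the error covers the naturally aligned block of 1 << si_addr_lsb bytes.
 */
struct ue_range ue_range_from_sigbus(const void *si_addr, short si_addr_lsb)
{
	struct ue_range r;

	r.len = (size_t)1 << si_addr_lsb;
	r.start = (uintptr_t)si_addr & ~((uintptr_t)r.len - 1);
	return r;
}
```

With si_addr_lsb equal to 12 the range is one 4K page; with the hugepage
shift of 30 it would have been the entire 1G hugepage.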
Part 2: When an MFD_MF_KEEP_UE_MAPPED enabled memfd is being
released, or when a userspace process truncates a range of
hugepages belonging to an MFD_MF_KEEP_UE_MAPPED enabled memfd:
* When the HugeTLB in-memory file system removes a filemap's
folios one by one, it asks MFR to deal with HWPoison folios
on the fly, implemented by filemap_offline_hwpoison_folio().
* MFR drops the refcounts being held for the raw HWPoison
pages within the folio. Once the HWPoison folio has thereby become
a free HugeTLB folio, MFR dissolves it into a set of raw pages.
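Userspace can trigger Part 2 for a single hugepage without closing the
memfd. The sketch below is one possible way to do that, under stated
assumptions: the mapping starts at file offset 0 of the memfd, and
punching a hole with fallocate() is treated as truncating that range
(hugetlbfs removes folios for both operations, but whether this series
handles the hole-punch path identically is not stated here). Both helper
names are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical helper: file offset of the hugepage containing ue_addr,
 * assuming map_base is where file offset 0 of the memfd is mapped and
 * hugepage_size is a power of two.
 */
off_t hugepage_file_offset(uintptr_t map_base, uintptr_t ue_addr,
			   size_t hugepage_size)
{
	uintptr_t delta = ue_addr - map_base;

	return (off_t)(delta & ~((uintptr_t)hugepage_size - 1));
}

/*
 * Hypothetical helper: give up one hugepage of the memfd while keeping
 * the file size, so the kernel can isolate the HWPoison hugepage.
 * Returns 0 on success or -1 with errno set.
 */
int drop_poisoned_hugepage(int memfd, off_t off, size_t hugepage_size)
{
	return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, (off_t)hugepage_size);
}
```

A VMM would typically do this only after migrating or abandoning the
guest data in the healthy part of the hugepage, since the whole hugepage
is released.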
Changelog
=========
v2 [3] -> v3
- Rebase onto [4] to simplify filemap_offline_hwpoison_folio_hugetlb().
With free_has_hwpoisoned() rejecting HWPoison subpages in a HugeTLB
folio, there is no need to take_page_off_buddy() after
dissolve_free_hugetlb_folio().
- Address comments from William Roche <william.roche@...cle.com> and
Jane Chu <jane.chu@...cle.com>.
- Update size_shift in kill_accessing_process() if MFD_MF_KEEP_UE_MAPPED
is enabled. Thanks to William Roche <william.roche@...cle.com> for providing
his patch for this.
- Add a new tunable to hugetlb-mfr to control the number of pages within
the 1st hugepage to MADV_HWPOISON.
v1 [2] -> v2
- Rebase onto commit 6da43bbeb6918 ("Merge tag 'vfio-v6.18-rc6' of
https://github.com/awilliam/linux-vfio").
- Remove populate_memfd_hwp_folios() and offline_memfd_hwp_folios() so
that no memory allocation is needed when releasing a HWPoison memfd.
- Insert filemap_offline_hwpoison_folio() into remove_inode_single_folio().
Now dissolving and offlining HWPoison huge folios is done on the fly.
- Fix the bug pointed out by William Roche <william.roche@...cle.com>:
call take_page_off_buddy() regardless of whether the HWPoison page is a
buddy page.
- Remove update_per_node_mf_stats() when dissolve failed.
- Make hugetlb-mfr allocate 4 1G hugepages to cover new code introduced
in remove_inode_hugepages().
- Make hugetlb-mfr support testing both 1GB and 2MB HugeTLB hugepages.
- Fix some typos in documentation.
[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/lkml/20250118231549.1652825-1-jiaqiyan@google.com
[3] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com
[4] https://lore.kernel.org/linux-mm/20260202194125.2191216-1-jiaqiyan@google.com
Jiaqi Yan (3):
mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
selftests/mm: test userspace MFR for HugeTLB hugepage
Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/mfd_mfr_policy.rst | 60 +++
fs/hugetlbfs/inode.c | 25 +-
include/linux/hugetlb.h | 7 +
include/linux/pagemap.h | 23 ++
include/uapi/linux/memfd.h | 6 +
mm/hugetlb.c | 8 +-
mm/memfd.c | 15 +-
mm/memory-failure.c | 124 +++++-
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/hugetlb-mfr.c | 369 ++++++++++++++++++
12 files changed, 627 insertions(+), 15 deletions(-)
create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
--
2.53.0.rc2.204.g2597b5adb4-goog