Message-ID: <20250919155832.1084091-1-william.roche@oracle.com>
Date: Fri, 19 Sep 2025 15:58:32 +0000
From: "William Roche" <william.roche@...cle.com>
To: jiaqiyan@...gle.com, jgg@...dia.com
Cc: akpm@...ux-foundation.org, ankita@...dia.com, dave.hansen@...ux.intel.com,
david@...hat.com, duenwen@...gle.com, jane.chu@...cle.com,
jthoughton@...gle.com, linmiaohe@...wei.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, muchun.song@...ux.dev, nao.horiguchi@...il.com,
osalvador@...e.de, peterx@...hat.com, rientjes@...gle.com,
sidhartha.kumar@...cle.com, tony.luck@...el.com,
wangkefeng.wang@...wei.com, willy@...radead.org, harry.yoo@...cle.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
Hello,
The ability to keep a VM backed by large hugetlbfs pages running after a memory
error is very important, and the mechanism described here looks like a good
candidate to address this issue.
So I would like to provide my feedback after testing this code with the
introduction of persistent errors into the address space: my tests used a VM
running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
test program provided with this project. But instead of injecting the errors
with madvise() calls from this program, I retrieve the guest physical address of
a location and inject the error from the hypervisor into the VM, so that any
subsequent access to the location is blocked directly at the hypervisor level.
Using this framework, I realized that the code provided here has a problem:
when the error impacts a large folio, releasing this folio doesn't isolate
the sub-page(s) actually impacted by the poison, so __rmqueue_pcplist() can
return a known-poisoned page to get_page_from_freelist().
This revealed some mm limitations: I would have expected the check_new_pages()
mechanism used by the __rmqueue functions to filter these pages out, but I
noticed that this check was disabled by default in 2023 with:
[PATCH] mm, page_alloc: reduce page alloc/free sanity checks
https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
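(Note that this patch moved those sanity checks behind CONFIG_DEBUG_VM rather
than removing them entirely, so a debug test kernel can still exercise them
with a config fragment like:

```
# Re-enable the page alloc/free sanity checks that the patch
# above made conditional on debug builds.
CONFIG_DEBUG_VM=y
```

but production kernels built without it get no such filtering.)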
This problem seems to be avoided if we call take_page_off_buddy(page) in the
filemap_offline_hwpoison_folio_hugetlb() function without testing if
PageBuddy(page) is true first.
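To illustrate what I mean, here is a rough sketch of the idea, not the RFC's
actual code (the loop over the dissolved folio's sub-pages is hypothetical, and
I'm only assuming the function can reach each sub-page via folio_page()):

```c
/*
 * Hypothetical sketch: after dissolving the huge folio, try to pull
 * every poisoned sub-page off the buddy lists unconditionally.
 * take_page_off_buddy() takes zone->lock and re-checks the buddy
 * state itself, so an unlocked PageBuddy(page) pre-check can only
 * race with the allocator and miss pages.
 */
for (i = 0; i < nr_pages; i++) {
	struct page *page = folio_page(folio, i);

	if (PageHWPoison(page))
		take_page_off_buddy(page);
}
```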
But as far as I can tell, this still leaves a (small) race window where a new
page allocation could pick up a poisoned sub-page between the dissolve phase
and the attempt to remove it from the buddy allocator.
I have the impression that the correct behavior (isolating an impacted
sub-page and remapping the valid memory content) with large pages is
currently only achieved with Transparent Huge Pages.
If performance requires using hugetlb pages, then maybe we could accept losing
a huge page after a memory-impacted MFD_MF_KEEP_UE_MAPPED memfd segment is
released, if that easily avoids other corruption?
I'm very interested in finding an appropriate way to deal with memory errors on
hugetlbfs pages, and I'm willing to help build a valid solution. This project
showed a real possibility to do so, even in cases where pinned memory is used -
with VFIO, for example.
I would really be interested in your feedback on this project, and if another
solution is considered better suited to deal with errors on hugetlbfs pages,
please let us know.
Thanks in advance for your answers.
William.