Message-ID: <CACw3F521fi5HWhCKi_KrkNLXkw668HO4h8+DjkP2+vBuK-=org@mail.gmail.com>
Date: Mon, 13 Oct 2025 15:14:32 -0700
From: Jiaqi Yan <jiaqiyan@...gle.com>
To: "William Roche" <william.roche@...cle.com>,
Ackerley Tng <ackerleytng@...gle.com>
Cc: jgg@...dia.com, akpm@...ux-foundation.org, ankita@...dia.com,
dave.hansen@...ux.intel.com, david@...hat.com, duenwen@...gle.com,
jane.chu@...cle.com, jthoughton@...gle.com, linmiaohe@...wei.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, muchun.song@...ux.dev, nao.horiguchi@...il.com,
osalvador@...e.de, peterx@...hat.com, rientjes@...gle.com,
sidhartha.kumar@...cle.com, tony.luck@...el.com, wangkefeng.wang@...wei.com,
willy@...radead.org, harry.yoo@...cle.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
On Fri, Sep 19, 2025 at 8:58 AM "William Roche" <william.roche@...cle.com> wrote:
>
> From: William Roche <william.roche@...cle.com>
>
> Hello,
>
> The ability to keep a VM running on large hugetlbfs pages after a memory
> error is very important, and the approach described here could be a good
> candidate to address this issue.
Thanks for expressing interest, William, and sorry for getting back to
you so late.
>
> So I would like to provide my feedback after testing this code with the
> introduction of persistent errors in the address space: my tests used a VM
> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> test program provided with this project. But instead of injecting the errors
> with madvise calls from this program, I take the guest physical address of a
> location and inject the error from the hypervisor into the VM, so that any
> subsequent access to the location is blocked directly at the hypervisor
> level.
This is exactly what the VMM should do: when it owns or manages the VM
memory with MFD_MF_KEEP_UE_MAPPED, it is then the VMM's responsibility to
isolate the guest/VCPUs from poisoned memory pages, e.g. by intercepting
such memory accesses.
>
> Using this framework, I realized that the code provided here has a problem:
> When the error impacts a large folio, the release of this folio doesn't isolate
> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> a known poisoned page to get_page_from_freelist().
Just curious: how exactly do you repro this leak of a known poisoned
page? It may help me debug my patch.
>
> This revealed some mm limitations, as I would have expected that the
> check_new_pages() mechanism used by the __rmqueue functions would filter these
> pages out, but I noticed that this has been disabled by default in 2023 with:
> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
and testing but didn't notice any "bad page" warnings; it is very
likely I was just lucky.
>
>
> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> PageBuddy(page) is true first.
Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
shouldn't make the take_page_off_buddy(page) call conditional on
PageBuddy(page). take_page_off_buddy itself already checks PageBuddy on
the page_head at each page order. So maybe a known poisoned page is
sometimes not taken off the buddy allocator because of this?
Let me try to fix it in v2, by the end of the week. If you could test
with your way of repro as well, that would be very helpful!
> But in my view it leaves a (small) race window where a new page
> allocation could grab a poisoned sub-page between the dissolve phase and the
> attempt to remove it from the buddy allocator.
>
> I have the impression that the correct behavior (isolating an impacted
> sub-page and remapping the valid memory content) with large pages is
> currently only achieved with Transparent Huge Pages.
> If performance requires using hugetlb pages, then maybe we could accept
> losing a huge page after a memory-error-impacted MFD_MF_KEEP_UE_MAPPED memfd
> segment is released, if that easily avoids other corruption?
>
> I'm very interested in finding an appropriate way to deal with memory errors on
> hugetlbfs pages, and am willing to help build a valid solution. This project
> showed a real possibility to do so, even in cases where pinned memory is used -
> with VFIO for example.
>
> I would really be interested in your feedback about this project, and if
> another solution is considered more suitable for dealing with errors on
> hugetlbfs pages, please let us know.
There is also another possible path if the VMM can switch to backing VM
memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
work [1], guest_memfd can split the 1G page for conversions. If we
re-use that splitting for memory failure recovery, we can probably
achieve something broadly similar to THP's memory failure recovery:
split the 1G page into 2M and 4k chunks, then unmap only the poisoned 4k
page. We still lose the 1G TLB reach, so the VM may pay some
performance cost.
[1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> Thanks in advance for your answers.
> William.