Message-ID: <24367c05-ad1f-4546-b2ed-69e587113a54@oracle.com>
Date: Tue, 14 Oct 2025 22:57:14 +0200
From: William Roche <william.roche@...cle.com>
To: Jiaqi Yan <jiaqiyan@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>
Cc: jgg@...dia.com, akpm@...ux-foundation.org, ankita@...dia.com,
        dave.hansen@...ux.intel.com, david@...hat.com, duenwen@...gle.com,
        jane.chu@...cle.com, jthoughton@...gle.com, linmiaohe@...wei.com,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, muchun.song@...ux.dev, nao.horiguchi@...il.com,
        osalvador@...e.de, peterx@...hat.com, rientjes@...gle.com,
        sidhartha.kumar@...cle.com, tony.luck@...el.com,
        wangkefeng.wang@...wei.com, willy@...radead.org, harry.yoo@...cle.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd

On 10/14/25 00:14, Jiaqi Yan wrote:
 > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
 > [...]
 >>
 >> Using this framework, I realized that the code provided here has a
 >> problem:
 >> When the error impacts a large folio, the release of this folio
 >> doesn't isolate the sub-page(s) actually impacted by the poison.
 >> __rmqueue_pcplist() can return a known poisoned page to
 >> get_page_from_freelist().
 >
 > Just curious, how exactly can you repro this leaking of a known
 > poisoned page? It may help me debug my patch.
 >

When the memfd segment impacted by a memory error is released, the 
poisoned sub-page is not removed from the freelist, and a later memory 
allocation (large enough to increase the chance of getting this page) 
crashes the system with a stack trace like the following, here while the 
freshly allocated page is being cleared:

[  479.572513] RIP: 0010:clear_page_erms+0xb/0x20
[...]
[  479.587565]  post_alloc_hook+0xbd/0xd0
[  479.588371]  get_page_from_freelist+0x3a6/0x6d0
[  479.589221]  ? srso_alias_return_thunk+0x5/0xfbef5
[  479.590122]  __alloc_frozen_pages_noprof+0x186/0x380
[  479.591012]  alloc_pages_mpol+0x7b/0x180
[  479.591787]  vma_alloc_folio_noprof+0x70/0xf0
[  479.592609]  alloc_anon_folio+0x1a0/0x3a0
[  479.593401]  do_anonymous_page+0x13f/0x4d0
[  479.594174]  ? pte_offset_map_rw_nolock+0x1f/0xa0
[  479.595035]  __handle_mm_fault+0x581/0x6c0
[  479.595799]  handle_mm_fault+0xcf/0x2a0
[  479.596539]  do_user_addr_fault+0x22b/0x6e0
[  479.597349]  exc_page_fault+0x67/0x170
[  479.598095]  asm_exc_page_fault+0x26/0x30

The idea is to run the test program in the VM, but instead of using 
madvise to poison the location, I take the guest physical address of the 
location, translate it with the QEMU monitor command 'gpa2hpa', and 
inject the error on the hypervisor with the hwpoison-inject module 
(for example).
Then let the test program finish and run a memory allocator trying to 
take as much memory as possible: you should end up with a panic of the VM.
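
For reference, any simple memory hog that touches a lot of anonymous
memory does the job. A minimal sketch (this is not the exact program I
use, just an illustration) could look like this:

/* Memory hog: map and touch anonymous memory until mmap() fails. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK (64UL << 20)	/* 64 MB per iteration */

int main(void)
{
	unsigned long total = 0;

	for (;;) {
		void *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
		/*
		 * Touch every page so the kernel really allocates it:
		 * one of these faults eventually receives the poisoned
		 * page from the freelist.
		 */
		memset(p, 0x55, CHUNK);
		total += CHUNK;
	}
	printf("allocated %lu MB before giving up\n", total >> 20);
	return 0;
}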

 >>
 >> This revealed some mm limitations, as I would have expected that the
 >> check_new_pages() mechanism used by the __rmqueue functions would
 >> filter these pages out, but I noticed that this has been disabled by
 >> default in 2023 with:
 >> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
 >> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
 >
 > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
 > and testing but didn't notice any "bad page" WARNING; it is very
 > likely that I was just lucky.
 >
 >
 >>
 >>
 >> This problem seems to be avoided if we call take_page_off_buddy(page)
 >> in the filemap_offline_hwpoison_folio_hugetlb() function without
 >> testing if PageBuddy(page) is true first.
 >
 > Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
 > shouldn't call take_page_off_buddy(page) only when PageBuddy(page) is
 > true. take_page_off_buddy() itself checks PageBuddy() on the page
 > heads of the different page orders. So maybe a known poisoned page is
 > somehow not taken off the buddy allocator because of this?
 >
 >
 > Let me try to fix it in v2, by the end of the week. If you could test
 > with your repro as well, that would be very helpful!


Of course, I'll run the test on your v2 version and let you know how it 
goes.
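
To make the change I tested a bit more concrete, this is roughly what it
looks like. Note that this is only a schematic sketch: the real code
lives in your RFC's filemap_offline_hwpoison_folio_hugetlb(), so the
helper below and its surroundings are hypothetical; only the PageBuddy()
gating is the point:

/*
 * Schematic kernel-side sketch (hypothetical helper, not the RFC code):
 * isolate a known poisoned sub-page after the huge folio was dissolved.
 */
static void isolate_poisoned_subpage(struct page *page)
{
	/*
	 * Calling take_page_off_buddy() only when PageBuddy(page) is true
	 * misses the case where the poisoned sub-page is buried inside a
	 * higher-order buddy page. take_page_off_buddy() already walks
	 * the possible orders and checks PageBuddy() on the candidate
	 * page heads itself, so call it unconditionally.
	 */
	if (!take_page_off_buddy(page))
		pr_warn("hwpoison: pfn %#lx not isolated from the buddy allocator\n",
			page_to_pfn(page));
}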


 >> But in my opinion it leaves a (small) race condition where a new
 >> page allocation could get a poisoned sub-page between the dissolve
 >> phase and the attempt to remove it from the buddy allocator.

I still think that the way we recycle the impacted large page leaves a 
(much smaller) race window where a memory allocation can get the 
poisoned page, since the checks that would filter poisoned pages out of 
the freelist are disabled.
I'm not sure we have a way to recycle the page without a moment where 
the poisoned page sits in the freelist.
(I'd be happy to be proven wrong ;) )


 >> If performance requires using HugeTLB pages, then maybe we could
 >> accept losing a huge page after a memory-error-impacted
 >> MFD_MF_KEEP_UE_MAPPED memfd segment is released, if that can easily
 >> avoid some other corruption?

What I meant is: if we don't have a reliable way to recycle an impacted 
large page, we could start with a version of the code where we don't 
recycle it, just to avoid the risk...


 >
 > There is also another possible path if the VMM can switch to backing
 > VM memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In
 > Ackerley's work [1], guest_memfd can split the 1G page for
 > conversions. If we re-use that splitting for memory failure recovery,
 > we can probably achieve something similar to THP's memory failure
 > recovery: split the 1G page into 2M and 4k chunks, then unmap only the
 > poisoned 4k page. We still lose the 1G TLB reach, so the VM may take
 > some performance hit.
 >
 > [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com


Thanks for the pointer.
I personally think that splitting the large page into base pages is 
just fine.
The main benefit I see in this project is to significantly increase the 
probability of surviving a memory error on VMs backed by large pages.

HTH.

Thanks a lot,
William.
