Message-ID: <aQBqGupCN_v8ysMX@hyeyoo>
Date: Tue, 28 Oct 2025 16:00:42 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Jiaqi Yan <jiaqiyan@...gle.com>
Cc: “William Roche <william.roche@...cle.com>,
        Ackerley Tng <ackerleytng@...gle.com>, jgg@...dia.com,
        akpm@...ux-foundation.org, ankita@...dia.com,
        dave.hansen@...ux.intel.com, david@...hat.com, duenwen@...gle.com,
        jane.chu@...cle.com, jthoughton@...gle.com, linmiaohe@...wei.com,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, muchun.song@...ux.dev, nao.horiguchi@...il.com,
        osalvador@...e.de, peterx@...hat.com, rientjes@...gle.com,
        sidhartha.kumar@...cle.com, tony.luck@...el.com,
        wangkefeng.wang@...wei.com, willy@...radead.org
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd

On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > On Fri, Sep 19, 2025 at 8:58 AM "William Roche" <william.roche@...cle.com> wrote:
> > > >
> > > > From: William Roche <william.roche@...cle.com>
> > > >
> > > > Hello,
> > > >
> > > > The ability to keep a VM using large hugetlbfs pages running after a memory
> > > > error is very important, and the approach described here could be a good
> > > > candidate to address this issue.
> > >
> > > Thanks for expressing interest, William, and sorry for getting back to
> > > you so late.
> > >
> > > >
> > > > So I would like to provide my feedback after testing this code with the
> > > > introduction of persistent errors in the address space: My tests used a VM
> > > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > > test program provided with this project. But instead of injecting the errors
> > > > with madvise calls from this program, I get the guest physical address of a
> > > > location and inject the error from the hypervisor into the VM, so that any
> > > > subsequent access to the location is prevented directly from the hypervisor
> > > > level.
> > >
> > > This is exactly what VMM should do: when it owns or manages the VM
> > > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > > such memory accesses.
> > >
> > > >
> > > > Using this framework, I realized that the code provided here has a problem:
> > > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > > a known poisoned page to get_page_from_freelist().
> > >
> > > Just curious, how exactly can you repro this leak of a known poisoned
> > > page? It may help me debug my patch.
> > >
> > > >
> > > > This revealed some mm limitations, as I would have expected that the
> > > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > > and testing but didn't notice any "bad page" WARNING; it is very
> > > likely I was just lucky.
> > >
> > > >
> > > >
> > > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > > PageBuddy(page) is true first.
> > >
> > > Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
> > > shouldn't make its call to take_page_off_buddy(page) conditional on
> > > PageBuddy(page). take_page_off_buddy() checks PageBuddy itself, on the
> > > page_head at each page order. So maybe a known poisoned page is somehow
> > > not taken off the buddy allocator because of this?
> >
> > Maybe it's the case where the poisoned page is merged into a larger page,
> > and the PGTY_buddy flag is set on the poisoned page's buddy, so
> > PageBuddy() on the poisoned page returns false?:
> >
> >   [ free page A ][ free page B (poisoned) ]
> >
> > When these two are merged, then we set PGTY_buddy on page A but not on B.
> 
> Thanks Harry!
>
> It is indeed this case. I validated it by adding some debug prints in
> take_page_off_buddy:
> 
> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
> 
> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> 0x2800000 with order 10.

Woohoo, I got it right!

> > But even after fixing that we need to fix the race condition.
> 
> What exactly is the race condition you are referring to?

When you free a high-order page, the buddy allocator doesn't check
PageHWPoison() on the page and its subpages. It checks PageHWPoison()
only when you free a base (order-0) page, see free_pages_prepare().

AFAICT there is nothing that prevents the poisoned page from being
allocated back to users, because the buddy doesn't check PageHWPoison()
on allocation either (by default).

So rather than freeing the high-order page as-is in
dissolve_free_hugetlb_folio(), I think we have to split it into base
pages and then free them one by one.

That way, free_pages_prepare() will catch that a page is poisoned and
won't add it back to the freelist. Otherwise there will always be a
window where the poisoned page can be allocated to users - before it's
taken off the buddy.

-- 
Cheers,
Harry / Hyeonggon
