Message-ID: <aQxSSjyPsI0MT8mp@harry>
Date: Thu, 6 Nov 2025 16:53:30 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Jiaqi Yan <jiaqiyan@...gle.com>
Cc: Miaohe Lin <linmiaohe@...wei.com>,
William Roche <william.roche@...cle.com>,
Ackerley Tng <ackerleytng@...gle.com>, jgg@...dia.com,
akpm@...ux-foundation.org, ankita@...dia.com,
dave.hansen@...ux.intel.com, david@...hat.com, duenwen@...gle.com,
jane.chu@...cle.com, jthoughton@...gle.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, muchun.song@...ux.dev, nao.horiguchi@...il.com,
osalvador@...e.de, peterx@...hat.com, rientjes@...gle.com,
sidhartha.kumar@...cle.com, tony.luck@...el.com,
wangkefeng.wang@...wei.com, willy@...radead.org, vbabka@...e.cz,
surenb@...gle.com, mhocko@...e.com, jackmanb@...gle.com,
hannes@...xchg.org, ziy@...dia.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
On Mon, Nov 03, 2025 at 08:57:08AM -0800, Jiaqi Yan wrote:
> On Mon, Nov 3, 2025 at 12:53 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> >
> > On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
> > > On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> > > > On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@...wei.com> wrote:
> > > > > On 2025/10/28 15:00, Harry Yoo wrote:
> > > > > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > > > > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@...cle.com> wrote:
> > > > > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > > > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@...cle.com> wrote:
> > > > > >>> But even after fixing that we need to fix the race condition.
> > > > > >>
> > > > > >> What exactly is the race condition you are referring to?
> > > > > >
> > > > > > > When you free a high-order page, the buddy allocator doesn't check
> > > > > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > > > > only when you free a base (order-0) page, see free_pages_prepare().
> > > > >
> > > > > I think we could check PageHWPoison() for subpages, as free_page_is_bad()
> > > > > does. If any subpage has the HWPoison flag set, simply drop the folio. We could even
> > > >
> > > > Agree, I think as a starter I could try to, for example, let
> > > > free_pages_prepare scan HWPoison-ed subpages if the base page is high
> > > > order. In the optimal case, HugeTLB does move PageHWPoison flag from
> > > > head page to the raw error pages.
> > >
> > > [+Cc page allocator folks]
> > >
> > > > AFAICT enabling page sanity checks in the page alloc/free path would go
> > > > against past efforts to reduce sanity check overhead.
> > >
> > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > > I'd recommend checking the hwpoison flag before freeing the page to the buddy
> > > when we know a memory error has occurred (I guess that's also what Miaohe
> > > suggested).
> > >
> > > > > > do it better -- split the folio and let healthy subpages join the buddy while rejecting
> > > > > > the hwpoisoned ones.
> > > > >
> > > > > >
> > > > > > > AFAICT there is nothing that prevents the poisoned page from being
> > > > > > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > > > > > on allocation either (by default).
> > > > > >
> > > > > > So rather than freeing the high-order page as-is in
> > > > > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > > > > and then free them one by one.
> > > > >
> > > > > > It might not be worth doing that, as it would significantly increase the overhead
> > > > > > of the function while memory failure events are really rare.
> > > >
> > > > IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> > > > only if folio is HWPoison-ed, similar to what Miaohe suggested
> > > > earlier.
> > >
> > > Yes, and if we do the check before moving HWPoison flag to raw pages,
> > > it'll be just a single folio_test_hwpoison() call.
> > >
> > > > BTW, I believe this race condition already exists today when
> > > > memory_failure handles HWPoison-ed free hugetlb page; it is not
> > > > something introduced via this patchset. I will fix or improve this in
> > > > a separate patchset.
> > >
> > > That makes sense.
> >
> > Wait, without this patchset, do we even free the hugetlb folio when
> > its subpage is hwpoisoned? I don't think we do, but I'm not an expert on MFR...
>
> Based on my reading of try_memory_failure_hugetlb, me_huge_page, and
> __page_handle_poison, I think mainline kernel frees dissolved hugetlb
> folio to buddy allocator in two cases:
> 1. it was a free hugetlb page at the moment of try_memory_failure_hugetlb

Right.

> 2. it was an anonymous hugetlb page

Right.

Thanks. I think you're right that poisoned hugetlb folios can be freed
to the buddy even without this series (and poisoned pages can then be
allocated back to users instead of being isolated, due to missing
PageHWPoison() checks on alloc/free).

So the plan is to post RFC v2 of this series and the race condition fix
as a separate series, right? (that sounds good to me!)

I still think it'd be best to split the hugetlb folio to order-0 pages and
free them when we know the hugetlb folio is poisoned because:

- We don't have to implement a special version of __free_pages() that
  knows how to handle freeing of a high-order page where one or more of
  its sub-pages are poisoned.

- We can avoid re-enabling page sanity checks (and introducing overhead)
  all the time.
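
To make the argument above concrete, here is a toy userspace model of the
two free paths. It is only a sketch, not kernel code: struct toy_page and
every toy_* function are hypothetical names invented for illustration, and
the model assumes only what is described in this thread, namely that the
poison check happens on the order-0 free path and not on the high-order one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct toy_page {
	bool hwpoison;	/* models PageHWPoison() */
	bool in_buddy;	/* true once the page is on the toy free list */
};

/* Order-0 free: the only path in this model that looks at the poison flag. */
static bool toy_free_base_page(struct toy_page *page)
{
	assert(page != NULL);
	if (page->hwpoison)
		return false;	/* rejected, never becomes allocatable */
	page->in_buddy = true;
	return true;
}

/* High-order free as-is: no per-subpage check, so poison leaks through. */
static void toy_free_high_order(struct toy_page *pages, unsigned int order)
{
	size_t nr = (size_t)1 << order;

	for (size_t i = 0; i < nr; i++)
		pages[i].in_buddy = true;
}

/* Split to order-0 and free one by one; returns the number of pages freed. */
static size_t toy_split_and_free(struct toy_page *pages, unsigned int order)
{
	size_t nr = (size_t)1 << order;
	size_t freed = 0;

	for (size_t i = 0; i < nr; i++)
		if (toy_free_base_page(&pages[i]))
			freed++;
	return freed;
}

/* Count poisoned pages that ended up allocatable (ideally zero). */
static size_t toy_poisoned_in_buddy(const struct toy_page *pages,
				    unsigned int order)
{
	size_t nr = (size_t)1 << order;
	size_t count = 0;

	for (size_t i = 0; i < nr; i++)
		if (pages[i].hwpoison && pages[i].in_buddy)
			count++;
	return count;
}
```

In this model, freeing an order-3 block with one poisoned subpage via
toy_free_high_order() leaves that subpage allocatable, while
toy_split_and_free() frees the seven healthy pages and keeps the poisoned
one out, without adding any check to the high-order path.
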
--
Cheers,
Harry / Hyeonggon