Message-ID: <CAHbLzkqDt0FfNaG7-rxqz4y=3Wu8yiL38FQiH6hVEsAfRBRvuw@mail.gmail.com>
Date: Tue, 21 Oct 2025 12:07:49 -0700
From: Yang Shi <shy828301@...il.com>
To: Zi Yan <ziy@...dia.com>
Cc: David Hildenbrand <david@...hat.com>, linmiaohe@...wei.com, jane.chu@...cle.com, 
	kernel@...kajraghav.com, 
	syzbot+e6367ea2fdab6ed46056@...kaller.appspotmail.com, 
	syzkaller-bugs@...glegroups.com, akpm@...ux-foundation.org, mcgrof@...nel.org, 
	nao.horiguchi@...il.com, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>, 
	Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, Wei Yang <richard.weiyang@...il.com>, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org
Subject: Re: [PATCH v2 2/3] mm/memory-failure: improve large block size folio handling.

On Tue, Oct 21, 2025 at 11:58 AM Zi Yan <ziy@...dia.com> wrote:
>
> On 21 Oct 2025, at 14:28, David Hildenbrand wrote:
>
> > On 21.10.25 17:55, Zi Yan wrote:
> >> On 21 Oct 2025, at 11:44, David Hildenbrand wrote:
> >>
> >>> On 21.10.25 03:23, Zi Yan wrote:
> >>>> On 20 Oct 2025, at 19:41, Yang Shi wrote:
> >>>>
> >>>>> On Mon, Oct 20, 2025 at 12:46 PM Zi Yan <ziy@...dia.com> wrote:
> >>>>>>
> >>>>>> On 17 Oct 2025, at 15:11, Yang Shi wrote:
> >>>>>>
> >>>>>>> On Wed, Oct 15, 2025 at 8:38 PM Zi Yan <ziy@...dia.com> wrote:
> >>>>>>>>
> >>>>>>>> Large block size (LBS) folios cannot be split to order-0 folios, only
> >>>>>>>> down to min_order_for_folio(). The current code fails the split
> >>>>>>>> outright, which is not optimal. Split the folio to
> >>>>>>>> min_order_for_folio() instead, so that after the split only the folio
> >>>>>>>> containing the poisoned page becomes unusable.
> >>>>>>>>
> >>>>>>>> For soft offline, do not split the large folio if it cannot be split
> >>>>>>>> to order-0, since the folio is still accessible from userspace and a
> >>>>>>>> premature split might cause a performance loss.
> >>>>>>>>
> >>>>>>>> Suggested-by: Jane Chu <jane.chu@...cle.com>
> >>>>>>>> Signed-off-by: Zi Yan <ziy@...dia.com>
> >>>>>>>> Reviewed-by: Luis Chamberlain <mcgrof@...nel.org>
> >>>>>>>> ---
> >>>>>>>>    mm/memory-failure.c | 25 +++++++++++++++++++++----
> >>>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>>>>> index f698df156bf8..443df9581c24 100644
> >>>>>>>> --- a/mm/memory-failure.c
> >>>>>>>> +++ b/mm/memory-failure.c
> >>>>>>>> @@ -1656,12 +1656,13 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> >>>>>>>>     * there is still more to do, hence the page refcount we took earlier
> >>>>>>>>     * is still needed.
> >>>>>>>>     */
> >>>>>>>> -static int try_to_split_thp_page(struct page *page, bool release)
> >>>>>>>> +static int try_to_split_thp_page(struct page *page, unsigned int new_order,
> >>>>>>>> +               bool release)
> >>>>>>>>    {
> >>>>>>>>           int ret;
> >>>>>>>>
> >>>>>>>>           lock_page(page);
> >>>>>>>> -       ret = split_huge_page(page);
> >>>>>>>> +       ret = split_huge_page_to_list_to_order(page, NULL, new_order);
> >>>>>>>>           unlock_page(page);
> >>>>>>>>
> >>>>>>>>           if (ret && release)
> >>>>>>>> @@ -2280,6 +2281,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>>>           folio_unlock(folio);
> >>>>>>>>
> >>>>>>>>           if (folio_test_large(folio)) {
> >>>>>>>> +               int new_order = min_order_for_split(folio);
> >>>>>>>>                   /*
> >>>>>>>>                    * The flag must be set after the refcount is bumped
> >>>>>>>>                    * otherwise it may race with THP split.
> >>>>>>>> @@ -2294,7 +2296,14 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>>>                    * page is a valid handlable page.
> >>>>>>>>                    */
> >>>>>>>>                   folio_set_has_hwpoisoned(folio);
> >>>>>>>> -               if (try_to_split_thp_page(p, false) < 0) {
> >>>>>>>> +               /*
> >>>>>>>> +                * If the folio cannot be split to order-0, kill the process,
> >>>>>>>> +                * but split the folio anyway to minimize the amount of unusable
> >>>>>>>> +                * pages.
> >>>>>>>> +                */
> >>>>>>>> +               if (try_to_split_thp_page(p, new_order, false) || new_order) {
> >>>>>>>
> >>>>>>> The folio split will clear the PG_has_hwpoisoned flag. That is fine
> >>>>>>> when splitting to order-0 folios, because the PG_hwpoisoned flag is
> >>>>>>> set on the poisoned page itself. But if you split the folio into
> >>>>>>> smaller-order large folios, it seems you need to keep the
> >>>>>>> PG_has_hwpoisoned flag on the folio containing the poisoned page.
> >>>>>>
> >>>>>> OK, this means all pages in a folio with folio_test_has_hwpoisoned()
> >>>>>> set should be checked, so that the after-split folios' flags can be set
> >>>>>> properly. The current folio split code does not do that. I am thinking
> >>>>>> about whether that causes any issue. Probably not, because:
> >>>>>>
> >>>>>> 1. before Patch 1 is applied, large after-split folios already trigger
> >>>>>> a warning in memory_failure(). That kinda masks this issue.
> >>>>>> 2. after Patch 1 is applied, no large after-split folios will appear,
> >>>>>> since the split will fail.
> >>>>>
> >>>>> I'm a little bit confused. Didn't this patch split the large folio to a
> >>>>> new-order large folio (where the new order is the min order)? This patch
> >>>>> has this code:
> >>>>> if (try_to_split_thp_page(p, new_order, false) || new_order) {
> >>>>
> >>>> Yes, but this is Patch 2 in this series. Patch 1 is
> >>>> "mm/huge_memory: do not change split_huge_page*() target order silently."
> >>>> and was sent separately as a hotfix[1].
> >>>
> >>> I'm confused now as well. I'd like to review; will there be a v3 that only contains patches #2 and #3?
> >>
> >> Yes. The new v3 will have 3 patches:
> >> 1. a new patch addressing Yang's concern about setting has_hwpoisoned
> >> on after-split large folios,
> >> 2. patch #2,
> >> 3. patch #3.
> >
> > Okay, I'll hold off on the review until you resend :)
> >
> >>
> >> The plan is to send them out once patch 1 is upstreamed. Let me know if
> >> you think it is OK to send them out earlier, as Andrew has already picked
> >> up patch 1.
> >
> > It's in mm/mm-new + mm/mm-unstable, AFAICT. So sure, send it against one of the trees (I prefer mm-unstable, but usually we should target mm-new).
>
> Sure.
> >
> >>
> >> I also would like to get some feedback on my approach to setting
> >> has_hwpoisoned:
> >>
> >> The folio's has_hwpoisoned flag needs to be preserved, as Yang described
> >> above. My current plan is to move folio_clear_has_hwpoisoned(folio) into
> >> __split_folio_to_order() and scan every page in the folio when the
> >> folio's has_hwpoisoned flag is set.
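> >>
> >> Roughly, the scan would be (untested sketch; the helper name is made up,
> >> and it would run on each after-split folio where the flag is cleared
> >> today):
> >>
> >> static void folio_fixup_has_hwpoisoned(struct folio *folio)
> >> {
> >>         long i;
> >>
> >>         /* Only large folios carry the per-folio flag. */
> >>         if (!folio_test_large(folio))
> >>                 return;
> >>
> >>         for (i = 0; i < folio_nr_pages(folio); i++) {
> >>                 if (PageHWPoison(folio_page(folio, i))) {
> >>                         folio_set_has_hwpoisoned(folio);
> >>                         return;
> >>                 }
> >>         }
> >> }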
> >
> > Oh, that's nasty indeed ... will have to think about that a bit.
> >
> > Maybe we can keep it simple and always call folio_set_has_hwpoisoned() on all after-split folios? Essentially turning it into "maybe has" semantics.
> >
> > IIUC, the existing folio_test_has_hwpoisoned() users can deal with that?
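> >
> > Completely untested, but I mean something like this in whatever loop
> > walks the after-split folios (variable names made up):
> >
> >         if (folio_test_has_hwpoisoned(orig_folio))
> >                 folio_set_has_hwpoisoned(new_folio);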
>
> The direct users of folio_test_has_hwpoisoned() are fine. They are in
> shmem.c and memory.c: the former would copy data in PAGE_SIZE chunks instead
> of the folio size, and the latter would not install a PMD entry for the folio
> (impossible to hit until we have >PMD mTHPs and split them into PMD THPs).
>
> The callers of folio_contain_hwpoisoned_page(), which calls
> folio_test_has_hwpoisoned(), would have issues:
>
> 1. shmem_write_begin() in shmem.c: it returns -EIO for shmem writes.
> 2. thp_underused() in huge_memory.c: it does not scan the folio.
> 3. shrink_folio_list() in vmscan.c: it does not reclaim large hwpoisoned folios.
> 4. do_migrate_range() in memory_hotplug.c: it skips the large hwpoisoned folios.
>
> These behaviors are fine for folios truly containing hwpoisoned pages,
> but might not be desirable in false-positive cases. A scan to make sure
> hwpoisoned pages are indeed present is inevitable. Rather than making
> all callers do the scan, scanning at split time might be better, IMHO.
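>
> For comparison, with the "maybe has" semantics every caller would need a
> rescan before acting on the flag, something like this (hypothetical helper,
> just to show the per-caller cost):
>
> static bool folio_really_has_hwpoisoned(struct folio *folio)
> {
>         long i;
>
>         if (!folio_test_has_hwpoisoned(folio))
>                 return false;
>
>         /* The flag might be a false positive after a split; rescan. */
>         for (i = 0; i < folio_nr_pages(folio); i++)
>                 if (PageHWPoison(folio_page(folio, i)))
>                         return true;
>
>         return false;
> }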

Yeah, I was trying to figure out a simpler way too. For example, we
could defer setting this flag to page fault time, when the fault handler
sees the poisoned page while installing PTEs. But that can't cover most
of the cases Zi Yan mentioned above; we may run into them before any
page fault happens.
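
For reference, the idea was roughly (untested sketch; it assumes a hook
in the PTE install path such as set_pte_range(), and the helper name is
made up):

/* Hypothetically called for each subpage as its PTE is installed. */
static void folio_note_hwpoison_at_fault(struct folio *folio,
					 struct page *page)
{
	if (folio_test_large(folio) && PageHWPoison(page))
		folio_set_has_hwpoisoned(folio);
}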

Thanks,
Yang

>
> Let me send a patchset with scanning at split time. Hopefully more people
> can chime in and provide feedback.
>
>
> --
> Best Regards,
> Yan, Zi
