Message-ID: <B0781266-D168-4DCB-BFCE-3EA01F43F184@nvidia.com>
Date: Wed, 24 Sep 2025 12:33:30 -0400
From: Zi Yan <ziy@...dia.com>
To: David Hildenbrand <david@...hat.com>,
 Luis Chamberlain <mcgrof@...nel.org>,
 "Pankaj Raghav (Samsung)" <kernel@...kajraghav.com>
Cc: syzbot <syzbot+e6367ea2fdab6ed46056@...kaller.appspotmail.com>,
 akpm@...ux-foundation.org, linmiaohe@...wei.com,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, nao.horiguchi@...il.com,
 syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [mm?] WARNING in memory_failure

On 24 Sep 2025, at 11:35, David Hildenbrand wrote:

> On 24.09.25 17:03, Zi Yan wrote:
>> On 24 Sep 2025, at 7:32, David Hildenbrand wrote:
>>
>>> On 23.09.25 18:22, syzbot wrote:
>>>> Hello,
>>>>
>>>> syzbot found the following issue on:
>>>>
>>>> HEAD commit:    b5db4add5e77 Merge branch 'for-next/core' into for-kernelci
>>>> git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=10edb8e2580000
>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=d2ae34a0711ff2f1
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e6367ea2fdab6ed46056
>>>> compiler:       Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
>>>> userspace arch: arm64
>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14160f12580000
>>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1361627c580000
>>>>
>>>> Downloadable assets:
>>>> disk image: https://storage.googleapis.com/syzbot-assets/6eee2232d5c1/disk-b5db4add.raw.xz
>>>> vmlinux: https://storage.googleapis.com/syzbot-assets/a8b00f2f1234/vmlinux-b5db4add.xz
>>>> kernel image: https://storage.googleapis.com/syzbot-assets/fc0d466f156c/Image-b5db4add.gz.xz
>>>>
>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>>> Reported-by: syzbot+e6367ea2fdab6ed46056@...kaller.appspotmail.com
>>>>
>>>> Injecting memory failure for pfn 0x104000 at process virtual address 0x20000000
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 1 PID: 6700 at mm/memory-failure.c:2391 memory_failure+0x18ec/0x1db4 mm/memory-failure.c:2391
>>>> Modules linked in:
>>>> CPU: 1 UID: 0 PID: 6700 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT
>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/30/2025
>>>> pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>>>> pc : memory_failure+0x18ec/0x1db4 mm/memory-failure.c:2391
>>>> lr : memory_failure+0x18ec/0x1db4 mm/memory-failure.c:2391
>>>> sp : ffff8000a41478c0
>>>> x29: ffff8000a41479a0 x28: 05ffc00000200868 x27: ffff700014828f20
>>>> x26: 1fffffbff8620001 x25: 05ffc0000020086d x24: 1fffffbff8620000
>>>> x23: fffffdffc3100008 x22: fffffdffc3100000 x21: fffffdffc3100000
>>>> x20: 0000000000000023 x19: dfff800000000000 x18: 1fffe00033793888
>>>> x17: ffff80008f7ee000 x16: ffff80008052aa64 x15: 0000000000000001
>>>> x14: 1fffffbff8620000 x13: 0000000000000000 x12: 0000000000000000
>>>> x11: ffff7fbff8620001 x10: 0000000000ff0100 x9 : 0000000000000000
>>>> x8 : ffff0000d7eedb80 x7 : ffff800080428910 x6 : 0000000000000000
>>>> x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800080cf5438
>>>> x2 : 0000000000000001 x1 : 0000000000000040 x0 : 0000000000000000
>>>> Call trace:
>>>>    memory_failure+0x18ec/0x1db4 mm/memory-failure.c:2391 (P)
>>>>    madvise_inject_error mm/madvise.c:1475 [inline]
>>>>    madvise_do_behavior+0x2c8/0x7c4 mm/madvise.c:1875
>>>>    do_madvise+0x190/0x248 mm/madvise.c:1978
>>>>    __do_sys_madvise mm/madvise.c:1987 [inline]
>>>>    __se_sys_madvise mm/madvise.c:1985 [inline]
>>>>    __arm64_sys_madvise+0xa4/0xc0 mm/madvise.c:1985
>>>>    __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>>>>    invoke_syscall+0x98/0x254 arch/arm64/kernel/syscall.c:49
>>>>    el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
>>>>    do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
>>>>    el0_svc+0x5c/0x254 arch/arm64/kernel/entry-common.c:744
>>>>    el0t_64_sync_handler+0x84/0x12c arch/arm64/kernel/entry-common.c:763
>>>>    el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:596
>>>
>>> We're running into the
>>>
>>>          WARN_ON(folio_test_large(folio));
>>>
>>> in memory_failure().
>>>
>>> Which is weird because we have the
>>>
>>>          if (folio_test_large(folio)) {
>>>                  /*
>>>                   * The flag must be set after the refcount is bumped
>>>                   * otherwise it may race with THP split.
>>>                   * And the flag can't be set in get_hwpoison_page() since
>>>                   * it is called by soft offline too and it is just called
>>>                   * for !MF_COUNT_INCREASED.  So here seems to be the best
>>>                   * place.
>>>                   *
>>>                   * Don't need care about the above error handling paths for
>>>                   * get_hwpoison_page() since they handle either free page
>>>                   * or unhandlable page.  The refcount is bumped iff the
>>>                   * page is a valid handlable page.
>>>                   */
>>>                  folio_set_has_hwpoisoned(folio);
>>>                  if (try_to_split_thp_page(p, false) < 0) {
>>>                          res = -EHWPOISON;
>>>                          kill_procs_now(p, pfn, flags, folio);
>>>                          put_page(p);
>>>                          action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);
>>>                          goto unlock_mutex;
>>>                  }
>>>                  VM_BUG_ON_PAGE(!page_count(p), p);
>>>                  folio = page_folio(p);
>>>          }
>>>
>>> before it.
>>>
>>> But likely that's what I raised to Zi Yan recently: if try_to_split_thp_page()->split_huge_page()
>>> silently decided to split to something that is not a small folio (the min_order_for_split() bit),
>>> this changed the semantics of the function.
>>>
>>> Likely split_huge_page() should have failed if the min_order makes us not split to order-0,
>>> or there would have to be some "parameter" that tells split_huge_page() what expectation (order) the
>>> caller has.
>>>
>>> We can check folio_test_large() after the split, but really, we should just not be splitting at
>>> all if it doesn't serve our purpose.
>>
>> But LBS might want to split from a high order to fs min_order.
>
> Yes.
>
>>
>> What I can think of is:
>> 0. the split code always splits to the allowed minimal order,
>>     namely max(fs_min_order, order_from_caller);
>
> Wouldn't max mean "allowed maximum order" ?
>
> I guess what you mean is "split to this order or smaller" -- min?

But LBS imposes an fs_min_order that is not 0. When a caller asks
to split to order-0, the folio split code needs to use fs_min_order
instead of 0. Hence the max.
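
In other words, the clamp at the top of the split path would look
something like this (just a sketch, untested; clamp_split_order() is a
made-up helper, mapping_min_folio_order() is the existing one):

	/*
	 * Sketch only: clamp the order a caller requests to what the
	 * mapping allows, so an LBS mapping never ends up with folios
	 * below its block size.
	 */
	static unsigned int clamp_split_order(struct address_space *mapping,
					      unsigned int requested_order)
	{
		unsigned int min_order = mapping ?
					 mapping_min_folio_order(mapping) : 0;

		return max(min_order, requested_order);
	}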

>
>> 1. if the split cannot reach order_from_caller, it just returns failure,
>>     so most of the callers will know about it;
>
> Yes, I think this would be the case here: if we cannot split to order-0, we can just fail right away.
>
>> 2. for LBS code, when it sees a split failure, it should check the resulting
>>     folio order against the fs min_order. If the orders match, it can
>>     regard it as a success.
>>
>> At least, most of the code does not need to be LBS aware. WDYT?
>
> Is my understanding correct that the caller either wants to
>
> (a) Split to order-0 -- no larger folio afterwards.
>
> (b) Split to smallest order possible, which might be the mapping min order.

Right. IIRC, most of the callers are (a), since folio split was originally
called by code that cannot handle THPs (now large folios). For (b),
I actually wonder whether such a caller exists.
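
For reference, the two expectations as code (sketch, untested; the (b)
caller is hypothetical):

	/* (a) the caller needs small folios; the split must fail otherwise */
	if (split_huge_page(page))
		goto err;

	/* (b) a caller that is fine with the mapping min order */
	err = split_folio_to_order(folio, mapping_min_folio_order(mapping));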

> If so, we could keep the interface simpler than allowing to specify arbitrary orders as request.

We might just need (a), since there is no caller of (b) in the kernel,
except that split_folio_to_order() is used for testing. There might be
future uses when the kernel wants to convert a THP to an mTHP, but it
seems we are not there yet.
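
On the split code side the change could be as small as (sketch, untested,
reusing the existing min_order_for_split() naming):

	/*
	 * Fail instead of silently raising new_order to the fs min
	 * order, so the caller can decide what to do.
	 */
	min_order = min_order_for_split(folio);
	if (new_order < min_order)
		return -EINVAL;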



+Luis and Pankaj for their opinions on how LBS is going to use splitting
a folio to an arbitrary order.

Hi Luis and Pankaj,

It seems that bumping the split order from 0 to mapping_min_folio_order()
instead of simply failing the split call surprises some callers and causes
issues like the one reported in this email. I cannot think of any situation
where failing the folio split would not work. If LBS code wants to split,
it should supply mapping_min_folio_order() itself, right? Does such a
caller exist?
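
Under that scheme an LBS-aware caller would simply do (sketch, untested):

	/*
	 * Ask for the mapping's min order explicitly instead of relying
	 * on the split code to bump order 0 behind the caller's back.
	 */
	ret = split_folio_to_order(folio,
				   mapping_min_folio_order(folio->mapping));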

Thanks.


Best Regards,
Yan, Zi
