Message-ID: <d60770df-a967-498f-9e0e-25a6a145ff55@oracle.com>
Date: Thu, 2 Oct 2025 10:54:09 -0700
From: jane.chu@...cle.com
To: Zi Yan <ziy@...dia.com>
Cc: syzbot <syzbot+e6367ea2fdab6ed46056@...kaller.appspotmail.com>,
        syzkaller-bugs@...glegroups.com, akpm@...ux-foundation.org,
        david@...hat.com, kernel@...kajraghav.com, linmiaohe@...wei.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, mcgrof@...nel.org,
        nao.horiguchi@...il.com
Subject: Re: [syzbot] [mm?] WARNING in memory_failure



On 10/2/2025 6:54 AM, Zi Yan wrote:
> On 2 Oct 2025, at 1:23, jane.chu@...cle.com wrote:
> 
>> On 10/1/2025 7:04 PM, Zi Yan wrote:
>>> On 1 Oct 2025, at 20:38, Zi Yan wrote:
>>>
>>>> On 1 Oct 2025, at 19:58, jane.chu@...cle.com wrote:
>>>>
>>>>> Hi, Zi Yan,
>>>>>
>>>>> On 9/30/2025 9:51 PM, syzbot wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot has tested the proposed patch but the reproducer is still triggering an issue:
>>>>>> lost connection to test machine
>>>>>>
>>>>>>
>>>>>>
>>>>>> Tested on:
>>>>>>
>>>>>> commit:         d8795075 mm/huge_memory: do not change split_huge_page..
>>>>>> git tree:       https://github.com/x-y-z/linux-dev.git fix_split_page_min_order-for-kernelci
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=17ce96e2580000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=714d45b6135c308e
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e6367ea2fdab6ed46056
>>>>>> compiler:       Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
>>>>>> userspace arch: arm64
>>>>>>
>>>>>> Note: no patches were applied.
>>>>>>
>>>>>
>>>>
>>>> Thank you for looking into this.
>>>>
>>>>> My hunch is that
>>>>> https://github.com/x-y-z/linux-dev.git fix_split_page_min_order-for-kernelci
>>>>> alone is not enough.  Perhaps on ARM64, the page cache pages of /dev/nullb0 in
>>>> Yes, it only has the first patch, which fails a split if it cannot be
>>>> split to the intended order (order-0 in this case).
>>>>
>>>>
>>>>> the test case probably have min_order > 0, so the THP split fails, as the console message shows:
>>>>> [  200.378989][T18221] Memory failure: 0x124d30: recovery action for unsplit thp: Failed
>>>>>
>>>>> With lots of poisoned THP pages stuck in the page cache, OOM could trigger too soon.
>>>>
>>>> That is my understanding too. Thanks for the confirmation.
>>>>
>>>>>
>>>>> I think it's worth trying the additional changes I suggested earlier:
>>>>> https://lore.kernel.org/lkml/7577871f-06be-492d-b6d7-8404d7a045e0@oracle.com/
>>>>>
>>>>> That way, in the madvise HWPOISON cases, large huge pages are split into smaller huge pages, and most of them remain usable in the page cache.
>>>>
>>>> Yep, I am going to incorporate your suggestion as the second patch and make
>>>> syzbot check it again.
>>>
>>>
>>> #syz test: https://github.com/x-y-z/linux-dev.git fix_split_page_min_order_and_opt_memory_failure-for-kernelci
>>>
>>
>> There is a bug here,
>>
>> 		if (try_to_split_thp_page(p, new_order, false) || new_order) {
>> 			res = -EHWPOISON;
>> 			kill_procs_now(p, pfn, flags, folio);  <---
>>
>> If try_to_split_thp_page() succeeded on min_order, 'folio' should be retaken:  folio = page_folio(page) before moving on to kill_procs_now().
> 
> Thank you for pointing it out. Let me fix it and let syzbot test it again.

I forgot to ask: even with your current patch, after splitting at
min_order, the old 'folio' should be at min_order as well, just not
necessarily the one where the raw hwpoisoned sub-page resides, right?
If so, then 1) I am wondering about the value of min_order, and 2)
perhaps the syzbot test needs to reduce the number of fork() calls,
since each MADV_HWPOISON injection leaves one page cache page lost and
stuck in the page cache; the only difference is the size of that page
cache page and the number of pages.
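
To illustrate the stale 'folio' concern, here is a minimal user-space
sketch. It is a toy model, not kernel code: head_of[], toy_page_folio(),
toy_split() and toy_demo() are hypothetical names made up for this
example, and a "folio" is represented only by the index of its head
page. It assumes a single order-4 folio that is split down to a
min_order of 2.

```c
#include <assert.h>

/* Toy model only: 16 pages, each recording the head page of its folio. */
#define NPAGES 16u

static unsigned int head_of[NPAGES];

/* Hypothetical stand-in for page_folio(): return the current head page. */
static unsigned int toy_page_folio(unsigned int page)
{
	assert(page < NPAGES);
	return head_of[page];
}

/*
 * Hypothetical stand-in for a split: cut the folio that starts at 'head'
 * and spans (1 << old_order) pages into folios of (1 << new_order) pages.
 */
static void toy_split(unsigned int head, unsigned int old_order,
		      unsigned int new_order)
{
	unsigned int n = 1u << old_order;
	unsigned int step = 1u << new_order;
	unsigned int i;

	assert(new_order <= old_order && head + n <= NPAGES);
	for (i = 0; i < n; i++)
		head_of[head + i] = head + (i / step) * step;
}

/*
 * Record the folio of 'poisoned' before and after splitting one order-4
 * folio to order 2 (the assumed min_order in this sketch).
 */
static void toy_demo(unsigned int poisoned, unsigned int *stale,
		     unsigned int *fresh)
{
	unsigned int i;

	for (i = 0; i < NPAGES; i++)
		head_of[i] = 0;			/* one order-4 folio */

	*stale = toy_page_folio(poisoned);	/* taken before the split */
	toy_split(0, 4, 2);
	*fresh = toy_page_folio(poisoned);	/* retaken after the split */
}
```

In this model, toy_demo(13, &stale, &fresh) leaves stale at head page 0
and fresh at head page 12: the old handle still names a folio of the new
order, but not the one that contains the poisoned page, which is why the
lookup has to be redone after a successful split.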

thanks,
-jane

