linux-kernel - Re: [syzbot] [mm?] kernel BUG in try_to_unmap

Message-ID: <0eb768f1-03fc-7cfd-6e1b-594f76a830b3@huawei.com>
Date: Mon, 9 Jun 2025 16:35:09 +0800
From: Miaohe Lin <linmiaohe@...wei.com>
To: Jinjiang Tu <tujinjiang@...wei.com>, David Hildenbrand <david@...hat.com>
CC: syzbot <syzbot+3b220254df55d8ca8a61@...kaller.appspotmail.com>,
	<Liam.Howlett@...cle.com>, <akpm@...ux-foundation.org>,
	<harry.yoo@...cle.com>, <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
	<lorenzo.stoakes@...cle.com>, <riel@...riel.com>,
	<syzkaller-bugs@...glegroups.com>, <vbabka@...e.cz>, Jens Axboe
	<axboe@...nel.dk>, Catalin Marinas <catalin.marinas@....com>
Subject: Re: [syzbot] [mm?] kernel BUG in try_to_unmap_one (2)

On 2025/6/7 9:29, Jinjiang Tu wrote:
> 
> On 2025/6/6 15:56, David Hildenbrand wrote:
>> On 05.06.25 09:18, Jinjiang Tu wrote:
>>>
>>> On 2025/6/5 14:37, David Hildenbrand wrote:
>>>> On 05.06.25 08:27, David Hildenbrand wrote:
>>>>> On 05.06.25 08:11, David Hildenbrand wrote:
>>>>>> On 05.06.25 07:38, syzbot wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> syzbot found the following issue on:
>>>>>>>
>>>>>>> HEAD commit:    d7fa1af5b33e Merge branch 'for-next/core' into
>>>>>>> for-kernelci
>>>>>>
>>>>>> Hmmm, another very odd page-table mapping related problem on that tree
>>>>>> found on arm64 only:
>>>>>
>>>>> In this particular reproducer we seem to be having MADV_HUGEPAGE and
>>>>> io_uring_setup() be racing with MADV_HWPOISON, MADV_PAGEOUT and
>>>>> io_uring_register(IORING_REGISTER_BUFFERS).
>>>>>
>>>>> I assume the issue is related to MADV_HWPOISON, MADV_PAGEOUT and
>>>>> io_uring_register racing, only. I suspect MADV_HWPOISON is trying to
>>>>> split a THP, while MADV_PAGEOUT tries paging it out.
>>>>>
>>>>> IORING_REGISTER_BUFFERS ends up in
>>>>> io_sqe_buffers_register->io_sqe_buffer_register where we GUP-fast and
>>>>> try coalescing buffers.
>>>>>
>>>>> And something about THPs is not particularly happy :)
>>>>>
>>>>
>>>> Not sure if related to io_uring.
>>>>
>>>> unmap_poisoned_folio() calls try_to_unmap() without TTU_SPLIT_HUGE_PMD.
>>>>
>>>> When called from memory_failure(), we make sure to never call it on a
>>>> large folio: WARN_ON(folio_test_large(folio));
>>>>
>>>> However, from shrink_folio_list() we might call unmap_poisoned_folio()
>>>> on a large folio, which doesn't work if it is still PMD-mapped. Maybe
>>>> passing TTU_SPLIT_HUGE_PMD would fix it.
>>>>
>>> TTU_SPLIT_HUGE_PMD only converts the PMD-mapped THP to PTE-mapped THP, and may trigger the below WARN_ON_ONCE in try_to_unmap_one.
>>>
>>>     if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
>>>         ...
>>>     } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
>>>         !userfaultfd_armed(vma)) {
>>>          ....
>>>     } else if (folio_test_anon(folio)) {
>>>         swp_entry_t entry = page_swap_entry(subpage);
>>>         pte_t swp_pte;
>>>         /*
>>>          * Store the swap location in the pte.
>>>          * See handle_pte_fault() ...
>>>         */
>>>         if (unlikely(folio_test_swapbacked(folio) !=
>>>             folio_test_swapcache(folio))) {
>>>             WARN_ON_ONCE(1);          // here, if the subpage isn't hwpoisoned and we haven't called add_to_swap() for the THP
>>>             goto walk_abort;
>>>          }
>>
>> This makes me wonder if we should start splitting up try_to_unmap(), to handle the individual cases more cleanly at some point ...
>>
>> Maybe for now something like:
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index b91a33fb6c694..995486a3ff4d2 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1566,6 +1566,14 @@ int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
>>         enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>>         struct address_space *mapping;
>>
>> +       /*
>> +        * try_to_unmap() cannot deal with some subpages of an anon folio
>> +        * not being hwpoisoned: we cannot unmap them without swap.
>> +        */
>> +       if (folio_test_large(folio) && !folio_test_hugetlb(folio) &&
>> +           folio_test_anon(folio) && !folio_test_swapcache(folio))
>> +               return -EBUSY;
>> +
> 
> If the THP is in swapcache, we also have to split PMD-mapped to PTE-mapped first.
> 
>>         if (folio_test_swapcache(folio)) {
>>                 pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
>>                 ttu &= ~TTU_HWPOISON;
>>
>>
>>
>>>
>>> If we want to unmap in shrink_folio_list(), we have to call try_to_split_thp_page() like memory_failure() does. But that is too complicated; maybe just skipping the
>>> hwpoisoned folio is enough? If the folio is accessed again, memory_failure() will be triggered again and kill the accessing process, since the folio
>>> has been hwpoisoned.
>>
>>
>> Maybe we should try splitting in there? But staring at shrink_folio_list(), not that easy.
>>
>> We could return -E2BIG and let the caller try splitting, to then retry.
> 
> Since UCEs are rare in the real world, and a UCE racing with any other subsystem is rarer still, spending much effort on handling UCEs in other
> subsystems is pointless and adds complexity. Just skipping is enough. memory_failure() will handle it if the UCE is triggered again.

IMHO, unmap_poisoned_folio() is designed for base pages only, not for large folios. The race above should be really rare in the real world,
so it might be better to ignore large folios in unmap_poisoned_folio(); memory_failure() will handle all of these when the UCE is re-triggered.
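
To make the "ignore large folios" idea concrete, here is a rough userspace C sketch of the decision only. It is not kernel code and not a proposed patch. struct folio_model and unmap_poisoned_folio_model() are hypothetical names made up for illustration. The hugetlb exemption and the -EBUSY return value are assumptions carried over from David's diff quoted above. The real change might look different.

```c
#include <stdbool.h>
#include <errno.h>

/*
 * Hypothetical userspace stand-in for the folio state discussed in this
 * thread. The fields model the results of folio_test_large() and
 * folio_test_hugetlb(); nothing here touches real kernel structures.
 */
struct folio_model {
	bool large;	/* models folio_test_large(folio) */
	bool hugetlb;	/* models folio_test_hugetlb(folio) */
};

/*
 * Models the suggested policy: skip large (non-hugetlb) folios instead of
 * trying to unmap them, and leave them for memory_failure() to handle when
 * the error is hit again.
 *
 * Returns -EBUSY when the folio is skipped (assumption: same error code as
 * in the quoted diff), 0 when the unmap would be attempted.
 */
static int unmap_poisoned_folio_model(const struct folio_model *folio)
{
	if (folio->large && !folio->hugetlb)
		return -EBUSY;

	/* Base page or hugetlb folio: the existing unmap path would run here. */
	return 0;
}
```

With this model, a THP (large = true, hugetlb = false) is skipped with -EBUSY, while a base page or a hugetlb folio still goes through the unmap path.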

Thanks both.