linux-kernel - Re: [syzbot] [mm?] kernel BUG in try_to_unmap

Message-ID: <3c56a8ba-235b-3fe9-d988-dd750bc0feaa@huawei.com>
Date: Sat, 7 Jun 2025 09:29:38 +0800
From: Jinjiang Tu <tujinjiang@...wei.com>
To: David Hildenbrand <david@...hat.com>, syzbot
	<syzbot+3b220254df55d8ca8a61@...kaller.appspotmail.com>,
	<Liam.Howlett@...cle.com>, <akpm@...ux-foundation.org>,
	<harry.yoo@...cle.com>, <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
	<lorenzo.stoakes@...cle.com>, <riel@...riel.com>,
	<syzkaller-bugs@...glegroups.com>, <vbabka@...e.cz>, Jens Axboe
	<axboe@...nel.dk>, Catalin Marinas <catalin.marinas@....com>, Miaohe Lin
	<linmiaohe@...wei.com>
Subject: Re: [syzbot] [mm?] kernel BUG in try_to_unmap_one (2)


On 2025/6/6 15:56, David Hildenbrand wrote:
> On 05.06.25 09:18, Jinjiang Tu wrote:
>>
>> On 2025/6/5 14:37, David Hildenbrand wrote:
>>> On 05.06.25 08:27, David Hildenbrand wrote:
>>>> On 05.06.25 08:11, David Hildenbrand wrote:
>>>>> On 05.06.25 07:38, syzbot wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following issue on:
>>>>>>
>>>>>> HEAD commit:    d7fa1af5b33e Merge branch 'for-next/core' into
>>>>>> for-kernelci
>>>>>
>>>>> Hmmm, another very odd page-table mapping related problem on that
>>>>> tree, found on arm64 only:
>>>>
>>>> In this particular reproducer we seem to be having MADV_HUGEPAGE and
>>>> io_uring_setup() be racing with MADV_HWPOISON, MADV_PAGEOUT and
>>>> io_uring_register(IORING_REGISTER_BUFFERS).
>>>>
>>>> I assume the issue is related to MADV_HWPOISON, MADV_PAGEOUT and
>>>> io_uring_register racing, only. I suspect MADV_HWPOISON is trying to
>>>> split a THP, while MADV_PAGEOUT tries paging it out.
>>>>
>>>> IORING_REGISTER_BUFFERS ends up in
>>>> io_sqe_buffers_register->io_sqe_buffer_register where we GUP-fast and
>>>> try coalescing buffers.
>>>>
>>>> And something about THPs is not particularly happy :)
>>>>
>>>
>>> Not sure if related to io_uring.
>>>
>>> unmap_poisoned_folio() calls try_to_unmap() without TTU_SPLIT_HUGE_PMD.
>>>
>>> When called from memory_failure(), we make sure to never call it on a
>>> large folio: WARN_ON(folio_test_large(folio));
>>>
>>> However, from shrink_folio_list() we might call unmap_poisoned_folio()
>>> on a large folio, which doesn't work if it is still PMD-mapped. Maybe
>>> passing TTU_SPLIT_HUGE_PMD would fix it.
>>>
>> TTU_SPLIT_HUGE_PMD only converts the PMD-mapped THP to PTE-mapped 
>> THP, and may trigger the below WARN_ON_ONCE in try_to_unmap_one.
>>
>>     if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
>>         ...
>>     } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
>>         !userfaultfd_armed(vma)) {
>>          ....
>>     } else if (folio_test_anon(folio)) {
>>         swp_entry_t entry = page_swap_entry(subpage);
>>         pte_t swp_pte;
>>         /*
>>          * Store the swap location in the pte.
>>          * See handle_pte_fault() ...
>>         */
>>         if (unlikely(folio_test_swapbacked(folio) !=
>>             folio_test_swapcache(folio))) {
>>             /*
>>              * Triggers here if the subpage isn't hwpoisoned and we
>>              * haven't called add_to_swap() for the THP.
>>              */
>>             WARN_ON_ONCE(1);
>>             goto walk_abort;
>>          }
>
> This makes me wonder if we should start splitting up try_to_unmap(), 
> to handle the individual cases more cleanly at some point ...
>
> Maybe for now something like:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b91a33fb6c694..995486a3ff4d2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1566,6 +1566,14 @@ int unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool must_kill)
>         enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;
>         struct address_space *mapping;
>
> +       /*
> +        * try_to_unmap() cannot deal with some subpages of an anon folio
> +        * not being hwpoisoned: we cannot unmap them without swap.
> +        */
> +       if (folio_test_large(folio) && !folio_test_hugetlb(folio) &&
> +           folio_test_anon(folio) && !folio_test_swapcache(folio))
> +               return -EBUSY;
> +

If the THP is in the swap cache, we still have to split it from PMD-mapped to PTE-mapped first.

>         if (folio_test_swapcache(folio)) {
>                 pr_err("%#lx: keeping poisoned page in swap cache\n", pfn);
>                 ttu &= ~TTU_HWPOISON;
>
>
>
>>
>> If we want to unmap in shrink_folio_list(), we have to call
>> try_to_split_thp_page() like memory_failure() does. But that's too
>> complicated; maybe just skipping the hwpoisoned folio is enough? If the
>> folio is accessed again, memory_failure() will be triggered again and
>> kill the accessing process, since the folio has been hwpoisoned.
>
>
> Maybe we should try splitting in there? But staring at 
> shrink_folio_list(), not that easy.
>
> We could return -E2BIG and let the caller try splitting, to then retry.

Since UCEs are rare in the real world, and a UCE racing with another subsystem is rarer still, spending much time handling UCEs in other subsystems is
pointless and complicated. Just skipping is enough. memory_failure() will handle it if the UCE is triggered again.
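
To make the two options concrete, here is a rough userspace model of the decision logic, not kernel code. struct folio_model, unmap_poisoned_folio_model() and reclaim_scan_model() are hypothetical names invented for this sketch. The first function mirrors the early-exit check in the diff above; the second models the "just skip" policy for reclaim, assuming that skipping applies to large non-hugetlb hwpoisoned folios.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the folio state discussed in this thread. */
struct folio_model {
    bool large;      /* models folio_test_large() */
    bool hugetlb;    /* models folio_test_hugetlb() */
    bool anon;       /* models folio_test_anon() */
    bool swapcache;  /* models folio_test_swapcache() */
    bool hwpoison;   /* folio contains at least one poisoned page */
};

/*
 * Models the check proposed in the diff: refuse a large, non-hugetlb,
 * anonymous folio that is not in the swap cache, because its
 * non-poisoned subpages cannot be unmapped without swap entries.
 */
static int unmap_poisoned_folio_model(const struct folio_model *f)
{
    assert(f != NULL);
    if (f->large && !f->hugetlb && f->anon && !f->swapcache)
        return -EBUSY;
    return 0; /* the real function would go on to call try_to_unmap() */
}

/*
 * Models the "just skip" policy (an assumption about how it would look):
 * reclaim leaves large non-hugetlb hwpoisoned folios alone and relies on
 * memory_failure() to handle them on the next access.
 * Returns the number of skipped folios; *reclaimed counts the rest.
 */
static size_t reclaim_scan_model(const struct folio_model *folios, size_t n,
                                 size_t *reclaimed)
{
    size_t skipped = 0;

    *reclaimed = 0;
    for (size_t i = 0; i < n; i++) {
        const struct folio_model *f = &folios[i];

        if (f->hwpoison && f->large && !f->hugetlb) {
            skipped++;      /* no split, no unmap attempt */
            continue;
        }
        if (f->hwpoison && unmap_poisoned_folio_model(f) != 0) {
            skipped++;      /* unmap refused, keep the folio */
            continue;
        }
        (*reclaimed)++;
    }
    return skipped;
}
```

With the skip in place, reclaim never hands a large poisoned folio to the unmap path, so the swap cache and PMD-mapped cases do not need separate handling there.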

CC Miaohe Lin