Message-ID: <92076c0e-1eee-66a4-6342-202989c32955@redhat.com>
Date: Mon, 30 Jan 2023 10:26:16 +0100
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Matthew Wilcox <willy@...radead.org>,
Sanan Hasanov <sanan.hasanov@...ghts.ucf.edu>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"contact@...zz.com" <contact@...zz.com>,
"syzkaller@...glegroups.com" <syzkaller@...glegroups.com>,
Huang Ying <ying.huang@...el.com>
Subject: Re: kernel BUG in page_add_anon_rmap
On 30.01.23 10:03, David Hildenbrand wrote:
>>>>
>>>> I reproduced on next-20230127 (did not try upstream yet).
>>
>> Upstream's fine; on next-20230127 (with David's repro) it bisects to
>> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
>> hand this on to Liam, rather than delay you by puzzling over it further
>> myself.
>>
>
> Thanks for identifying the problematic commit! ...
>
>>>>
>>>> I think two key things are that a) THP are set to "always" and b) we have a
>>>> NUMA setup [I assume].
>>>>
>>>> The relevant bits:
>>>>
>>>> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>>> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
>>>> [ 439.899611] memcg:ffff986dc4689000
>>>> [ 439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>>> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
>>>> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
>>>> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
>>>> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
>>>> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((rmap_t)((((1UL))) << (0)))))
>>>>
>>>>
>>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is
>>>> only mapped into a single page table (no fork() or similar).
>>
>> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
>> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
>> index:0x20003?
>
> I was assuming the whole folio would get migrated. As you raise below,
> it's all a bit unclear what the right behavior is once THPs get involved
> with mbind() and page migration.
>
>>>>
>>>> I created this reduced reproducer that triggers 100%:
>>
>> Very helpful, thank you.
>>
>>>>
>>>>
>>>> #include <stdint.h>
>>>> #include <unistd.h>
>>>> #include <sys/mman.h>
>>>> #include <numaif.h>
>>>>
>>>> int main(void)
>>>> {
>>>>     mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>>>          MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>>>     madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>>
>>>>     *(uint32_t*)0x20000080 = 0x80000;
>>>>     mlock((void*)0x20001000ul, 0x2000ul);
>>>>     mlock((void*)0x20000000ul, 0x3000ul);
>>
>> It's not an mlock() issue in particular: quickly established by
>> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
>> Looks like a vma splitting issue now.
>
> Gah, should have tried something like that first before suspecting it's
> mlock related. :)
>
>>
>>>>     mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>>>           MPOL_MF_MOVE);
>>>> }
>>
>> I guess it will turn out not to be relevant to this particular syzbug,
>> but what do we expect an mbind() of just 0x1000 of a THP to do?
>>
>> It's a subject I've wrestled with unsuccessfully in the past: I found
>> myself arriving at one conclusion (split THP) in one place, and a contrary
>> conclusion (widen range) in another place, and never had time to work out
>> one unified answer.
>
> I'm aware of a similar issue with long-term page pinning: we might want
> to pin a 4k portion of a THP, but will end up blocking the whole THP
> from getting migrated/swapped/split/freed ... until we unpin (ever?). I
> wrote a reproducer [1] a while ago to show how you can effectively steal
> most THPs in the system with a comparatively small memlock limit, using
> io_uring ...
>
Correction: my reproducer already triggers a compound-page split to
really only pin a 4k page, and then frees the remaining 4k pages of the
previous THP. As that single 4k page stays allocated and pinned, we
cannot get a THP at these physical memory locations until the page is
unpinned.
--
Thanks,
David / dhildenb