Message-ID: <67dfd817-073e-9abb-316f-689ba8193965@redhat.com>
Date: Mon, 30 Jan 2023 10:03:45 +0100
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Matthew Wilcox <willy@...radead.org>,
Sanan Hasanov <sanan.hasanov@...ghts.ucf.edu>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"contact@...zz.com" <contact@...zz.com>,
"syzkaller@...glegroups.com" <syzkaller@...glegroups.com>,
Huang Ying <ying.huang@...el.com>
Subject: Re: kernel BUG in page_add_anon_rmap
>>>
>>> I reproduced on next-20230127 (did not try upstream yet).
>
> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.
>
Thanks for identifying the problematic commit! ...
>>>
>>> I think two key things are that a) THP are set to "always" and b) we have a
>>> NUMA setup [I assume].
>>>
>>> The relevant bits:
>>>
>>> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
>>> [ 439.899611] memcg:ffff986dc4689000
>>> [ 439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
>>> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
>>> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
>>> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
>>> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((rmap_t)((((1UL))) << (0)))))
>>>
>>>
>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
>>> mapped into a single page table (no fork() or similar).
>
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?
I was assuming the whole folio would get migrated. As you raise below, it's
all a bit unclear what should happen once THP are involved in mbind() and
page migration.
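FWIW, to see what actually ends up where, one could query the node of every
subpage after the mbind() -- an untested sketch (libnuma's move_pages() in
pure query mode, i.e. nodes == NULL; addresses as in the reproducer below;
link with -lnuma):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
        void *pages[512];
        int status[512];
        unsigned long i;
        char *base = mmap((void*)0x20000000ul, 0x1000000ul,
                          PROT_READ|PROT_WRITE,
                          MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);

        madvise(base, 0x1000000ul, MADV_HUGEPAGE);
        memset(base, 1, 0x200000ul);    /* populate (hopefully) one THP */

        mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
              MPOL_MF_MOVE);

        /* nodes == NULL: move_pages() only reports where each subpage sits. */
        for (i = 0; i < 512; i++)
                pages[i] = base + i * 0x1000ul;
        if (move_pages(0, 512, pages, NULL, status, MPOL_MF_MOVE) == 0)
                for (i = 0; i < 512; i++)
                        printf("subpage %3lu: node %d\n", i, status[i]);
        return 0;
}

That should at least show whether only the mbound 4k page or the whole folio
got moved.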
>>>
>>> I created this reduced reproducer that triggers 100%:
>
> Very helpful, thank you.
>
>>>
>>>
>>> #include <stdint.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <numaif.h>
>>>
>>> int main(void)
>>> {
>>>         mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>>              MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>>         madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>
>>>         *(uint32_t*)0x20000080 = 0x80000;
>>>         mlock((void*)0x20001000ul, 0x2000ul);
>>>         mlock((void*)0x20000000ul, 0x3000ul);
>
> It's not an mlock() issue in particular: quickly established by
> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
> Looks like a vma splitting issue now.
Gah, I should have tried something like that first before suspecting it was
mlock-related. :)
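So presumably (guessing the exact ranges you used; untested here) the variant
you tried looks something like the following, splitting the VMA without any
mlock():

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
        mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
             MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
        madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);

        *(uint32_t*)0x20000080 = 0x80000;
        /* Split the VMA like the two mlock() calls did, without mlocking. */
        madvise((void*)0x20001000ul, 0x2000ul, MADV_NOHUGEPAGE);
        madvise((void*)0x20000000ul, 0x3000ul, MADV_NOHUGEPAGE);
        mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
              MPOL_MF_MOVE);
        return 0;
}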
>
>>>         mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>>               MPOL_MF_MOVE);
>>>
>>>         return 0;
>>> }
>
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
>
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.
I'm aware of a similar issue with long-term page pinning: we might want to
pin a 4k portion of a THP, but will end up blocking the whole THP from
getting migrated/swapped/split/freed ... until we unpin (if ever). I wrote
a reproducer [1] a while ago to show how you can effectively steal most THP
in the system with a comparatively small memlock limit, using io_uring ...

In theory, we could split the THP before long-term pinning only a
subregion ... but what if we cannot split the THP because it is already
pinned (by a previous pinning request that covered the whole THP)? Copying
instead of splitting would not be possible either if the page is already
pinned ... so we'd never want to allow long-term pinning of a THP ... but
that means we would have to fail pinning whenever splitting the THP fails,
and there would be performance consequences for THP users :/

Non-trivial ... just like mlocking only a part of a THP or mbinding
different parts of a THP to different nodes ...
[1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c
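In essence, [1] boils down to something like the following (stripped-down,
untested sketch using liburing's fixed-buffer registration; needs liburing
and a memlock limit that covers the 4k buffer):

#include <liburing.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;
        struct iovec iov;
        char *p = mmap(NULL, 0x200000ul, PROT_READ|PROT_WRITE,
                       MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

        madvise(p, 0x200000ul, MADV_HUGEPAGE);
        memset(p, 1, 0x200000ul);       /* hopefully backed by a single THP */

        /* Long-term pin a single 4k subpage of the THP ... */
        iov.iov_base = p;
        iov.iov_len = 0x1000ul;
        io_uring_queue_init(1, &ring, 0);
        io_uring_register_buffers(&ring, &iov, 1);

        /*
         * ... only the 4k get charged against the memlock limit, but the
         * whole 2 MiB folio cannot get migrated/swapped/split/freed until
         * the buffer is unregistered or the ring goes away.
         */
        pause();
        return 0;
}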
--
Thanks,
David / dhildenb