Message-ID: <67dfd817-073e-9abb-316f-689ba8193965@redhat.com>
Date: Mon, 30 Jan 2023 10:03:45 +0100
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>,
    "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Matthew Wilcox <willy@...radead.org>,
    Sanan Hasanov <sanan.hasanov@...ghts.ucf.edu>,
    "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
    "linux-mm@...ck.org" <linux-mm@...ck.org>,
    "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
    "contact@...zz.com" <contact@...zz.com>,
    "syzkaller@...glegroups.com" <syzkaller@...glegroups.com>,
    Huang Ying <ying.huang@...el.com>
Subject: Re: kernel BUG in page_add_anon_rmap

>>>
>>> I reproduced on next-20230127 (did not try upstream yet).
>
> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.
>

Thanks for identifying the problematic commit!

...

>>>
>>> I think two key things are that a) THP are set to "always" and b) we have a
>>> NUMA setup [I assume].
>>>
>>> The relevant bits:
>>>
>>> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
>>> mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
>>> nr_pages_mapped:511 pincount:0
>>> [ 439.899611] memcg:ffff986dc4689000
>>> [ 439.902207] anon flags:
>>> 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
>>> dead000000000400
>>> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
>>> 0000000000000000
>>> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
>>> ffff98714338a001
>>> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
>>> ffff986dc4689000
>>> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
>>> rmap_t)((((1UL))) << (0)))))
>>>
>>>
>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
>>> mapped into a single
>>> page table (no fork() or similar).
>
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?

I was assuming the whole folio would get migrated. As you raise below,
it's all a bit unclear once THP get involved and dealing with mbind()
and page migration.

>>>
>>> I created this reduced reproducer that triggers 100%:
>
> Very helpful, thank you.
>
>>>
>>>
>>> #include <stdint.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <numaif.h>
>>>
>>> int main(void)
>>> {
>>>     mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>>          MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>>     madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>
>>>     *(uint32_t*)0x20000080 = 0x80000;
>>>     mlock((void*)0x20001000ul, 0x2000ul);
>>>     mlock((void*)0x20000000ul, 0x3000ul);
>
> It's not an mlock() issue in particular: quickly established by
> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
> Looks like a vma splitting issue now.

Gah, should have tried something like that first before suspecting it's
mlock related. :)

>
>>>     mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>>           MPOL_MF_MOVE);
>
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
>
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.

I'm aware of a similar issue with long-term page pinning: we might want
to pin a 4k portion of a THP, but will end up blocking the whole THP from
getting migrated/swapped/split/freed/ ... until we unpin (ever?). I wrote
a reproducer [1] a while ago to show how you can effectively steal most
THP in the system using a comparatively small memlock limit with
io_uring ...

In theory, we could split the THP before long-term pinning only a
subregion ... but what if we cannot split the THP because it's already
pinned (previous pinning request that covered the whole THP)? Copying
instead of splitting would also not be possible, if the page is already
pinned ... so we'd never want to allow long-term pinning a THP ...
but that means that we would have to fail pinning if splitting the THP
fails, and that there would be performance consequences for THP users :/

Non-trivial ... just like mlocking only a part of a THP or mbinding
different parts of a THP to different nodes ...

[1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

--
Thanks,

David / dhildenb
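
For readers following the index:0x20003 vs. mbind(0x20002000, ...) question
above, here is a minimal sketch of the address arithmetic. It is not kernel
code. It assumes 4 KiB base pages and a naturally aligned order-9 (2 MiB) THP,
as "order:9" in the dump suggests, and it assumes that the index of a page in
a private anonymous mapping equals the virtual address shifted right by the
page shift. All helper and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the thread: 4 KiB base pages, order-9 THP. */
#define BASE_PAGE_SHIFT 12u
#define THP_ORDER       9u
#define THP_NR_PAGES    ((uintptr_t)1 << THP_ORDER)
#define THP_SIZE        ((uintptr_t)1 << (BASE_PAGE_SHIFT + THP_ORDER))

/* Start of the naturally aligned THP-sized region that contains addr. */
static inline uintptr_t thp_base(uintptr_t addr)
{
    return addr & ~(THP_SIZE - 1);
}

/* Which 4 KiB subpage of that region addr falls into (0..511). */
static inline unsigned subpage_index(uintptr_t addr)
{
    return (unsigned)((addr & (THP_SIZE - 1)) >> BASE_PAGE_SHIFT);
}

/* Subpage number derived from a pfn, assuming the head pfn is aligned
 * to THP_NR_PAGES. */
static inline unsigned pfn_subpage(uintptr_t pfn)
{
    return (unsigned)(pfn & (THP_NR_PAGES - 1));
}

/* Page index under the assumption index == vaddr >> BASE_PAGE_SHIFT. */
static inline uintptr_t linear_index(uintptr_t addr)
{
    return addr >> BASE_PAGE_SHIFT;
}

/* Values from the reproducer and the page dump in this thread. */
static inline void check_thread_values(void)
{
    assert(thp_base(0x20002000ul) == 0x20000000ul); /* mbind() start */
    assert(subpage_index(0x20002000ul) == 2);       /* mbind() range */
    assert(linear_index(0x20003000ul) == 0x20003ul);/* dumped index  */
    assert(pfn_subpage(0x14ee03ul) == 3);           /* dumped pfn    */
}
```

Under these assumptions the mbind() range covers only subpage 2 of the THP
starting at 0x20000000, while both the dumped index and the dumped pfn point
at subpage 3, i.e. the page just past the requested range.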