Message-ID: <87o7svreq0.fsf@nvidia.com>
Date: Fri, 25 Nov 2022 11:58:46 +1100
From: Alistair Popple <apopple@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: Matthew Wilcox <willy@...radead.org>,
Hugh Dickins <hughd@...gle.com>, Gavin Shan <gshan@...hat.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, william.kucharski@...cle.com,
ziy@...dia.com, kirill.shutemov@...ux.intel.com,
zhenyzha@...hat.com, shan.gavin@...il.com, riel@...riel.com
Subject: Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation
David Hildenbrand <david@...hat.com> writes:
> On 24.11.22 04:33, Matthew Wilcox wrote:
>> On Thu, Nov 24, 2022 at 12:06:56PM +1100, Alistair Popple wrote:
>>>
>>> David Hildenbrand <david@...hat.com> writes:
>>>
>>>> On 23.11.22 06:14, Hugh Dickins wrote:
>>>>> On Wed, 23 Nov 2022, Gavin Shan wrote:
>>>>>
>>>>>> The issue is reported when removing memory through virtio_mem device.
>>>>>> The transparent huge page, experienced copy-on-write fault, is wrongly
>>>>>> regarded as pinned. The transparent huge page is escaped from being
>>>>>> isolated in isolate_migratepages_block(). The transparent huge page
>>>>>> can't be migrated and the corresponding memory block can't be put
>>>>>> into offline state.
>>>>>>
>>>>>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>>>>>> the transparent huge page can be isolated and migrated, and the memory
>>>>>> block can be put into offline state.
>>>>>>
>>>>>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>>>>>> Cc: stable@...r.kernel.org # v5.8+
>>>>>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>>>>>> Suggested-by: David Hildenbrand <david@...hat.com>
>>>>>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>>>>> Interesting, good catch, looked right to me: except for the Fixes line
>>>>> and mention of v5.8. That CoW change may have added a case which easily
>>>>> demonstrates the problem, but it would have been the wrong test on a THP
>>>>> for long before then - but only in v5.7 were compound pages allowed
>>>>> through at all to reach that test, so I think it should be
>>>>> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for
>>>>> CMA allocations")
>>>>> Cc: stable@...r.kernel.org # v5.7+
>>>>> Oh, no, stop: this is not so easy, even in the latest tree.
>>>>> Because at the time of that "admittedly racy check", we have no hold
>>>>> at all on the page in question: and if it's PageLRU or PageCompound
>>>>> at one instant, it may be different the next instant. Which leaves it
>>>>> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
>>>>> path - needs research. *Perhaps* there are no more BUG_ON()s in the
>>>>> total_mapcount() path than in the existing page_mapcount() path.
>>>>> I suspect that for this to be safe (before your patch and more so after),
>>>>> it will be necessary to shift the "admittedly racy check" down after the
>>>>> get_page_unless_zero() (and check the sequence of operations when a
>>>>> compound page is initialized).
>>>>
>>>> Grabbing a reference first sounds like the right approach to me.
>>>
>>> I think you're right. Without a page reference I don't think it is even
>>> safe to look at struct page, at least not without synchronisation
>>> against memory hot unplug which could remove the struct page. From a
>>> quick glance I didn't see anything here that obviously did that though.
>> Memory hotplug is the offending party here. It has to make sure that
>> everything else is definitely quiescent before removing the struct pages.
>> Otherwise you can't even try_get a refcount.
Sure, I might be missing something but how can memory hotplug possibly
synchronise against some process (eg. try_to_compact_pages) doing
try_get(pfn_to_page(random_pfn)) without that process either informing
memory hotplug that it needs pages to remain valid long enough to get a
reference or detecting that memory hotplug is in the process of
offlining pages?
> At least alloc_contig_range() and memory offlining are mutually
> exclusive due to MIGRATE_ISOLATE. I recall that ordinary memory
> compaction similarly deals with isolated pageblocks (or some other
> mechanism I forgot) to not race with memory offlining. Wouldn't worry
> about that for now.
Sorry, agreed - to be clear, this discussion isn't relevant for this patch,
but reviewing it got me thinking about how this works. I still don't see
how alloc_contig_range() is totally safe against memory offlining
though. From what I can see we have the following call path to set
MIGRATE_ISOLATE:
alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))
There's nothing in that call stack that prevents offlining of the page,
hence the struct page may be freed by this point. Am I missing something
else here?
try_to_compact_pages() has a similar issue which is what I noticed
reviewing this patch:
try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
isolate_migratepages() -> isolate_migratepages_block() ->
PageHuge(pfn_to_page(pfn))
Again I couldn't see anything in that path that would hold the page
stable long enough to safely perform the pfn_to_page() and get a
reference. And after a bit of fiddling I was able to trigger the below
oops running the compaction_test selftest whilst offlining memory so I
don't think it is safe:
Entering kdb (current=0xffff8882452fb6c0, pid 5646) on processor 1 Oops: (null)
due to oops @ 0xffffffff81af6486
CPU: 1 PID: 5646 Comm: compaction_test Not tainted 6.0.0-01424-gf3ec7e734795 #110
Hardware name: Gigabyte Technology Co., Ltd. B150M-D3H/B150M-D3H-CF, BIOS F24 04/11/2018
RIP: 0010:PageHuge+0xa6/0x180
Code: 00 0f 85 d0 00 00 00 48 8b 43 08 a8 01 0f 85 97 00 00 00 66 90 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 02 7e 7f 31 c0 80 7b 50 02 0f 94 c0 5b 41 5c
RSP: 0000:ffffc9001252efa8 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: fffffffffffffffe RCX: ffffffff81af63f9
RDX: 0000000000000009 RSI: 0000000000000008 RDI: 000000000000004e
RBP: ffffc9001252efb8 R08: 0000000000000000 R09: ffffea000f690007
R10: fffff94001ed2000 R11: 0000000000000001 R12: ffffea000f690008
R13: 0000000000000ab3 R14: ffffea000f690000 R15: 00000000003da400
FS: 00007f83e08b7740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb6e1cbb3e0 CR3: 0000000344044003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
isolate_migratepages_block+0x43c/0x3dc0
? reclaim_throttle+0x7a0/0x7a0
? __reset_isolation_suitable+0x350/0x350
compact_zone+0xb73/0x31f0
? compaction_suitable+0x1e0/0x1e0
compact_zone_order+0x127/0x240
? compact_zone+0x31f0/0x31f0
? __this_cpu_preempt_check+0x13/0x20
? cpuusage_write+0x380/0x380
? __kasan_check_read+0x11/0x20
try_to_compact_pages+0x23b/0x770
? psi_task_change+0x201/0x300
__alloc_pages_direct_compact+0x15d/0x650
? get_page_from_freelist+0x3fa0/0x3fa0
? psi_task_change+0x201/0x300
? _raw_spin_unlock+0x19/0x40
__alloc_pages_slowpath.constprop.0+0x9e1/0x2260
? warn_alloc+0x1a0/0x1a0
? __zone_watermark_ok+0x430/0x430
? prepare_alloc_pages+0x40b/0x620
__alloc_pages+0x42f/0x540
? __alloc_pages_slowpath.constprop.0+0x2260/0x2260
alloc_buddy_huge_page.isra.0+0x7c/0x130
alloc_fresh_huge_page+0x1f1/0x4e0
alloc_pool_huge_page+0xab/0x2d0
__nr_hugepages_store_common+0x37a/0xed0
? return_unused_surplus_pages+0x330/0x330
? __kasan_check_write+0x14/0x20
? _raw_spin_lock_irqsave+0x99/0x100
? proc_doulongvec_minmax+0x54/0x80
hugetlb_sysctl_handler_common+0x247/0x320
? nr_hugepages_store+0xf0/0xf0
? alloc_huge_page+0xbf0/0xbf0
hugetlb_sysctl_handler+0x20/0x30
proc_sys_call_handler+0x451/0x650
? unregister_sysctl_table+0x1c0/0x1c0
? apparmor_file_permission+0x124/0x280
? security_file_permission+0x72/0x90
proc_sys_write+0x13/0x20
vfs_write+0x7ca/0xd80
? kernel_write+0x4d0/0x4d0
? do_sys_openat2+0x114/0x450
? __kasan_check_write+0x14/0x20
? down_write+0xb4/0x130
ksys_write+0x116/0x220
? __kasan_check_write+0x14/0x20
? __ia32_sys_read+0xb0/0xb0
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x52/0xc0
__x64_sys_write+0x73/0xb0
? syscall_exit_to_user_mode+0x26/0x50
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f83e09b2190
Code: 40 00 48 8b 15 71 9c 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 51 24 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
RSP: 002b:00007ffe4c08e478 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffe4c08e648 RCX: 00007f83e09b2190
RDX: 0000000000000006 RSI: 000055d7575611f8 RDI: 0000000000000003
RBP: 00007ffe4c08e4c0 R08: 00007f83e0a8cc60 R09: 0000000000000000
R10: 00007f83e08d40b8 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffe4c08e658 R14: 000055d757562df0 R15: 00007f83e0adc020
</TASK>