linux-kernel - Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2541344f-61d8-96d1-10e9-ba7e1743a299@redhat.com>
Date:   Fri, 25 Nov 2022 09:54:12 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Alistair Popple <apopple@...dia.com>
Cc:     Matthew Wilcox <willy@...radead.org>,
        Hugh Dickins <hughd@...gle.com>, Gavin Shan <gshan@...hat.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        akpm@...ux-foundation.org, william.kucharski@...cle.com,
        ziy@...dia.com, kirill.shutemov@...ux.intel.com,
        zhenyzha@...hat.com, shan.gavin@...il.com, riel@...riel.com
Subject: Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation

>>>> I think you're right. Without a page reference I don't think it is even
>>>> safe to look at struct page, at least not without synchronisation
>>>> against memory hot unplug which could remove the struct page. From a
>>>> quick glance I didn't see anything here that obviously did that though.
>>> Memory hotplug is the offending party here.  It has to make sure
>>> that
>>> everything else is definitely quiescent before removing the struct pages.
>>> Otherwise you can't even try_get a refcount.
> 
> Sure, I might be missing something but how can memory hotplug possibly
> synchronise against some process (eg. try_to_compact_pages) doing
> try_get(pfn_to_page(random_pfn)) without that process either informing
> memory hotplug that it needs pages to remain valid long enough to get a
> reference or detecting that memory hotplug is in the process of
> offlining pages?

It currently doesn't, and it's been mostly a known theoretical problem 
so far. We've been ignoring these kind of problems because they are not 
easy to sort out and so far never popped up in practice.

First, the correct approach is using pfn_to_online_page() instead of 
pfn_to_page() when in doubt whether the page might already be offline. 
While we're using pfn_to_online_page() already in quite some PFN 
walkers, changes are good that we're still missing some.

Second, essentially anybody (PFN walker) doing a pfn_to_online_page() is 
prone to races with memory offlining. I've discussed this in the past 
with Michal and Dan and one idea was to use RCU to synchronize PFN 
walkers and pfn_to_online_page(), essentially synchronizing clearing of 
the SECTION_IS_ONLINE flag.

Nobody worked on that so far because we've never had actual BUG reports. 
These kind of races are rare, but they are theoretically possible.

> 
>> At least alloc_contig_range() and memory offlining are mutually
>> exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory
>> compaction similarly deals with isolated pageblocks (or some other
>> mechanism I forgot) to not race with memory offlining. Wouldn't worry
>> about that for now.
> 
> Sorry, agree - to be clear this discussion isn't relevant for this patch
> but reviewing it got me thinking about how this works. I still don't see
> how alloc_contig_range() is totally safe against memory offlining
> though. From what I can see we have the following call path to set
> MIGRATE_ISOLATE:
> 
> alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
> isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))
> 
> There's nothing in that call stack that prevent offlining of the page,
> hence the struct page may be freed by this point. Am I missing something
> else here?

Good point. I even had at some point a patch that converted some 
pfn_to_online_page() calls in there, but discarded it.

IIRC, two alloc_contig_range() users are safe because:

1) virtio-mem synchonizes against memory offlining using the memory 
notifier. While memory offlining is in progress, it won't call 
alloc_contig_range().

2) CMA regions will always fail to offline because they have MIGRATE_CMA 
set.

What remains is alloc_contig_pages(), for example, used for memtrace on 
ppc, kfence, and allocation of gigantic pages. That might indeed be 
racy. At least from kfence_init_late() it's unlikely (impossible?) that 
we'll have concurrent memory offlining.

> 
> try_to_compact_pages() has a similar issue which is what I noticed
> reviewing this patch:

Yes, I can spot pfn_to_online_page() in __reset_isolation_pfn(), but of 
course, are likely missing proper synchronization and checks later. I 
wonder if we could use the memory notifier to temporarily pause any 
running compaction and to restart it once memory offlining succeeded.

> 
> try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
> isolate_migratepages() -> isolate_migratepages_block() ->
> PageHuge(pfn_to_page(pfn))
> 
> Again I couldn't see anything in that path that would hold the page
> stable long enough to safely perform the pfn_to_page() and get a
> reference. And after a bit of fiddling I was able to trigger the below
> oops running the compaction_test selftest whilst offlining memory so I
> don't think it is safe:

Thanks for finding a reproducer. This is exactly the kind of BUG that we 
speculated about in the past but that we haven't seen in practice yet.


Having that said, I'd be very happy if someone would have time to work 
on proper synchronization and I'd be happy to help 
brainstorming/reviewing :)

-- 
Thanks,

David / dhildenb