Message-ID: <f93ef55c-250d-445f-9d77-b619685c419b@redhat.com>
Date: Fri, 9 May 2025 11:38:04 +0200
From: David Hildenbrand <david@...hat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Andrew Morton <akpm@...ux-foundation.org>, Oscar Salvador <osalvador@...e.de>
Cc: Michal Hocko <mhocko@...e.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Vlastimil Babka <vbabka@...e.cz>, Pavel Tatashin
<pasha.tatashin@...een.com>, Linus Torvalds <torvalds@...ux-foundation.org>,
linux-kernel <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>
Subject: Re: Memory hotplug locking issue: Useless (?) zone span seqlock
On 08.05.25 15:11, Mathieu Desnoyers wrote:
> On 2025-05-08 06:45, David Hildenbrand wrote:
>> On 07.03.25 21:22, Mathieu Desnoyers wrote:
>>> I'm currently perfecting my understanding of the mm code and reviewing
>>> pieces of it as I go, and stumbled on this:
>>>
>>> commit 27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking")
>>>
>>> This commit removes all users of zone_span_writelock(), thus making
>>> that inline useless, but leaves the now-pointless
>>> zone_span_seqbegin()/zone_span_seqretry() calls in place within
>>> page_outside_zone_boundaries().
>>>
>>> So I'm confused. What's going on?
>>>
>>> And if this commit got things very wrong when removing the
>>> seqlock, I wonder if there are cases where its partial
>>> pgdat_resize_lock() removal can be an issue as well.
>>
>> I stumbled over that myself recently as well. I think I mentioned in the
>> past that we should just store
>>
>> start_pfn + end_pfn
>>
>> instead of
>>
>> start_pfn + nr_pages
>>
>>
>> Then, concurrent resizing could happen (and we could atomically read
>> start_pfn / end_pfn).
>>
>> Right now, when adjusting start_pfn, we always also have to adjust
>> nr_pages. A concurrent reader calculating end_pfn manually could see
>> some crappy result.
>>
>> Having that said, I am not aware of issues in that area, but it all
>> looks like only a partial cleanup to me.
>
> I wonder if all callers of zone_spans_pfn() prevent concurrent modification
> of the zone's start_pfn and nr_pages?
I think compaction does not protect against that (e.g., no
mem_hotplug_begin()).
My understanding was that compaction can mostly tolerate such races
(e.g., by rechecking pfn_to_online_page() + page_zone()).
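Roughly the pattern I have in mind (just a sketch, not the exact
compaction code):

	/*
	 * Sketch: after deriving a page from a pfn that can race with
	 * memory hotunplug, re-validate before trusting its zone.
	 */
	struct page *page = pfn_to_online_page(pfn);

	if (!page)
		return NULL;	/* section went offline under us */
	if (page_zone(page) != zone)
		return NULL;	/* zone boundaries moved under us */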
>
> For instance, set_pfnblock_flags_mask() (called by
> set_pageblock_migratetype()) does:
>
> VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
>
Yes, that might indeed trigger in some weird races.
> If we look at zone_spans_pfn():
>
> static inline unsigned long zone_end_pfn(const struct zone *zone)
> {
> 	return zone->zone_start_pfn + zone->spanned_pages;
> }
>
> static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> {
> 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> }
>
> A concurrent updater which shrinks a zone by moving its start would have
> to increment zone_start_pfn *and* decrement spanned_pages. If this happens
> concurrently with the loads in zone_spans_pfn(), the reader can observe
> an intermediate state (only spanned_pages reduced, or only the zone start
> moved). The first scenario at least can lead to a false-positive VM_BUG_ON().
Yup.
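To spell out one such interleaving (shrinking the zone at its start,
assuming spanned_pages gets updated first):

	Updater                          Reader (zone_spans_pfn())
	-------                          -------------------------
	zone->spanned_pages -= nr;
	                                 start = zone->zone_start_pfn; /* old */
	                                 end = start + zone->spanned_pages; /* new */
	                                 /* end is nr pages too small: a pfn
	                                    still inside the zone appears to be
	                                    outside -> false-positive VM_BUG_ON() */
	zone->zone_start_pfn += nr;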
>
> Likewise if the updater expands the zone by moving its start left: if
> the reader loads the updated start pfn without observing the spanned_pages
> update, that too can lead to a false-positive VM_BUG_ON().
>
> I notice that zone_intersects() also uses zone_end_pfn(). It is used, for
> instance, by default_kernel_zone_for_pfn() without locks. In this case,
> reading start_pfn and spanned_pages concurrently with an update could cause
> a false-positive match, for instance if the new start of the range is
> observed together with the prior spanned_pages value (concurrent shrink).
> This could match a zone outside of the pfn range passed as parameter.
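For reference, zone_intersects() is (quoting from memory, so
double-check mmzone.h):

	static inline bool zone_intersects(struct zone *zone,
			unsigned long start_pfn, unsigned long nr_pages)
	{
		if (zone_is_empty(zone))
			return false;
		if (start_pfn >= zone_end_pfn(zone) ||
		    start_pfn + nr_pages <= zone->zone_start_pfn)
			return false;

		return true;
	}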
>
> Another reader of those fields is update_pgdat_span(), which appears to
> be called only from remove_pfn_range_from_zone with mem_hotplug_lock
> held in write mode. So that one should be fine.
>
> AFAIU, updates to nr pages and zone start pfn are done by:
>
> - remove_pfn_range_from_zone() (AFAIU always called with mem_hotplug_lock
> in write mode),
> - shrink_zone_span() (called from remove_pfn_range_from_zone()),
> - resize_zone_range() (a __meminit function), called from
> move_pfn_range_to_zone(), which is also called with mem_hotplug_lock in
> write mode.
Yes, I refactored that at some point. mem_hotplug_begin() could protect
against it, but we really don't want to grab that all over the place.
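What "protecting" such a reader would look like, roughly (a sketch;
way too heavyweight to sprinkle over hot paths like
set_pfnblock_flags_mask()):

	get_online_mems();	/* blocks memory hot(un)plug */
	if (zone_spans_pfn(zone, pfn))
		/* ... */;
	put_online_mems();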
>
> So I think your idea of keeping track of both zone_start_pfn and zone_end_pfn
> would solve this specific issue. However, we'd have to carefully consider what
> happens to users of spanned_pages (e.g., zone_is_empty()), because they would
> then have to calculate spanned_pages as end - start, which can have similar
> atomicity issues with concurrent updates.
We could easily just store both (spanned_pages + end_pfn), such that
each can be read atomically.
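Untested sketch (a stored zone->zone_end_pfn field is hypothetical; the
writer side would keep updating all fields under mem_hotplug_lock, but
with WRITE_ONCE() so readers never see torn values):

	static inline bool zone_spans_pfn(const struct zone *zone,
					  unsigned long pfn)
	{
		/*
		 * Each field is read atomically; no start + spanned_pages
		 * sum that could mix an old and a new value.
		 */
		return READ_ONCE(zone->zone_start_pfn) <= pfn &&
		       pfn < READ_ONCE(zone->zone_end_pfn);
	}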
--
Cheers,
David / dhildenb