Message-ID: <8a4a5d6e-d2cf-420d-91f3-2797618e7255@efficios.com>
Date: Thu, 8 May 2025 09:11:50 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, Oscar Salvador <osalvador@...e.de>
Cc: Michal Hocko <mhocko@...e.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Vlastimil Babka <vbabka@...e.cz>, Pavel Tatashin
<pasha.tatashin@...een.com>, Linus Torvalds <torvalds@...ux-foundation.org>,
linux-kernel <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>
Subject: Re: Memory hotplug locking issue: Useless (?) zone span seqlock
On 2025-05-08 06:45, David Hildenbrand wrote:
> On 07.03.25 21:22, Mathieu Desnoyers wrote:
>> I'm currently perfecting my understanding of the mm code and reviewing
>> pieces of it as I go, and stumbled on this:
>>
>> commit 27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking")
>>
>> This commit removes all users of zone_span_writelock(), thus making
>> the inline useless, but leaves the now useless
>> zone_span_seqbegin()/zone_span_seqretry() in place within
>> page_outside_zone_boundaries().
>>
>> So I'm confused. What's going on ?
>>
>> And if this commit got things very wrong when removing the
>> seqlock, I wonder if there are cases where its partial
>> pgdat_resize_lock() removal can be an issue as well.
>
> I stumbled over that myself recently as well. I think I mentioned in the
> past that we should just store
>
> start_pfn + end_pfn
>
> instead of
>
> start_pfn + nr_pages
>
>
> Then, concurrent resizing could happen (and we could atomically read
> start_pfn / end_pfn).
>
> Right now, when adjusting start_pfn, we always also have to adjust
> nr_pages. A concurrent reader calculating end_pfn manually could see
> some crappy result.
>
> Having that said, I am not aware of issues in that area, but it all
> looks like only a partial cleanup to me.
I wonder if all callers of zone_spans_pfn() prevent concurrent modification
of the zone's start_pfn and nr_pages?
For instance set_pfnblock_flags_mask() (called by set_pageblock_migratetype)
does:
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
If we look at zone_spans_pfn():
static inline unsigned long zone_end_pfn(const struct zone *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}
A concurrent updater which shrinks a zone by moving its start would have
to increment zone_start_pfn *and* decrement spanned_pages. If this happens
concurrently with the loads in zone_spans_pfn(), the reader can observe
an intermediate state (only spanned_pages reduced, or only zone_start_pfn
moved). The first scenario at least can lead to a false-positive VM_BUG_ON().
Likewise if the updater expands the zone by moving its start left: if the
reader loads the updated zone_start_pfn without observing the spanned_pages
update, this can also lead to a false-positive VM_BUG_ON().
I notice that zone_intersects() also uses zone_end_pfn(). It is used for
instance by default_kernel_zone_for_pfn() without locks. In this case,
reading spanned_pages and zone_start_pfn concurrently with an update could
cause a false-positive match, for instance if the start of the zone has
been moved but the prior spanned_pages value is loaded (concurrent shrink).
This could match a zone outside the pfn range passed as parameter.
Another reader of those fields is update_pgdat_span(), which appears to
be called only from remove_pfn_range_from_zone with mem_hotplug_lock
held in write mode. So that one should be fine.
AFAIU, updates to spanned_pages and zone_start_pfn are done by:

- remove_pfn_range_from_zone() (AFAIU always called with mem_hotplug_lock
  held in write mode),
- shrink_zone_span() (called from remove_pfn_range_from_zone()),
- resize_zone_range() (a __meminit function), called from
  move_pfn_range_to_zone(), which is also called with mem_hotplug_lock
  held in write mode.
So I think your idea of keeping track of both zone_start_pfn and zone_end_pfn
would solve this specific issue. However, we'd have to carefully consider
what happens to users of spanned_pages (e.g. zone_is_empty() callers),
because spanned_pages would then have to be calculated as end - start, which
can have similar atomicity issues with respect to concurrent updates.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com