Date:   Wed, 25 Nov 2020 15:41:36 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        Qian Cai <cai@....pw>, Michal Hocko <mhocko@...nel.org>,
        linux-kernel@...r.kernel.org, Mike Rapoport <rppt@...ux.ibm.com>,
        Baoquan He <bhe@...hat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >> Before that change, the memmap of memory holes were only zeroed
> >> out. So the zones/nid was 0, however, pages were not reserved and
> >> had a refcount of zero - resulting in other issues.
> > 
> > So maybe that "0,0" zoneid/nid was not actually the thing that
> > introduced the regression? Note: I didn't bisect anything yet, it was
> > just a guess.
> 
> I guess 0/0 is the issue, but that existed before when we had a simple
> memset(0). The root issue should be what Mike said:

Yes, the second stage must have stopped running somehow.

Is there anything we can do to induce a deterministically reproducible
kernel crashing behavior if the second stage doesn't run?

Why did we start doing a more graceful initialization in the first
stage, instead of making it less graceful by setting it to 0xff instead
of 0x00?
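
To illustrate (purely hypothetical, not a patch): poisoning the hole
memmap instead of zeroing it would turn any later use of those fields
into a deterministic crash, instead of a silently plausible node 0 /
zone 0:

/* hypothetical sketch: poison instead of zero the memmap of holes */
static void __init poison_hole_memmap(unsigned long start_pfn,
				      unsigned long end_pfn)
{
	/*
	 * All-ones page->flags gives an out-of-range nid/zoneid, so
	 * page_to_nid()/page_zone() on such a page should blow up
	 * immediately instead of returning node 0 / ZONE_DMA.
	 * (Assumes the range stays within one section's memmap.)
	 */
	memset(pfn_to_page(start_pfn), 0xff,
	       (end_pfn - start_pfn) * sizeof(struct page));
}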

> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")

So if that's not intentional, are you suggesting the zoneid/nid was a
bug if it was set to 0,0 for a non-RAM valid pfn?

> "correct" is problematic. If you have an actual memory hole, there is
> not always a right answer - unless I am missing something important.
> 
> 
> Assume you have a layout like this
> 
> [  zone X ] [ hole ] [ zone Y ]
> 
> If X and/or Y starts within a memory section, you have a valid
> memmap for X - but what would be the right node/zone?
> 
> 
> Assume you have a layout like this
> 
> [ zone X ]
> 
> whereby X ends inside a memory section. Then you hotplug memory. Assume
> it goes to X
> 
> [ zone X ][ hole in X ][ zone X]
> 
> or it goes to y
> 
> [ zone X ][ hole ][ zone Y ]
> 
> This can easily be reproduced by starting a VM in qemu with a memory
> size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory.

I don't get what the problem is, sorry.

You have a pfn: if pfn_valid() is true, pfn_to_page() returns a page
deterministically.

It's up to the kernel to decide which page structure belongs to any pfn
in the pfn_to_page() function.

Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id
in page->flags points to a node->zone whose zone_start_pfn -
zone_end_pfn range doesn't contain "pfn", that is a bug in
page_alloc.c.
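
Something along these lines (a debugging sketch, not a proposed patch)
is the invariant I mean:

/* sketch: the invariant compaction implicitly relies on */
static void check_pfn_zone(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return;

	page = pfn_to_page(pfn);
	/* zone_spans_pfn(): zone_start_pfn <= pfn < zone_end_pfn() */
	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
}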

I don't see how it is not possible to deterministically enforce that
the above never happens. Only then would it be true that there's not
always a right answer.

Zones can overlap, but it can't be that you do pfn_to_page() on a
valid pfn and obtain a page whose zone doesn't contain that pfn. Which
is what is currently crashing compaction.

I don't see how this is an unsolvable problem and why we should accept
to live with a bogus page->flags for reserved pages.

> We can't. The general rule is (as I was once told by Michal IIRC) that

The fact that we can't crash the kernel reliably when somebody uses
the wrong, uninitialized 0,0 value without an explicit PageReserved
check is my primary concern with keeping those zoneid/nid
uninitialized but non-kernel-crashing, since that already created this
unreproducible bug.
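
To spell out the fragile pattern I'm worried about (just a sketch):

/*
 * sketch: nothing crashes if the PageReserved() check is forgotten,
 * the bogus 0,0 in page->flags just looks like node 0 / ZONE_DMA.
 */
static int page_nid_if_initialized(struct page *page)
{
	if (PageReserved(page))
		return NUMA_NO_NODE;	/* possibly uninitialized hole memmap */
	return page_to_nid(page);
}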

> I'm not rooting for "keep this at 0/0" - I'm saying that I think there
> are corner cases where it might not be that easy.

I'm not saying it's easy. What I don't see is how you don't always
have the right answer and why it would be an unsolvable problem.

It is certainly problematic and difficult to solve in the mem_map
initialization logic, but to me having pfn_valid() &&
page_zone(pfn_to_page(pfn)) randomly returning the DMA zone of the
first node also looks problematic and difficult to handle across all
VM code, so overall it looks preferable to keep the complexity of the
mem_map initialization self-contained, not spilling over into the rest
of the VM.

> Yes, but there is a "Some of these" :)
> 
> Boot a VM with "-M 4000" and observe the memmap in the last section -
> they won't get initialized a second time.

Is "beyond the end of the zone" yet another case? I guess that's less
likely to give us problems because it's beyond the end of the zone.
Would pfn_valid return true for those pfns? If pfn_valid is not true
it's not really a concern, but then again I'd prefer it if those
struct pages beyond the end of the zone were set to a kernel-crashing
0xff.

In other words I just don't see why we should ever prefer to leave
some pages at a graceful but erroneous zoneid 0 / nid 0 that wouldn't
easily induce a crash if used.

> AFAIK, the mem_map array might have multiple NIDs - and it's set when
> initializing the zones.

Well, that's because there's no mem_map array with SPARSEMEM, but it's
not conceptually too different from having one. Even with flatmem there
could be multiple page structs for each pfn; the disambiguation has to
be handled by pfn_to_page() regardless of SPARSEMEM or not.

The point is that if page_zone(pfn_to_page(pfn)) points to the DMA
zone of the first node, and the pfn isn't part of the DMA zone of the
first node, that looks like a bug, and it can be enforced that it
never happens.

> Well, "reserved" is not a good indication "what" something actually is.
> 
> I documented that a while ago in include/linux/page-flags.h
> 
> "PG_reserved is set for special pages. The "struct page" of such a page
>  should in general not be touched (e.g. set dirty) except by its owner.
>  Pages marked as PG_reserved include:."
> 
> I suggest looking at that.
> 
> AFAIR, we have been setting *most* memmap in memory holes/non-ram
> reserved for a long time - long before I added the __init_single_page -
> see init_reserved_page() for example.

Sure, non-RAM with a valid page struct has always been marked
PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved.

I was pointing out that RAM can also be marked PG_reserved later by
the kernel, long after boot, as you mentioned for all the other cases
of PG_reserved; the most notable are drivers setting PG_reserved after
allocating RAM, either with vmalloc or with GART swapping RAM around
at another aliased physical address.

That is all born as RAM at boot: it gets page->flags done right, with
the right zoneid, and it only becomes PG_reserved later.

So I was suggesting that physical pfn ranges of non-RAM (whether they
are holes within zones or in between zones doesn't matter), with
pfn_valid() returning true and pfn_to_page() pointing deterministically
to one and only one struct page, should have that struct page
initialized exactly the same as if it were RAM.
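
Concretely, something like this (only a sketch; the zone/nid would be
whatever zone ends up covering the hole):

/*
 * sketch: initialize the memmap of a non-RAM hole with the zone/nid
 * of the zone covering it, exactly like the RAM around it, and mark
 * it reserved.
 */
static void __init init_hole_pages(unsigned long start_pfn,
				   unsigned long end_pfn,
				   unsigned long zone, int nid)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
		SetPageReserved(pfn_to_page(pfn));
	}
}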

Either that, or we can define a new NO_ZONE / NO_NODE id and crash in
page_zonenum or page_to_nid if either is ever called on such a page
struct.
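
i.e. something like this (again just a sketch, neither the sentinel
nor the check exists today):

/* sketch: a sentinel nid that crashes on first use */
#define NID_INVALID	NODES_MASK	/* hypothetical sentinel value */

static inline int page_to_nid(const struct page *page)
{
	int nid = (page->flags >> NODES_PGSHIFT) & NODES_MASK;

	VM_BUG_ON_PAGE(nid == NID_INVALID, page);
	return nid;
}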

Thanks,
Andrea
