linux-kernel - Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201203105107.GR123287@linux.ibm.com>
Date:   Thu, 3 Dec 2020 12:51:07 +0200
From:   Mike Rapoport <rppt@...ux.ibm.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        Qian Cai <cai@....pw>, Michal Hocko <mhocko@...nel.org>,
        linux-kernel@...r.kernel.org, Baoquan He <bhe@...hat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> > Hmm, do you have any logs? It worked on my box and with various memory
> > configurations in qemu.
> 
> It crashes in reserve_bootmem_region so no logs at all since the
> serial console isn't active yet.
> 
> > I believe that the proper solution would be to have all the memory
> > reserved by the firmware added to memblock.memory, like all other
> > architectures do, but I'm not sure if we can easily and reliably
> > determine what of E820 reserved types are actually backed by RAM.
> > So this is not feasible as a short term solution.
> > 
> > My patch [1], though, is an immediate improvement and I think it's worth
> > trying to fix off-by-one's that prevent your system from booting.
> > 
> > [1] https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org
> 
> Yes that's the patch I applied.
> 
> Relevant points after debugging:
> 
> 1) can you try again with DEBUG_VM=y?

Mea cupla :)

> 2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
>    page)) on all struct pages allocated, exactly I was suggesting to
>    do in the previous email. I wonder why we're not doing that even
>    with DEBUG_VM=n, perhaps it's too slow for TiB systems. See
>    page_init_poison().
> 
> 3) given 2), your patch below by deleting "0,0" initialization
>    achieves the debug feature to keep PagePoison forever on all
>    uninitialized pages, imagine PagePoison is really
>    NO_NODEID/NO_ZONEID and doesn't need handling other than a crash.
>   
> -		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 
> 4) because of the feature in 3) I now hit the PagePoison VM_BUG_ON
>    because pfn 0 wasn't initialized when reserve_bootmem_region tries
>    to call __SetPageReserved on a PagePoison page.

So this is the off-by-one a was talking about. My patch removed
initializaion of the hole before the memblock.memory

> 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
>    memblock.reserve that doesn't overlap any memblock.memory zone.

And, IMHO, this is the fundamental BUG.

There is RAM at address 0, there is a DIMM there, so why on earth this
should not be a part of memblock.memory?

>    The memblock_start_of_DRAM moves the start of the DMA zone above
>    the pfn 0, so then pfn 0 already ends up in the no zone land David
>    mentioned where it's not trivial to decide to give it a zoneid if
>    it's not even spanned in the zone.
> 
>    Not just the zoneid, there's not even a nid.
> 
>    So we have a pfn with no zoneid, and no nid, and not part of the
>    zone ranges, not part of the nid ranges not part of the
>    memory.memblock.

We have a pfn that should have been in memblock.memory, in ZONE_DMA and
in the first node with memory. If it was not trimmed from there 

>    We can't really skip the initialization of the the pfn 0, it has to
>    get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
>    we'd crash again. (although we'd crash with a nice PagePoison(page)
>    == true behavior)

Agree. Struct page for pfn should get the same zoneid and nodeid as pfn 1.

For me the below fixup worked (both with and without DEBUG_VM=y).

>From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@...ux.ibm.com>
Date: Thu, 3 Dec 2020 11:40:17 +0200
Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for holes"

Signed-off-by: Mike Rapoport <rppt@...ux.ibm.com>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..86fde4424e87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
 			       unsigned long zone,
 			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	static unsigned long hole_start_pfn;
+	unsigned long start_pfn, end_pfn;
 	unsigned long range_end_pfn = range_start_pfn + size;
 	u64 pgcnt = 0;
 	int i;
@@ -6235,7 +6236,6 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
@@ -6243,10 +6243,10 @@ void __init __weak memmap_init(unsigned long size, int nid,
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
-							zone, nid);
-		next_pfn = end_pfn;
+		if (hole_start_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_start_pfn,
+							start_pfn, zone, nid);
+		hole_start_pfn = end_pfn;
 	}
 
 	/*
@@ -6256,8 +6256,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (hole_start_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
 						zone, nid);
 
 	if (pgcnt)
-- 
2.28.0

...

> From 89469f063c192ae70aea0bd6e1e2a7e99894e5b6 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@...hat.com>
> Date: Wed, 2 Dec 2020 23:23:26 -0500
> Subject: [PATCH 1/1] mm: initialize struct pages in reserved regions outside
>  of the zone ranges
> 
> pfn 0 wasn't initialized and the PagePoison remained set when
> reserve_bootmem_region() called __SetPageReserved, inducing a silent
> boot failure with DEBUG_VM (and correctly so, because the crash
> signaled the nodeid/nid of pfn 0 would be again wrong).
> 
> Without this change, the pfn 0 part of a memblock.reserved range,
> isn't in any zone spanned range, nor in any nid spanned range, when we
> initialize the memblock.memory holes pfn 0 won't be included because
> it has no nid and no zoneid.
> 
> There's no enforcement that all memblock.reserved ranges must overlap
> memblock.memory ranges, so the memblock.reserved ranges also require
> an explicit initialization and the zones and nid ranges need to be
> extended to include all memblock.reserved ranges with struct pages, or
> they'll be left uninitialized with PagePoison as it happened to pfn 0.
 
I rather think that we should enforce the overlap between
memblock.reserved and memblock.memory. If there are regions that cannot
be mapped, we have other ways to deal with that (e.g. MEMBLOCK_NOMAP)
instead of dropping these ranges from memblock.memory.

> Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
> ---
>  include/linux/memblock.h | 17 +++++++++---
>  mm/debug.c               |  3 ++-
>  mm/memblock.c            |  4 +--
>  mm/page_alloc.c          | 57 ++++++++++++++++++++++++++++++++--------
>  4 files changed, 63 insertions(+), 18 deletions(-)
> 

-- 
Sincerely yours,
Mike.