Message-ID: <X9dvfDcSrlEj5y6K@redhat.com>
Date: Mon, 14 Dec 2020 08:58:20 -0500
From: Andrea Arcangeli <aarcange@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: Mike Rapoport <rppt@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Mel Gorman <mgorman@...e.de>,
Michal Hocko <mhocko@...nel.org>,
Mike Rapoport <rppt@...ux.ibm.com>, Qian Cai <cai@....pw>,
Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, stable@...r.kernel.org,
Dan Williams <dan.j.williams@...el.com>
Subject: Re: [PATCH v2 1/2] mm: memblock: enforce overlap of memory.memblock
 and memory.reserved

On Mon, Dec 14, 2020 at 12:18:07PM +0100, David Hildenbrand wrote:
> On 14.12.20 12:12, Mike Rapoport wrote:
> > On Mon, Dec 14, 2020 at 11:11:35AM +0100, David Hildenbrand wrote:
> >> On 09.12.20 22:43, Mike Rapoport wrote:
> >>> From: Mike Rapoport <rppt@...ux.ibm.com>
> >>>
> >>> memblock does not require that the reserved memory ranges will be a subset
> >>> of memblock.memory.
> >>>
> >>> As a result there may be reserved pages that are not in the range of any
> >>> zone or node because zone and node boundaries are detected based on
> >>> memblock.memory and pages that are only present in memblock.reserved are not
> >>> taken into account during zone/node size detection.
> >>>
> >>> Make sure that all ranges in memblock.reserved are added to memblock.memory
> >>> before calculating node and zone boundaries.
> >>>
> >>> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> >>> Reported-by: Andrea Arcangeli <aarcange@...hat.com>
> >>> Signed-off-by: Mike Rapoport <rppt@...ux.ibm.com>
> >>> ---
> >>> include/linux/memblock.h | 1 +
> >>> mm/memblock.c | 24 ++++++++++++++++++++++++
> >>> mm/page_alloc.c | 7 +++++++
> >>> 3 files changed, 32 insertions(+)
> >>>
> >>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >>> index ef131255cedc..e64dae2dd1ce 100644
> >>> --- a/include/linux/memblock.h
> >>> +++ b/include/linux/memblock.h
> >>> @@ -120,6 +120,7 @@ int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
> >>> unsigned long memblock_free_all(void);
> >>> void reset_node_managed_pages(pg_data_t *pgdat);
> >>> void reset_all_zones_managed_pages(void);
> >>> +void memblock_enforce_memory_reserved_overlap(void);
> >>>
> >>> /* Low level functions */
> >>> void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
> >>> diff --git a/mm/memblock.c b/mm/memblock.c
> >>> index b68ee86788af..9277aca642b2 100644
> >>> --- a/mm/memblock.c
> >>> +++ b/mm/memblock.c
> >>> @@ -1857,6 +1857,30 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
> >>> }
> >>> }
> >>>
> >>> +/**
> >>> + * memblock_enforce_memory_reserved_overlap - make sure every range in
> >>> + * @memblock.reserved is covered by @memblock.memory
> >>> + *
> >>> + * The data in @memblock.memory is used to detect zone and node boundaries
> >>> + * during initialization of the memory map and the page allocator. Make
> >>> + * sure that every memory range present in @memblock.reserved is also added
> >>> + * to @memblock.memory even if the architecture specific memory
> >>> + * initialization failed to do so
> >>> + */
> >>> +void __init memblock_enforce_memory_reserved_overlap(void)
> >>> +{
> >>> + phys_addr_t start, end;
> >>> + int nid;
> >>> + u64 i;
> >>> +
> >>> + __for_each_mem_range(i, &memblock.reserved, &memblock.memory,
> >>> + NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, &nid) {
> >>> + pr_warn("memblock: reserved range [%pa-%pa] is not in memory\n",
> >>> + &start, &end);
> >>> + memblock_add_node(start, (end - start), nid);
> >>> + }
> >>> +}
> >>> +
> >>> void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> >>> {
> >>> memblock.current_limit = limit;
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index eaa227a479e4..dbc57dbbacd8 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -7436,6 +7436,13 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >>> memset(arch_zone_highest_possible_pfn, 0,
> >>> sizeof(arch_zone_highest_possible_pfn));
> >>>
> >>> + /*
> >>> + * Some architectures (e.g. x86) have reserved pages outside of
> >>> + * memblock.memory. Make sure these pages are taken into account
> >>> + * when detecting zone and node boundaries
> >>> + */
> >>> + memblock_enforce_memory_reserved_overlap();
> >>> +
> >>> start_pfn = find_min_pfn_with_active_regions();
> >>> descending = arch_has_descending_max_zone_pfns();
> >>>
> >>>
> >>
> >> CCing Dan.
> >>
> >> This implies that any memory that is E820_TYPE_SOFT_RESERVED that was
> >> reserved via memblock_reserve() will be added via memblock_add_node() as
> >> well, resulting in all such memory getting a memmap allocated right when
> >> booting up, right?
> >>
> >> IIRC, there are use cases where that is absolutely not desired.
> >
> > Hmm, if this is the case we need an entirely different solution to ensure
> > that we don't have partial pageblocks in a zone and we have all the
> > memory map initialized to a known state.
> >
> >> Am I missing something? (@Dan?)
> >
> > BTW, @Dan, why did you need to memblock_reserve(E820_TYPE_SOFT_RESERVED)
> > without memblock_add()ing it?
>
> I suspect to cover cases where it might partially span memory sections
> (or even sub-sections). Maybe we should focus on initializing that part
> only - meaning, not adding all memory to .memory but only !section
> aligned pieces.

We had that information left in the memblock data structure with the
previous implementation in -mm (before adding all memblock.reserved to
memblock.memory). To avoid destroying that information we'll need a
new flag for each range that was not originally in memblock.memory:

===
What you suggest would require adding extra information to flag which
ranges must not have a direct mapping, but that information is already
in memblock today, for each range in memblock.reserved but not in
memblock.memory. Or did I misunderstand how that no-direct-map detail works?
===

I guess I was too optimistic that this was already implemented, thanks
for noticing.
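
To make the "in memblock.reserved but not in memblock.memory" set
concrete, here is a minimal userspace sketch of that set difference,
with each resulting range tagged by a flag so the information survives
a later merge into the memory list. It is only a model: struct range,
RANGE_IMPLICIT and reserved_not_in_memory() are names made up for
illustration and are not kernel symbols, and it assumes both input
arrays are sorted by start address with no overlaps inside an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open physical range [start, end). Userspace stand-in only,
 * not the kernel's struct memblock_region. */
struct range {
	uint64_t start;
	uint64_t end;
	unsigned int flags;
};

/* Hypothetical marker for "was not originally in memory";
 * not an existing MEMBLOCK_* flag. */
#define RANGE_IMPLICIT 0x1u

/*
 * Write to out[] the parts of reserved[] that no memory[] range covers.
 * Both inputs must be sorted by start and non-overlapping.
 * Returns the number of ranges written, never more than out_cap.
 */
size_t reserved_not_in_memory(const struct range *reserved, size_t nr_res,
			      const struct range *memory, size_t nr_mem,
			      struct range *out, size_t out_cap)
{
	size_t n = 0, m = 0;

	assert(out != NULL || out_cap == 0);

	for (size_t r = 0; r < nr_res; r++) {
		uint64_t cur = reserved[r].start;
		uint64_t end = reserved[r].end;

		/* Memory ranges ending at or before cur cannot overlap
		 * this or any later reserved range. */
		while (m < nr_mem && memory[m].end <= cur)
			m++;

		for (size_t k = m; cur < end; k++) {
			int overlaps = k < nr_mem && memory[k].start < end;
			uint64_t gap_end = overlaps ? memory[k].start : end;

			/* [cur, gap_end) is reserved but not in memory */
			if (gap_end > cur) {
				if (n == out_cap)
					return n;
				out[n++] = (struct range){ cur, gap_end,
							   RANGE_IMPLICIT };
			}
			if (!overlaps)
				break;
			/* Continue after the covered part */
			if (memory[k].end > cur)
				cur = memory[k].end;
		}
	}
	return n;
}
```

With reserved = {[0x1000,0x3000), [0x8000,0x9000)} and memory =
{[0x0,0x2000), [0x2800,0x6000)} this yields [0x2000,0x2800) and
[0x8000,0x9000), both carrying the illustrative flag.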

For the record, I didn't have time to test the new implementation
yet. Since I'm running the "hack" on all machines, things have been
stable on v5.9. I'm actually curious whether the hack would also fail
to boot on the CI system; that would at least help localize the issue
to the implicit memblock_add. The memblock debug output won't give
us a direct reproducer, but we can try to generate one by reproducing
the same e820 map in seabios.

Andrea