[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200601133159.GM26955@MiWiFi-R3L-srv>
Date: Mon, 1 Jun 2020 21:31:59 +0800
From: Baoquan He <bhe@...hat.com>
To: Mike Rapoport <rppt@...ux.ibm.com>
Cc: Steve Wahl <steve.wahl@....com>,
David Hildenbrand <david@...hat.com>, mgorman@...e.de,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
akpm@...ux-foundation.org, cai@....pw, mhocko@...nel.org
Subject: Re: [PATCH] mm/compaction: Fix the incorrect hole in
fast_isolate_freepages()
On 06/01/20 at 02:42pm, Mike Rapoport wrote:
> On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote:
> > On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote:
> > > On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> > > > On 26.05.20 13:32, Mike Rapoport wrote:
> > > > > Hello Baoquan,
> > > > >
> > > > > I do not object to your original fix with careful check for pfn validity.
> > > > >
> > > > > But I still think that the memory reserved by the firmware is still
> > > > > memory and it should be added to memblock.memory. This way the memory
> > > >
> > > > If it's really memory that could be read/written, I think I agree.
> > >
> > > I would say some of them may not be allowed to be read/written, if I
> > > understand it correctly. I roughly went through the x86 init code, there
> > > are some places where mem region is marked as E820_TYPE_RESERVED so that
> > > they are not touched after initialization. E.g:
> > >
> > > 1) pfn 0
> > > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> > > see the code comment, this is a BIOS owned area, but not kernel RAM.
> > >
> > > 2)GART reserved region
> > > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> > > in an area, firmware designer won't map system RAM into that area.
> > >
> > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these
> > > regions are not system RAM backed area, reading from or writting into
> > > these area may cause error.
> > >
> > > Futhermore, there's a KASLR bug found by HPE, its triggering and root
> > > cause are written into below commit log. You can see that accessing to
> > > firmware reserved region caused BIOS to halt system when cpu doing
> > > speculative.
> > >
> > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> > > Author: Steve Wahl <steve.wahl@....com>
> > > Date: Tue Sep 24 16:03:55 2019 -0500
> > >
> > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> > >
> > > Our hardware (UV aka Superdome Flex) has address ranges marked
> > > reserved by the BIOS. Access to these ranges is caught as an error,
> > > causing the BIOS to halt the system.
> >
> > Thank you for CC'ing me. Coming into the middle of the conversation,
> > I am trying to understand the implications of what's being discussed
> > here, but not being very successful at it.
> >
> > For the HPE UV hardware, the addresses listed in the reserved range
> > must never be accessed, or (as we discovered) even be reachable by an
> > active page table entry. Any active page table entry covering the
> > region allows the processor to issue speculative accesses to the
> > region, resulting in the BIOS halt mentioned.
> >
> > I'm not sure if the discussion above about adding the region to
> > memblock.memory would result in an active mapping of the region or
> > not. If it did, we'd have problems.
>
> The discussion is whether regions marked as E820_RESERVED should be
> considered as RAM or not.
IMHO, the discussion is about whether firmware reserved regions like
E820_TYPE_RESERVED should be added into memory allocators.
>
> For the hardware that cannot tolerate acceses to these areas like HPE
> UV, it should not :)
>
> I still think that keeping them only in memblock.reserved creates more
> problems than it solves, but simply adding E820_RESERVED areas to
> memblock.memory just won't work.
Currently, we hardly add E820_RESERVED into memblock, including
memblock.memmory and memblock.reserved. But, in several places, people
add them to memblock.reserved when they have marked them as
E820_TYPE_RESERVED.
I agree with David that these pages aren't available for any later
memory allocator. "the node/zone is just completely irrelevant for
these pages". Keeping them out of the memory management system is safer
so that nobody won't touch them. I tried to mark them out with a new
type PageUnavail as David suggested, am still struggling to think of
where I should detect and exclude them.
Below is a draft patch about my attempt on marking out unavailable pages.
Any comment or suggestions are welcomed. Of course for any better idea
with code change, I would like to test and review. Hope we can finally fix
this issue after its exposure.
>From 25ca041258a40bb315f94a23f0465b510b1614a0 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@...hat.com>
Date: Fri, 29 May 2020 20:23:13 +0800
Subject: [RFC PATCH] mm: Add a new page type to mark unavailable pages
Firstly, the last user of early_pfn_valid() is memmap_init_zone(), but that code
has been optimized away. Let's remove it.
Secondly, add a new page type to mark unavailale pages.
Thirdly, add a new early_pfn_valid() so that init_unavailable_range()
can initialize those pages in memblock.reserved.
Signed-off-by: Baoquan He <bhe@...hat.com>
---
arch/x86/include/asm/mmzone_32.h | 2 --
include/linux/mmzone.h | 18 ++++++++++--------
include/linux/page-flags.h | 8 ++++++++
mm/page_alloc.c | 3 ++-
4 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
index 73d8dd14dda2..4da45eff2942 100644
--- a/arch/x86/include/asm/mmzone_32.h
+++ b/arch/x86/include/asm/mmzone_32.h
@@ -49,8 +49,6 @@ static inline int pfn_valid(int pfn)
return 0;
}
-#define early_pfn_valid(pfn) pfn_valid((pfn))
-
#endif /* CONFIG_DISCONTIGMEM */
#endif /* _ASM_X86_MMZONE_32_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e8d7d0b0acf4..47d09be73dca 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1315,20 +1315,23 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
#endif
#ifndef CONFIG_HAVE_ARCH_PFN_VALID
-static inline int pfn_valid(unsigned long pfn)
+static inline int early_pfn_valid(unsigned long pfn)
{
- struct mem_section *ms;
-
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
- ms = __nr_to_section(pfn_to_section_nr(pfn));
- if (!valid_section(ms))
+ return valid_section(__pfn_to_section(pfn))
+}
+
+static inline int pfn_valid(unsigned long pfn)
+{
+ if (!early_pfn_valid(pfn))
return 0;
/*
* Traditionally early sections always returned pfn_valid() for
* the entire section-sized span.
*/
- return early_section(ms) || pfn_section_valid(ms, pfn);
+ return (early_section(ms) && PageUnavail(pfn_to_page(pfn))) ||
+ pfn_section_valid(ms, pfn);
}
#endif
@@ -1364,7 +1367,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
#define pfn_to_nid(pfn) (0)
#endif
-#define early_pfn_valid(pfn) pfn_valid(pfn)
void sparse_init(void);
#else
#define sparse_init() do {} while (0)
@@ -1385,7 +1387,7 @@ struct mminit_pfnnid_cache {
};
#ifndef early_pfn_valid
-#define early_pfn_valid(pfn) (1)
+#define early_pfn_valid(pfn) pfn_valid(pfn)
#endif
void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..1c011008100c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -740,6 +740,7 @@ PAGEFLAG_FALSE(DoubleMap)
#define PG_kmemcg 0x00000200
#define PG_table 0x00000400
#define PG_guard 0x00000800
+#define PG_unavail 0x00001000
#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -796,6 +797,13 @@ PAGE_TYPE_OPS(Table, table)
*/
PAGE_TYPE_OPS(Guard, guard)
+/*
+ * Mark pages which are unavailable to memory allocators after boot.
+ * They includes pages reserved by firmware pre-boot or during boot,
+ * holes which share the same setcion with usable RAM.
+ */
+PAGE_TYPE_OPS(Unavail, unavail)
+
extern bool is_free_buddy_page(struct page *page);
__PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 940cdce96864..e9ebef6ffbec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6972,7 +6972,8 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
* (in memblock.reserved but not in memblock.memory) will
* get re-initialized via reserve_bootmem_region() later.
*/
- __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+ mm_zero_struct_page(pfn_to_page(pfn));
+ __SetPageUnavail(pfn_to_page(pfn));
__SetPageReserved(pfn_to_page(pfn));
pgcnt++;
}
--
2.17.2
Powered by blists - more mailing lists