Date:   Wed, 25 Nov 2020 23:04:14 +0200
From:   Mike Rapoport <rppt@...ux.ibm.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        Qian Cai <cai@....pw>, Michal Hocko <mhocko@...nel.org>,
        linux-kernel@...r.kernel.org, Baoquan He <bhe@...hat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >
> > What would need to call pfn_zone in between first and second stage?
> > 
> > If something calls pfn_zone in between first and second stage isn't it
> > a feature if it crashes the kernel at boot?
> > 
> > Note: I suggested 0xff kernel crashing "until the second stage comes
> > around" during meminit at boot, not permanently.
> 
> Yes, then it makes sense - if we're able to come up with a way to
> initialize any memmap we might have - including actual memory holes that
> have a memmap.
> 
> > 
> > 		/*
> > 		 * Use a fake node/zone (0) for now. Some of these pages
> > 		 * (in memblock.reserved but not in memblock.memory) will
> > 		 * get re-initialized via reserve_bootmem_region() later.
> > 		 */
> > 
> > Specifically I relied on the comment "get re-initialized via
> > reserve_bootmem_region() later".
> 
> Yes, but there is a "Some of these" :)
> 
> Boot a VM with "-m 4000" and observe the memmap in the last section -
> they won't get initialized a second time.
> 
> > 
> > I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> > value, which is clearly not happening, hence it'd be preferable to get
> > a crash at boot reliably.
> > 
> > Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> > calling init_reserved_page(start_pfn) won't do much with
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> > page->flags were still wrong for reserved pages in the "Unknown E820
> > type" region.

I think the very root cause is how e820__memblock_setup() registers
memory with memblock:

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);

From that point on, the system has an inconsistent view of RAM across
memblock.memory and memblock.reserved, and that inconsistency is then
carried over into the memmap, etc.

Unfortunately, simply adding all RAM to memblock is not possible, as
there are systems for which "the addresses listed in the reserved
range must never be accessed, or (as we discovered) even be reachable by
an active page table entry" [1].

[1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

-- 
Sincerely yours,
Mike.
