[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250709131657.5660-1-harry.yoo@oracle.com>
Date: Wed, 9 Jul 2025 22:16:54 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Andrey Ryabinin <ryabinin.a.a@...il.com>,
Arnd Bergmann <arnd@...db.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...two.org>
Cc: "H . Peter Anvin" <hpa@...or.com>, Alexander Potapenko <glider@...gle.com>,
Andrey Konovalov <andreyknvl@...il.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Vincenzo Frascino <vincenzo.frascino@....com>,
Juergen Gross <jgross@...e.com>, Kevin Brodsky <kevin.brodsky@....com>,
Muchun Song <muchun.song@...ux.dev>,
Oscar Salvador <osalvador@...e.de>,
Joao Martins <joao.m.martins@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Jane Chu <jane.chu@...cle.com>, Alistair Popple <apopple@...dia.com>,
Mike Rapoport <rppt@...nel.org>,
Gwan-gyeong Mun <gwan-gyeong.mun@...el.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.ibm.com>, x86@...nel.org,
linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
linux-mm@...ck.org, Harry Yoo <harry.yoo@...cle.com>
Subject: [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables
Hi all,
During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount
of persistent memory:
BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
<TASK>
__init_zone_device_page+0x17/0x5d
memmap_init_zone_device+0x154/0x1bb
pagemap_range+0x2e0/0x40f
memremap_pages+0x10b/0x2f0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0xce/0x2ec [device_dax]
dax_bus_probe+0x6d/0xc9
[... snip ...]
</TASK>
It turns out that the kernel panics while initializing vmemmap
(struct page array) when the vmemmap region spans two PGD entries,
because the new PGD entry is only installed in init_mm.pgd,
but not in the page tables of other tasks.
And looking at __populate_section_memmap():
if (vmemmap_can_optimize(altmap, pgmap))
// does not sync top level page tables
r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
// sync top level page tables in x86
r = vmemmap_populate(start, end, nid, altmap);
In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (See commit 9b861528a801
("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
changes")) so that all tasks in the system can see the new vmemmap area.
However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.
We're not the first party to encounter a crash caused by not-sync'd
top level page tables: earlier this year, Gwan-gyeong Mun attempted to
address the issue [1] [2] after hitting a kernel panic when x86 code
accessed the vmemmap area before the corresponding top-level entries
were synced. At that time, the issue was believed to be triggered
only when struct page was enlarged for debugging purposes, and the patch
did not get further updates.
It turns out that current approach of relying on each arch to handle
the page table sync manually is fragile because 1) it's easy to forget
to sync the top level page table, and 2) it's also easy to overlook that
the kernel should not access vmemmap / direct mapping area before the sync.
To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating kernel portion
of the page tables and allow each architecture to explicitly perform
synchronization when installing top-level entries. With this approach,
we no longer need to worry about missing the sync step, reducing the risk
of future regressions.
This patch series implements Dave Hansen's suggestion and hence added
Suggested-by: Dave Hansen.
This is an RFC primarily because this involves non-trivial change and
I would like to fix it in a way that aligns with community consensus.
Cc stable because lack of this series opens the door to intermittent
boot failures.
[1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com
[2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com
[3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com
[4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com
Harry Yoo (3):
mm: introduce and use {pgd,p4d}_populate_kernel()
x86/mm: define p*d_populate_kernel() and top level page table sync
x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant
arch/x86/include/asm/pgalloc.h | 41 +++++++++
arch/x86/mm/init_64.c | 147 ++++++++++++++++-----------------
arch/x86/mm/kasan_init_64.c | 8 +-
include/asm-generic/pgalloc.h | 4 +
include/linux/pgalloc.h | 0
mm/kasan/init.c | 10 +--
mm/percpu.c | 4 +-
mm/sparse-vmemmap.c | 4 +-
8 files changed, 130 insertions(+), 88 deletions(-)
create mode 100644 include/linux/pgalloc.h
--
2.43.0
Powered by blists - more mailing lists