lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aH4ofKPyQXoBnLnL@harry>
Date: Mon, 21 Jul 2025 20:46:04 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Andrey Ryabinin <ryabinin.a.a@...il.com>,
        Arnd Bergmann <arnd@...db.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
        Christoph Lameter <cl@...two.org>
Cc: "H . Peter Anvin" <hpa@...or.com>, Alexander Potapenko <glider@...gle.com>,
        Andrey Konovalov <andreyknvl@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Vincenzo Frascino <vincenzo.frascino@....com>,
        Juergen Gross <jgross@...e.com>, Kevin Brodsky <kevin.brodsky@....com>,
        Muchun Song <muchun.song@...ux.dev>,
        Oscar Salvador <osalvador@...e.de>,
        Joao Martins <joao.m.martins@...cle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Jane Chu <jane.chu@...cle.com>, Alistair Popple <apopple@...dia.com>,
        Mike Rapoport <rppt@...nel.org>, David Hildenbrand <david@...hat.com>,
        Gwan-gyeong Mun <gwan-gyeong.mun@...el.com>,
        Joerg Roedel <joro@...tes.org>, Uladzislau Rezki <urezki@...il.com>,
        "Aneesh Kumar K . V" <aneesh.kumar@...ux.ibm.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v2 mm-hotfixes 0/5] mm, arch: a more robust approach to
 sync top level kernel page tables

On Mon, Jul 21, 2025 at 08:57:19AM +0900, Harry Yoo wrote:
> On Mon, Jul 21, 2025 at 08:41:58AM +0900, Harry Yoo wrote:
> > RFC v1: https://lore.kernel.org/linux-mm/20250709131657.5660-1-harry.yoo@oracle.com
> > 
> > RFC v1 -> v2:
> > - Dropped RFC tag.
> > - Exposed page table sync code to common code (Mike Rapoport).
> > - Used only one Fixes: tag in patch 3 instead of two,
> >   to avoid confusion (Andrew Morton)
> > - Reused existing ARCH_PAGE_TABLE_SYNC_MASK and
> >   arch_sync_kernel_mappings() facility (currently used by vmalloc and
> >   ioremap) forpage table sync instead of introducing a new interface.
> > 
> > A quick question: Technically, patch 4 and 5 don't necessarily need to be
> > backported. Does it make sense to backport only patch 1-3?
> >
> > # The problem: It is easy to miss/overlook page table synchronization
> > 
> > Hi all,
> 
> Looks like I forgot to Cc: Uladzislau and Joerg.. adding them to Cc.

Apologizes for kernel bot reports! I should have tested it on non x86-64
architectures.

Looking at the kernel test robot reports it seems:

- ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() should be
  moved to <linux/pgtable.h> instead of <asm/pgalloc.h> because x86-32
  and arm exposes ARCH_PAGE_TABLE_SYNC_MASK via <linux/pgtable.h> which
  in turn includes <asm/pgtable.h>

  I'd keep p*d_populate_kernel() in include/asm-generic/pgalloc.h, but
  move the others to <linux/pgtable.h> and include it in
  include/asm-generic/pgalloc.h.

- On x86-64, ARCH_PAGE_TABLE_SYNC_MASK should be defined in
  arch/x86/include/asm/pgtable_64_types.h to align with x86-32.

Will repost v3 with changes mentioned above hopefully in a few days.

If you have any further feedback, please let me know!

-- 
Cheers,
Harry / Hyeonggon

> > During our internal testing, we started observing intermittent boot
> > failures when the machine uses 4-level paging and has a large amount
> > of persistent memory:
> > 
> >   BUG: unable to handle page fault for address: ffffe70000000034
> >   #PF: supervisor write access in kernel mode
> >   #PF: error_code(0x0002) - not-present page
> >   PGD 0 P4D 0 
> >   Oops: 0002 [#1] SMP NOPTI
> >   RIP: 0010:__init_single_page+0x9/0x6d
> >   Call Trace:
> >    <TASK>
> >    __init_zone_device_page+0x17/0x5d
> >    memmap_init_zone_device+0x154/0x1bb
> >    pagemap_range+0x2e0/0x40f
> >    memremap_pages+0x10b/0x2f0
> >    devm_memremap_pages+0x1e/0x60
> >    dev_dax_probe+0xce/0x2ec [device_dax]
> >    dax_bus_probe+0x6d/0xc9
> >    [... snip ...]
> >    </TASK>
> > 
> > It turns out that the kernel panics while initializing vmemmap
> > (struct page array) when the vmemmap region spans two PGD entries,
> > because the new PGD entry is only installed in init_mm.pgd,
> > but not in the page tables of other tasks.
> > 
> > And looking at __populate_section_memmap():
> >   if (vmemmap_can_optimize(altmap, pgmap))                                
> >           // does not sync top level page tables
> >           r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> >   else                                                                    
> >           // sync top level page tables in x86
> >           r = vmemmap_populate(start, end, nid, altmap);
> > 
> > In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
> > synchronizes the top level page table (See commit 9b861528a801
> > ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
> > changes")) so that all tasks in the system can see the new vmemmap area.
> > 
> > However, when vmemmap_can_optimize() returns true, the optimized path
> > skips synchronization of top-level page tables. This is because
> > vmemmap_populate_compound_pages() is implemented in core MM code, which
> > does not handle synchronization of the top-level page tables. Instead,
> > the core MM has historically relied on each architecture to perform this
> > synchronization manually.
> > 
> > We're not the first party to encounter a crash caused by not-sync'd
> > top level page tables: earlier this year, Gwan-gyeong Mun attempted to
> > address the issue [1] [2] after hitting a kernel panic when x86 code
> > accessed the vmemmap area before the corresponding top-level entries
> > were synced. At that time, the issue was believed to be triggered
> > only when struct page was enlarged for debugging purposes, and the patch
> > did not get further updates.
> > 
> > It turns out that current approach of relying on each arch to handle
> > the page table sync manually is fragile because 1) it's easy to forget
> > to sync the top level page table, and 2) it's also easy to overlook that
> > the kernel should not access the vmemmap and direct mapping areas before
> > the sync.
> > 
> > # The solution: Make page table sync more code robust 
> > 
> > To address this, Dave Hansen suggested [3] [4] introducing
> > {pgd,p4d}_populate_kernel() for updating kernel portion
> > of the page tables and allow each architecture to explicitly perform
> > synchronization when installing top-level entries. With this approach,
> > we no longer need to worry about missing the sync step, reducing the risk
> > of future regressions.
> > 
> > The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK,
> > PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
> > vmalloc and ioremap to synchronize page tables.
> > 
> > pgd_populate_kernel() looks like this:
> >   #define pgd_populate_kernel(addr, pgd, p4d)                    \               
> >   do {                                                           \               
> >          pgd_populate(&init_mm, pgd, p4d);                       \               
> >          if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)     \               
> >                  arch_sync_kernel_mappings(addr, addr);          \               
> >   } while (0) 
> > 
> > It is worth noting that vmalloc() and apply_to_range() carefully
> > synchronizes page tables by calling p*d_alloc_track() and
> > arch_sync_kernel_mappings(), and thus they are not affected by
> > this patch series.
> > 
> > This patch series was hugely inspired by Dave Hansen's suggestion and
> > hence added Suggested-by: Dave Hansen.
> > 
> > Cc stable because lack of this series opens the door to intermittent
> > boot failures.
> > 
> > [1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com
> > [2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com
> > [3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com
> > [4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com 
> > 
> > Harry Yoo (5):
> >   mm: move page table sync declarations to asm/pgalloc.h
> >   mm: introduce and use {pgd,p4d}_populate_kernel()
> >   x86/mm: define ARCH_PAGE_TABLE_SYNC_MASK and
> >     arch_sync_kernel_mappings()
> >   x86/mm: convert p*d_populate{,_init} to _kernel variants
> >   x86/mm: drop unnecessary calls to sync_global_pgds() and fold into its
> >     sole user
> > 
> >  arch/x86/include/asm/pgalloc.h | 22 ++++++++++++++++++++
> >  arch/x86/mm/init_64.c          | 37 +++++++++++++++++++---------------
> >  arch/x86/mm/kasan_init_64.c    |  8 ++++----
> >  include/asm-generic/pgalloc.h  | 30 +++++++++++++++++++++++++++
> >  include/linux/vmalloc.h        | 16 ---------------
> >  mm/kasan/init.c                | 10 ++++-----
> >  mm/percpu.c                    |  4 ++--
> >  mm/sparse-vmemmap.c            |  4 ++--
> >  mm/vmalloc.c                   |  1 +
> >  9 files changed, 87 insertions(+), 45 deletions(-)
> > 
> > -- 
> > 2.43.0
> >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ