Message-ID: <20250814134826.154003-1-bharata@amd.com>
Date: Thu, 14 Aug 2025 19:18:19 +0530
From: Bharata B Rao <bharata@....com>
To: <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
CC: <Jonathan.Cameron@...wei.com>, <dave.hansen@...el.com>,
	<gourry@...rry.net>, <hannes@...xchg.org>, <mgorman@...hsingularity.net>,
	<mingo@...hat.com>, <peterz@...radead.org>, <raghavendra.kt@....com>,
	<riel@...riel.com>, <rientjes@...gle.com>, <sj@...nel.org>,
	<weixugc@...gle.com>, <willy@...radead.org>, <ying.huang@...ux.alibaba.com>,
	<ziy@...dia.com>, <dave@...olabs.net>, <nifan.cxl@...il.com>,
	<xuezhengchu@...wei.com>, <yiannis@...corp.com>, <akpm@...ux-foundation.org>,
	<david@...hat.com>, <byungchul@...com>, <kinseyho@...gle.com>,
	<joshua.hahnjy@...il.com>, <yuanchu@...gle.com>, <balbirs@...dia.com>,
	Bharata B Rao <bharata@....com>
Subject: [RFC PATCH v1 0/7] A subsystem for hot page detection and promotion

Hi,

This patchset is about adding a dedicated sub-system for maintaining
hot pages information from the lower tiers and promoting the hot pages
to the top tiers. It exposes an API that other sub-systems which detect
accesses can use to report those accesses for further processing. Further
processing includes system-wide accumulation of memory access information
at PFN granularity, classification of the PFNs as hot, and promotion of
hot pages using per-node kernel threads. This is a continuation of the
earlier kpromoted work [1] that I posted a while back.
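
The reporting API itself is not shown in this cover letter; as a rough
userspace model of the idea (all names, the table size and the source
enum are made up for illustration, not taken from the patchset), the
shape might be:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shape of the access-reporting entry point: a source
 * (IBS, MGLRU PTE-A scan, ...) reports one access to a PFN along with
 * the intended promotion target node. */
enum pghot_src { PGHOT_SRC_IBS, PGHOT_SRC_MGLRU };

#define NR_PFNS 1024            /* toy table instead of kernel hash lists */
static unsigned int freq[NR_PFNS];

/* Record one access; return the accumulated frequency for the PFN. */
static unsigned int record_access(uint64_t pfn, int nid, enum pghot_src src)
{
    (void)nid; (void)src;       /* unused in this toy model */
    return ++freq[pfn % NR_PFNS];
}
```

The real subsystem accumulates this per-PFN state system-wide; the toy
table above only illustrates the per-access update.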

Kernel thread based async batch migration [2] was an off-shoot of
this effort that attempted to batch the migrations from NUMA
balancing by creating a separate kernel thread for migration.
Per-page hotness information was stored as part of extended page
flags. The kernel thread then scanned the entire PFN space to pick
the PFNs that are classified as hot.

The observed challenges from the previous approaches were these:

1. Too many PFNs need to be scanned to identify the hot PFNs in
   approach [2].
2. Hot page records stored in hash lists become unwieldy for
   extracting the required hot pages in approach [1].
3. The trade-off between dynamic allocation and static reservation of
   space to store per-page hotness information.

This series tries to address challenges 1 and 2 by maintaining
the hot page records in hash lists for quick lookup and maintaining
a separate per-target-node max heap for storing ready-to-migrate
hot page records. The records in the heap are priority-ordered based
on the "hotness" of the page.

The API for reporting page accesses remains unchanged from [1].
When a page access is recorded, the hotness data of the page is
updated and, if it crosses a threshold, the page gets tracked in
the heap as well. These heaps are per-target-node, and the
corresponding migrate threads periodically extract the top records
from them and do batch migration.
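
As a self-contained userspace sketch of the heap side (the names, the
single-node simplification, and the threshold value are invented for
illustration; the kernel implementation will differ):

```c
#include <assert.h>
#include <stdint.h>

#define HEAP_CAP   64
#define HOT_THRESH 4        /* toy promotion threshold */

struct hot_rec { uint64_t pfn; unsigned int hotness; };

/* One max heap per target node; only one node is modeled here. */
static struct hot_rec heap[HEAP_CAP];
static int heap_len;

static void heap_push(uint64_t pfn, unsigned int hotness)
{
    int i = heap_len++;
    heap[i] = (struct hot_rec){ pfn, hotness };
    /* Sift up: keep the hottest record at the root. */
    while (i > 0 && heap[(i - 1) / 2].hotness < heap[i].hotness) {
        struct hot_rec tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* Pop the hottest record, as a migrate thread would when building
 * a batch of ready-to-migrate pages. */
static struct hot_rec heap_pop(void)
{
    struct hot_rec top = heap[0];
    heap[0] = heap[--heap_len];
    int i = 0;
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < heap_len && heap[l].hotness > heap[m].hotness) m = l;
        if (r < heap_len && heap[r].hotness > heap[m].hotness) m = r;
        if (m == i) break;
        struct hot_rec tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
    return top;
}

/* Called when an access is recorded: once a page's hotness crosses
 * the threshold, it becomes a candidate in the per-node heap. */
static void on_access_recorded(uint64_t pfn, unsigned int hotness)
{
    if (hotness >= HOT_THRESH && heap_len < HEAP_CAP)
        heap_push(pfn, hotness);
}
```

The point of the heap is that the migrate thread never scans the full
PFN space (challenge 1) and never walks long hash chains to find the
hottest candidates (challenge 2); it just pops the top of the heap.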

In the current series, two page temperature sources are included
as examples.

1. IBS based memory access profiler.
2. PTE-A bit based access profiler for MGLRU. (from Kinsey Ho)

TODOs:

- Currently only access frequency is used to calculate the hotness.
  We could have a scalar hotness indicator based on both frequency
  of access and time of access.
- There could be millions of allocations and freeings of records,
  some from atomic contexts too. Need to understand how problematic
  this could be. Approach [2] mitigated this by having pre-allocated
  hotness records for each page as part of extended page flags.
- The amount of data needed for tracking hotness is also a concern.
  There is scope for packing the three parameters (nid, time, frequency)
  in a more compact manner which I will attempt in next iterations.
- Migration rate-limiting needs to be added.
- Very lightly tested at the moment, as the current focus is to get
  the hot data arrangement right.
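
For the packing TODO above, one common approach is a fixed bitfield
split of a single 64-bit word. The widths below are arbitrary and
purely illustrative; the real split would be chosen to fit
MAX_NUMNODES and the timestamp resolution actually needed:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative split: 10 bits node id + 38 bits coarse timestamp
 * + 16 bits access frequency = 64 bits. */
#define NID_BITS   10
#define TIME_BITS  38
#define FREQ_BITS  16

#define NID_SHIFT  (TIME_BITS + FREQ_BITS)
#define TIME_SHIFT FREQ_BITS

static uint64_t pack_hotness(uint64_t nid, uint64_t time, uint64_t freq)
{
    return (nid << NID_SHIFT) |
           ((time & ((1ULL << TIME_BITS) - 1)) << TIME_SHIFT) |
           (freq & ((1ULL << FREQ_BITS) - 1));
}

static uint64_t unpack_nid(uint64_t v)  { return v >> NID_SHIFT; }
static uint64_t unpack_time(uint64_t v)
{
    return (v >> TIME_SHIFT) & ((1ULL << TIME_BITS) - 1);
}
static uint64_t unpack_freq(uint64_t v)
{
    return v & ((1ULL << FREQ_BITS) - 1);
}
```

This keeps the per-page footprint to one word at the cost of capping
the frequency counter and coarsening the timestamp.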

Regards,
Bharata.

[1] Kpromoted - https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
[2] Kmigrated - https://lore.kernel.org/linux-mm/20250616133931.206626-1-bharata@amd.com/

Bharata B Rao (4):
  mm: migrate: Allow misplaced migration without VMA too
  mm: Hot page tracking and promotion
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 +
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 343 +++++++++++++++++++
 include/linux/migrate.h             |   6 +
 include/linux/mmzone.h              |  16 +
 include/linux/pghot.h               |  87 +++++
 include/linux/vm_event_item.h       |  26 ++
 mm/Kconfig                          |  19 ++
 mm/Makefile                         |   2 +
 mm/internal.h                       |   4 +
 mm/klruscand.c                      | 118 +++++++
 mm/migrate.c                        |  36 +-
 mm/mm_init.c                        |  10 +
 mm/pghot.c                          | 501 ++++++++++++++++++++++++++++
 mm/vmscan.c                         | 176 +++++++---
 mm/vmstat.c                         |  26 ++
 20 files changed, 1365 insertions(+), 49 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/klruscand.c
 create mode 100644 mm/pghot.c

-- 
2.34.1

