linux-kernel - Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning


Message-ID: <20250131122803.000031aa@huawei.com>
Date: Fri, 31 Jan 2025 12:28:03 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Raghavendra K T <raghavendra.kt@....com>
CC: <linux-mm@...ck.org>, <akpm@...ux-foundation.org>,
	<lsf-pc@...ts.linux-foundation.org>, <bharata@....com>, <gourry@...rry.net>,
	<nehagholkar@...a.com>, <abhishekd@...a.com>, <ying.huang@...ux.alibaba.com>,
	<nphamcs@...il.com>, <hannes@...xchg.org>, <feng.tang@...el.com>,
	<kbusch@...a.com>, <Hasan.Maruf@....com>, <sj@...nel.org>,
	<david@...hat.com>, <willy@...radead.org>, <k.shutemov@...il.com>,
	<mgorman@...hsingularity.net>, <vbabka@...e.cz>, <hughd@...gle.com>,
	<rientjes@...gle.com>, <shy828301@...il.com>, <liam.howlett@...cle.com>,
	<peterz@...radead.org>, <mingo@...hat.com>, <nadav.amit@...il.com>,
	<shivankg@....com>, <ziy@...dia.com>, <jhubbard@...dia.com>,
	<AneeshKumar.KizhakeVeetil@....com>, <linux-kernel@...r.kernel.org>,
	<jon.grimm@....com>, <santosh.shukla@....com>, <Michael.Day@....com>,
	<riel@...riel.com>, <weixugc@...gle.com>, <leesuyeon0506@...il.com>,
	<honggyu.kim@...com>, <leillc@...gle.com>, <kmanaouil.dev@...il.com>,
	<rppt@...nel.org>, <dave.hansen@...el.com>
Subject: Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion
 based on PTE A bit scanning


> Here is the list of potential discussion points:
...

> 2. Possibility of maintaining single source of truth for page hotness that would
> maintain hot page information from multiple sources and let other sub-systems
> use that info.

Hi,

I was thinking of proposing a separate topic on a single source of hotness,
but this question covers it so I'll add some thoughts here instead.
I think we are very early, but sharing some experience and thoughts in a
session may be useful.

What do the other subsystems that want to use a single source of page hotness
want to be able to find out? (subject to filters like memory range, process etc)

A) How hot is page X?  
- Is this useful, or too much data? What would use it?
  * Application optimization maybe. Very handy for developing algorithms
    to do the rest of the options here as an Oracle!
- Provides both the cold and hot end of the scale, but maybe measurement
  techniques vary and cannot be easily combined. Hard in general to combine
  multiple sources of truth if aiming for an absolute number.

B) Which pages are super hot?
- Probably those that make the most difference if they are in a slower memory tier.

C) Which pages are hot enough to consider moving?
- This may be good enough to get the key data into the fast memory over time.
- Can combine sources of info as being able to compare precise numbers doesn't matter.

D) Which pages are fairly cold?
- Likewise maybe good enough over time.

E) Which pages are very cold?
- Ideal case for tiering. Swap these with the super hot ones.
- Maybe extra signal for swap / zswap etc

F) Did these hot pages remain hot (and same for cold)?
- This is needed to know when to back off doing things as we have unstable
  hotness (two phase applications are a pain for this), sampling a few
  pages may be fine.
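
As a rough illustration of option C, sources with incomparable units can
still be combined if each one is only compared against its own threshold
and the results are merged as votes. This is a minimal sketch: the function
name, the page numbers, the thresholds and the two example sources are all
made up for illustration and are not an existing kernel interface.

```python
# Hypothetical sketch: names and thresholds are illustrative, not a kernel API.

def hot_enough(sources, min_votes=1):
    """Return the set of pages that at least `min_votes` sources flag as hot.

    `sources` is a list of (counts, threshold) pairs, where `counts` maps a
    page frame number to that source's own access measure. Each source is
    compared only against its own threshold, so units never have to match.
    """
    votes = {}
    for counts, threshold in sources:
        for pfn, value in counts.items():
            if value >= threshold:
                votes[pfn] = votes.get(pfn, 0) + 1
    return {pfn for pfn, n in votes.items() if n >= min_votes}


# Two made-up sources with incomparable units.
a_bit_scan = {0x1000: 7, 0x2000: 1, 0x3000: 5}         # scans that saw the A bit set
hw_tracker = {0x2000: 900, 0x3000: 40000, 0x4000: 10}  # raw hardware access counts

candidates = hot_enough([(a_bit_scan, 5), (hw_tracker, 10000)])
```

Raising `min_votes` trades coverage for agreement between sources; neither
choice gives an absolute "how hot is page X" number, which is the point of
option C.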

Messy corners:

Temporal aspects.
- If only providing lists of hottest / coldest in last second, very hard
  to find those that are of a stable temperature. We end up moving
  very hot data (which is disruptive) and it doesn't stay hot.
- Can reduce that effect by long sampling windows on some measurement approaches
  (on hardware trackers that can trash accuracy due to resource exhaustion
   and other subtle effects).
- bistable / phase based applications are a pain but perhaps up to higher
  levels to back off.
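
One way to picture the stability problem is a filter that only reports pages
which appeared in the hot list for most of the recent sampling windows. The
sketch below assumes per-window hot lists are already available; the class
name and the `windows` / `min_hits` parameters are hypothetical.

```python
from collections import deque


class StableHotFilter:
    """Keep the last `windows` hot lists and report pages seen in enough of them."""

    def __init__(self, windows=4, min_hits=3):
        self.history = deque(maxlen=windows)  # oldest window drops off automatically
        self.min_hits = min_hits

    def add_window(self, hot_pfns):
        # Record the set of pages reported hot in one sampling window.
        self.history.append(frozenset(hot_pfns))

    def stable(self):
        # Count in how many retained windows each page was hot.
        hits = {}
        for window in self.history:
            for pfn in window:
                hits[pfn] = hits.get(pfn, 0) + 1
        return {pfn for pfn, n in hits.items() if n >= self.min_hits}


f = StableHotFilter(windows=4, min_hits=3)
for window in ([1, 2], [1, 3], [1, 2], [2, 4]):
    f.add_window(window)
stable_pages = f.stable()  # pages 3 and 4 were hot in only one window each
```

A two-phase application would still defeat this if its phases are longer
than the retained history, so it does not remove the need for a higher level
to back off.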

My main interest is migrating in tiered systems but good to look at what
else would use a common layer.

Mostly I want to know something that is useful to move, and assume convergence
over the long term with the best things to move, so to me the ideal layer has
the following interface (strawman so shoot holes in it!):

1) Give me up to X hotish pages from a slow tier (greater than a specific measure
of temperature)
2) Give me X coldish pages from a faster tier.
3) I expect to ask again in X seconds so please have some info ready for me!
4) (a path to get an idea of 'unhelpful moves' from earlier iterations - this
    is bleeding the tiering application into a shared interface though).
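
To make the four strawman calls concrete, here is a toy in-memory model of
such a layer. Everything in it is an assumption for illustration: the class
and method names, the string tier labels, the single numeric "temperature"
per page, and the choice to simply exclude pages reported as unhelpful.

```python
class HotnessLayer:
    """Toy model of the strawman interface; not an existing kernel API."""

    def __init__(self):
        self.pages = {}           # pfn -> (tier, temperature)
        self.next_query_s = None  # hint from call 3
        self.unhelpful = set()    # feedback from call 4

    def update(self, pfn, tier, temperature):
        # Fed by whatever measurement sources sit underneath.
        self.pages[pfn] = (tier, temperature)

    def hot_pages(self, tier, max_pages, min_temp):
        # 1) Up to max_pages pages in `tier` hotter than min_temp, hottest first.
        hot = [(temp, pfn) for pfn, (t, temp) in self.pages.items()
               if t == tier and temp > min_temp and pfn not in self.unhelpful]
        hot.sort(reverse=True)
        return [pfn for _, pfn in hot[:max_pages]]

    def cold_pages(self, tier, max_pages):
        # 2) Up to max_pages of the coldest pages in `tier`.
        cold = sorted((temp, pfn) for pfn, (t, temp) in self.pages.items()
                      if t == tier)
        return [pfn for _, pfn in cold[:max_pages]]

    def expect_query_in(self, seconds):
        # 3) Hint so the layer can schedule its measurement work.
        self.next_query_s = seconds

    def report_unhelpful(self, pfns):
        # 4) Feedback about moves that did not pay off.
        self.unhelpful.update(pfns)


layer = HotnessLayer()
for pfn, tier, temp in [(1, "slow", 90), (2, "slow", 10), (3, "slow", 60),
                        (4, "fast", 5), (5, "fast", 80)]:
    layer.update(pfn, tier, temp)

promote = layer.hot_pages("slow", max_pages=2, min_temp=50)
demote = layer.cold_pages("fast", max_pages=1)
layer.expect_query_in(10)
```

Note that `report_unhelpful` is where tiering policy leaks into the shared
layer, which is the concern raised in item 4.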

If we have multiple subsystems using the data we will need to resolve their
conflicting demands to generate good enough data with appropriate overhead.

I'd also like a virtualized solution for the case of hardware PA trackers (what
I have with CXL Hotness Monitoring Units) and the classic memory pool / stranding
avoidance case where the VM is the right entity to make migration decisions.
Making that interface convey what the kernel is going to use would be an
efficient option. I'd like to hide how the sausage was made from the VM.

Jonathan