linux-kernel - Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning

Message-Id: <20250207195210.43726-1-sj@kernel.org>
Date: Fri,  7 Feb 2025 11:52:10 -0800
From: SeongJae Park <sj@...nel.org>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>
Cc: SeongJae Park <sj@...nel.org>,
	Bharata B Rao <bharata@....com>,
	Raghavendra K T <raghavendra.kt@....com>,
	linux-mm@...ck.org,
	akpm@...ux-foundation.org,
	lsf-pc@...ts.linux-foundation.org,
	gourry@...rry.net,
	nehagholkar@...a.com,
	abhishekd@...a.com,
	nphamcs@...il.com,
	hannes@...xchg.org,
	feng.tang@...el.com,
	kbusch@...a.com,
	Hasan.Maruf@....com,
	david@...hat.com,
	willy@...radead.org,
	k.shutemov@...il.com,
	mgorman@...hsingularity.net,
	vbabka@...e.cz,
	hughd@...gle.com,
	rientjes@...gle.com,
	shy828301@...il.com,
	liam.howlett@...cle.com,
	peterz@...radead.org,
	mingo@...hat.com,
	nadav.amit@...il.com,
	shivankg@....com,
	ziy@...dia.com,
	jhubbard@...dia.com,
	AneeshKumar.KizhakeVeetil@....com,
	linux-kernel@...r.kernel.org,
	jon.grimm@....com,
	santosh.shukla@....com,
	Michael.Day@....com,
	riel@...riel.com,
	weixugc@...gle.com,
	leesuyeon0506@...il.com,
	honggyu.kim@...com,
	leillc@...gle.com,
	kmanaouil.dev@...il.com,
	rppt@...nel.org,
	dave.hansen@...el.com
Subject: Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning

On Fri, 07 Feb 2025 16:10:47 +0800 "Huang, Ying" <ying.huang@...ux.alibaba.com> wrote:

> SeongJae Park <sj@...nel.org> writes:
> 
> > On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@....com> wrote:
> >
> >> On 26-Jan-25 7:57 AM, Huang, Ying wrote:
> >> > Hi, Raghavendra,
> >> > 
> >> > Raghavendra K T <raghavendra.kt@....com> writes:
[...]
> >> > The drawbacks of asynchronous scanning include
> >> > 
> >> > - The CPU cycles used are not charged properly
> >> > 
> >> > - There may be no idle CPU cycles to use
> >> > 
> >> > - The scanning CPU may not be near enough to the workload CPUs
> >> > 
> >> > It's better to involve Mel and Peter in the discussion for this.
> >> 
> >> They are CC'ed in this thread and hopefully have insights to share.
> >> 
> >> Charging CPU cycles to the right process has been brought up in other
> >> similar contexts. A recent one is from page migration batching and using
> >> multiple threads for migration -
> >> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com/
> >> 
> >> Does it make sense to treat hot page promotion from slow tiers 
> >> differently compared to locality based balancing? I mean couldn't the 
> >> charging of this async thread be similar to the cycles spent by other 
> >> system threads like kcompactd and khugepaged?
> >
> > I'm up for this idea.
> >
> > I agree that fairness is something we need to be aware of.  But IMHO, it is
> > something the async approach can be further improved for, not a strict
> > blocker for now.
> 
> Personally, I have no objection to async operations in general.
> However, we may need to find some way to control these async operations
> instead of adding more and more background kthreads blindly.  How to
> charge and constrain the resources used by these async operations is
> important too.  For example, some users may want to bind some async
> operations to some CPUs.
> 
> IMHO, we should think about the requirements and possible solutions
> instead of ignoring the issues.

I agree.  For DAMON, we implemented the DAMOS quotas feature for such resource
control.  We also had a (non-public) discussion about splitting the DAMON
thread into a monitoring part and an operation schemes execution part for finer
control.  I'm also thinking about adding quotas for the monitoring part's
resource consumption.  We haven't implemented these ideas yet, though, since
the real-world requirements are unclear as of now.  We will keep collecting the
requirements and prioritize those ideas or make another solution as the
requirements become clearer.
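
To illustrate the kind of control meant here, below is a minimal Python sketch
of a budget that caps both time and bytes spent per reset interval, like the
"100 ms / 1 GiB per 1 s" style of quota.  This is not DAMON code: the class
name, method name and accounting rule are all hypothetical and chosen only for
illustration.

```python
class IntervalQuota:
    """Hypothetical time/size budget that refills every reset interval."""

    def __init__(self, time_ms, size_bytes, reset_interval_ms):
        self.time_ms = time_ms
        self.size_bytes = size_bytes
        self.reset_interval_ms = reset_interval_ms
        self.window_start_ms = 0
        self.used_ms = 0
        self.used_bytes = 0

    def try_charge(self, now_ms, spent_ms, nbytes):
        """Charge one operation; return False if it would exceed the budget."""
        # Start a fresh window once the reset interval has passed.
        if now_ms - self.window_start_ms >= self.reset_interval_ms:
            self.window_start_ms = now_ms
            self.used_ms = 0
            self.used_bytes = 0
        # Reject the operation if either limit would be exceeded.
        if (self.used_ms + spent_ms > self.time_ms
                or self.used_bytes + nbytes > self.size_bytes):
            return False
        self.used_ms += spent_ms
        self.used_bytes += nbytes
        return True
```

A caller would skip (or defer to the next window) any operation for which
try_charge() returns False, so the background work cannot use more than the
configured share of time and bandwidth.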

[...]
> >> > One drawback of physical address based scanning is that it's hard to
> >> > apply some workload specific policy.  For example, suppose a low
> >> > priority workload has many relatively hot pages, while a high priority
> >> > workload has many relatively warm (not so hot) pages.  We need to promote
> >> > the warm pages in the high priority workload, while physical address
> >> > based scanning may report the hot pages in the low priority workload.  Right?
> >> 
> >> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?
> >
> > Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue.
> >
> > For this case, assuming each workload has its own cgroup, users can add a DAMOS
> > scheme for promotion per workload.  The schemes will have different DAMOS
> > quotas based on the workloads' priority.  The schemes will also be controlled
> > to do the promotion for pages of the specific workloads using DAMOS filters.
> >
> > For example, the kdamond configuration below can be used.
> >
> > # damo args damon \
> > 	--damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \
> > 	--damos_filter reject none memcg /workloads/high-priority \
> > 	\
> > 	--damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \
> > 	--damos_filter reject none memcg /workloads/low-priority \
> > 	--damos_nr_filters 1 1 --out kdamond.json
> > # damo report damon --input_file ./kdamond.json --damon_params_omit_defaults
> > kdamond 0
> >     context 0
> >         ops: paddr
> >         target 0
> >             region [4,294,967,296, 68,577,918,975) (59.868 GiB)
> >         intervals: sample 5 ms, aggr 100 ms, update 1 s
> >         nr_regions: [10, 1,000]
> >         scheme 0
> >             action: migrate_hot to node 0 per aggr interval
> >             target access pattern
> >                 sz: [0 B, max]
> >                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
> >                 age: [0 ns, max]
> >             quotas
> >                 100 ms / 1024.000 MiB / 0 B per 1 s
> >                 priority: sz 0 %, nr_accesses 100 %, age 100 %
> >             filter 0
> >                 reject none memcg /workloads/high-priority
> >         scheme 1
> >             action: migrate_hot to node 0 per aggr interval
> >             target access pattern
> >                 sz: [0 B, max]
> >                 nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
> >                 age: [0 ns, max]
> >             quotas
> >                 10 ms / 100.000 MiB / 0 B per 1 s
> >                 priority: sz 0 %, nr_accesses 100 %, age 100 %
> >             filter 0
> >                 reject none memcg /workloads/low-priority
> >
> > Please note that this is just one example based on existing DAMOS features.
> > This may have drawbacks and future optimizations would be possible.
> 
> IIUC, this is something like,
> 
> physical address -> struct page -> cgroup -> per-cgroup hot threshold

You're right.
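
As a rough illustration of that chain, here is a minimal Python sketch.  It is
not DAMON code: the function name, the (pfn, cgroup, access_rate) tuples and the
4 KiB page size are assumptions made for the example.  Each cgroup gets its own
byte budget, and the hottest pages of that cgroup are picked until the budget
runs out, so the effective hot threshold ends up being per cgroup.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes; assumed base page size


def pick_promotions(pages, quota_bytes):
    """Pick pages to promote, per cgroup, hottest first.

    pages: iterable of (pfn, cgroup, access_rate) tuples (hypothetical input).
    quota_bytes: dict mapping cgroup name to its byte budget per interval.
    Returns a dict mapping cgroup name to a list of pfns.
    """
    by_cgroup = defaultdict(list)
    for pfn, cgroup, access_rate in pages:
        # Similar in spirit to a memcg filter: ignore pages of other cgroups.
        if cgroup in quota_bytes:
            by_cgroup[cgroup].append((access_rate, pfn))

    picked = {}
    for cgroup, candidates in by_cgroup.items():
        budget_pages = quota_bytes[cgroup] // PAGE_SIZE
        # Hottest pages first, cut off where this cgroup's budget ends.
        candidates.sort(key=lambda c: c[0], reverse=True)
        picked[cgroup] = [pfn for _, pfn in candidates[:budget_pages]]
    return picked
```

With a larger budget for the high priority cgroup, its warm pages are still
picked even when the low priority cgroup has hotter pages, because the two
cgroups never compete for the same budget.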

> 
> this sounds good to me.  Thanks!

Happy to hear that, and looking forward to continuing to improve it with
you! :)


Thanks,
SJ

> 
> ---
> Best Regards,
> Huang, Ying