linux-kernel - Re: [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target node

Message-ID: <20251003110453.00007ca6@huawei.com>
Date: Fri, 3 Oct 2025 11:04:53 +0100
From: Jonathan Cameron <jonathan.cameron@...wei.com>
To: Raghavendra K T <raghavendra.kt@....com>
CC: <AneeshKumar.KizhakeVeetil@....com>, <Michael.Day@....com>,
	<akpm@...ux-foundation.org>, <bharata@....com>, <dave.hansen@...el.com>,
	<david@...hat.com>, <dongjoo.linux.dev@...il.com>, <feng.tang@...el.com>,
	<gourry@...rry.net>, <hannes@...xchg.org>, <honggyu.kim@...com>,
	<hughd@...gle.com>, <jhubbard@...dia.com>, <jon.grimm@....com>,
	<k.shutemov@...il.com>, <kbusch@...a.com>, <kmanaouil.dev@...il.com>,
	<leesuyeon0506@...il.com>, <leillc@...gle.com>, <liam.howlett@...cle.com>,
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
	<mgorman@...hsingularity.net>, <mingo@...hat.com>, <nadav.amit@...il.com>,
	<nphamcs@...il.com>, <peterz@...radead.org>, <riel@...riel.com>,
	<rientjes@...gle.com>, <rppt@...nel.org>, <santosh.shukla@....com>,
	<shivankg@....com>, <shy828301@...il.com>, <sj@...nel.org>, <vbabka@...e.cz>,
	<weixugc@...gle.com>, <willy@...radead.org>, <ying.huang@...ux.alibaba.com>,
	<ziy@...dia.com>, <dave@...olabs.net>, <yuanchu@...gle.com>,
	<kinseyho@...gle.com>, <hdanton@...a.com>, <harry.yoo@...cle.com>
Subject: Re: [RFC PATCH V3 10/17] mm: Add a heuristic to calculate target
 node

On Thu, 14 Aug 2025 15:33:00 +0000
Raghavendra K T <raghavendra.kt@....com> wrote:

> One of the key challenges in PTE A bit based scanning is to find right
> target node to promote to.
> 
> Here is a simple heuristic based approach:
>  1. While scanning pages of any mm, also scan toptier pages that belong
> to that mm.
>  2. Accumulate the insight on the distribution of active pages on
> toptier nodes.
>  3. Walk all the top-tier nodes and pick the one with highest accesses.
> 
>  This method tries to consolidate application to a single node.
Nothing new in the following comment as we've discussed it before, but just
to keep everything together:

So for the pathological case of a task that has moved after its initial
allocations are done, this is effectively relying on conventional NUMA
balancing to ensure we don't keep promoting to the wrong node?

That makes me a little nervous.   I guess the proof of this will be
in mass testing though.  Maybe it works well enough - I have no idea yet!

A few comments inline

Jonathan


> 
> TBD: Create a list of preferred nodes for fallback when highest access
>  node is nearly full.
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@....com>

> +/* Per memory node information used to caclulate target_node for migration */

calculate

> +struct kscand_nodeinfo {
> +	unsigned long nr_scanned;
> +	unsigned long nr_accessed;
> +	int node;
> +	bool is_toptier;
> +};
> +
>  /*
>   * Data structure passed to control scanning and also collect
I'd drop "also". The "and" implies that already.
>   * per memory node information

Wrap closer to 80 chars.  Also missing full stop.

>   */
>  struct kscand_scanctrl {
>  	struct list_head scan_list;
> +	struct kscand_nodeinfo *nodeinfo[MAX_NUMNODES];
>  	unsigned long address;
> +	unsigned long nr_to_scan;
>  };
>  
>  struct kscand_scanctrl kscand_scanctrl;
> @@ -218,15 +229,129 @@ static void kmigrated_wait_work(void)
>  			migrate_sleep_jiffies);
>  }
>  
> -/*
> - * Do not know what info to pass in the future to make
> - * decision on taget node. Keep it void * now.

This is the wrong patch to be reviewing this comment in, but: "target"

> - */
> +static unsigned long get_slowtier_accesed(struct kscand_scanctrl *scanctrl)

accessed

> +{
> +	int node;
> +	unsigned long accessed = 0;
> +
> +	for_each_node_state(node, N_MEMORY) {
> +		if (!node_is_toptier(node) && scanctrl->nodeinfo[node])
> +			accessed += scanctrl->nodeinfo[node]->nr_accessed;
> +	}
> +	return accessed;
> +}
> +
> +static inline unsigned long get_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni)
> +{
> +	return ni->nr_accessed;
> +}
> +
> +static inline void set_nodeinfo_nr_accessed(struct kscand_nodeinfo *ni, unsigned long val)
> +{
> +	ni->nr_accessed = val;
> +}
> +
> +static inline unsigned long get_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
> +{
> +	return ni->nr_scanned;
> +}
> +
> +static inline void set_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni, unsigned long val)
> +{
> +	ni->nr_scanned = val;
> +}

These helpers seem unnecessary given that they are static, so we have full
visibility of the structure wherever they are called anyway.

Perhaps they get more complex in later patches though (in which case ignore
this comment!)

> +
> +static inline void reset_nodeinfo_nr_scanned(struct kscand_nodeinfo *ni)
> +{
> +	set_nodeinfo_nr_scanned(ni, 0);
> +}
> +
> +static inline void reset_nodeinfo(struct kscand_nodeinfo *ni)
> +{
> +	set_nodeinfo_nr_scanned(ni, 0);
> +	set_nodeinfo_nr_accessed(ni, 0);
> +}
> +
> +static void init_one_nodeinfo(struct kscand_nodeinfo *ni, int node)
> +{
> +	ni->nr_scanned = 0;
> +	ni->nr_accessed = 0;
> +	ni->node = node;
> +	ni->is_toptier = node_is_toptier(node) ? true : false;
	ni->is_toptier = node_is_toptier(node);

> +}
> +
> +static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
> +{
> +	struct kscand_nodeinfo *ni;
> +
> +	ni = kzalloc(sizeof(*ni), GFP_KERNEL);
> +
> +	if (!ni)
> +		return NULL;
> +
> +	init_one_nodeinfo(ni, node);
As this is only done in one place, I'd just do it inline:
	*ni = (struct kscand_nodeinfo) {
		.node = node,
		.is_toptier = node_is_toptier(node),
	};

Can set the zeros if you think that acts as useful documentation.
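
To illustrate why the explicit zeros are optional, here is a minimal userspace
sketch of the compound-literal assignment. It is not the kernel code:
fake_node_is_toptier() is a hypothetical stand-in for node_is_toptier(), and
malloc() is used instead of kzalloc() on purpose, so that any zeroing seen in
the result comes from the initializer alone.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Userspace mirror of the struct from the quoted patch. */
struct kscand_nodeinfo {
	unsigned long nr_scanned;
	unsigned long nr_accessed;
	int node;
	bool is_toptier;
};

/* Hypothetical stand-in for node_is_toptier(): assume nodes 0 and 1 are top tier. */
static bool fake_node_is_toptier(int node)
{
	return node < 2;
}

static struct kscand_nodeinfo *alloc_one_nodeinfo(int node)
{
	/* malloc() does not zero memory, unlike kzalloc(). */
	struct kscand_nodeinfo *ni = malloc(sizeof(*ni));

	if (!ni)
		return NULL;

	/* Members not named in the initializer (the two counters) are set to zero. */
	*ni = (struct kscand_nodeinfo) {
		.node = node,
		.is_toptier = fake_node_is_toptier(node),
	};

	return ni;
}
```

With kzalloc() the counters are already zero before the assignment, so in the
kernel version the initializer only needs to carry the two non-zero fields.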
> +
> +	return ni;
> +}
> +
> +/* TBD: Handle errors */
> +static void init_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	struct kscand_nodeinfo *ni;
Trivial: I'd move this into the for_each_node scope.


> +	int node;
> +
> +	for_each_node(node) {
i.e.
		struct kscand_nodeinfo *ni = alloc_one_nodeinfo(node);

> +		ni = alloc_one_nodeinfo(node);

If this isn't going to get a lot more complex, I'd squash the alloc_one_nodeinfo()
code in here and drop the helper. Up to you though as this is a trade off in
levels of modularity vs compact code.

> +		if (!ni)
> +			WARN_ON_ONCE(ni);
> +		scanctrl->nodeinfo[node] = ni;
> +	}
> +}
> +
> +static void reset_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	int node;
> +
> +	for_each_node_state(node, N_MEMORY)
> +		reset_nodeinfo(scanctrl->nodeinfo[node]);
> +
> +	/* XXX: Not rellay required? */
> +	scanctrl->nr_to_scan = kscand_scan_size;
> +}
> +
> +static void free_scanctrl(struct kscand_scanctrl *scanctrl)
> +{
> +	int node;
> +
> +	for_each_node(node)
> +		kfree(scanctrl->nodeinfo[node]);
> +}
> +
>  static int kscand_get_target_node(void *data)
>  {
>  	return kscand_target_node;
>  }
>  
> +static int get_target_node(struct kscand_scanctrl *scanctrl)
> +{
> +	int node, target_node = NUMA_NO_NODE;
> +	unsigned long prev = 0;
> +
> +	for_each_node(node) {
> +		if (node_is_toptier(node) && scanctrl->nodeinfo[node]) {

Probably flip sense of one or more of the if statements just to reduce indent.

		if (!node_is_toptier(node) || !scanctrl->nodeinfo[node])
			continue;

etc.


> +			/* This creates a fallback migration node list */
> +			if (get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]) > prev) {
> +				prev = get_nodeinfo_nr_accessed(scanctrl->nodeinfo[node]);

Maybe a local variable given use in check and here.

> +				target_node = node;
> +			}
> +		}
> +	}
> +	if (target_node == NUMA_NO_NODE)
> +		target_node = kscand_get_target_node(NULL);
> +
> +	return target_node;
> +}
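
Putting the two suggestions for get_target_node() together (early continue
plus a local variable), the selection loop might look roughly like the
userspace sketch below. The names here (struct nodeinfo, struct scanctrl,
MAX_NODES, NO_NODE, pick_target_node(), the fallback parameter) are
illustrative stand-ins, not the patch's identifiers, and the sketch reads the
cached is_toptier flag where the patch calls node_is_toptier(node).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES	4	/* assumption: tiny stand-in for MAX_NUMNODES */
#define NO_NODE		(-1)	/* assumption: stand-in for NUMA_NO_NODE */

struct nodeinfo {
	unsigned long nr_accessed;
	bool is_toptier;
};

struct scanctrl {
	struct nodeinfo *nodeinfo[MAX_NODES];
};

/* Return the top-tier node with the most accesses, or fallback if none has any. */
static int pick_target_node(const struct scanctrl *sc, int fallback)
{
	int target_node = NO_NODE;
	unsigned long best = 0;

	for (int node = 0; node < MAX_NODES; node++) {
		const struct nodeinfo *ni = sc->nodeinfo[node];

		/* Flipped condition: skip early instead of nesting. */
		if (!ni || !ni->is_toptier)
			continue;

		if (ni->nr_accessed > best) {
			best = ni->nr_accessed;
			target_node = node;
		}
	}

	return target_node == NO_NODE ? fallback : target_node;
}
```

As in the quoted code, a node only wins with a strictly greater count, so ties
go to the lowest-numbered node and an all-zero scan falls through to the
fallback.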
> +
>  extern bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  					unsigned long nr_migrate_pages);
>  
> @@ -495,6 +620,14 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
>  	page_idle_clear_pte_refs(page, pte, walk);
>  	srcnid = folio_nid(folio);
>  
> +	scanctrl->nodeinfo[srcnid]->nr_scanned++;
> +	if (scanctrl->nr_to_scan)
> +		scanctrl->nr_to_scan--;
> +
> +	if (!scanctrl->nr_to_scan) {
> +		folio_put(folio);
> +		return 1;
> +	}
>  
>  	if (!folio_test_lru(folio)) {
>  		folio_put(folio);
> @@ -502,13 +635,17 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
>  	}
>  
>  	if (!kscand_eligible_srcnid(srcnid)) {
> +		if (folio_test_young(folio) || folio_test_referenced(folio)
> +				|| pte_young(pteval)) {
Unusual wrap position.  I'd move the || to the line above and align
pte_young() after the ( on the line above.
> +			scanctrl->nodeinfo[srcnid]->nr_accessed++;
> +		}
>  		folio_put(folio);

> +	/* Either Scan 25% of scan_size or cover vma size of scan_size */
> +	kscand_scanctrl.nr_to_scan =	mm_slot_scan_size >> PAGE_SHIFT;

Trivial but I'm not sure what you are forcing alignment for here.  I'd stick
to one space after =

> +	/* Reduce actual amount of pages scanned */
> +	kscand_scanctrl.nr_to_scan =	mm_slot_scan_size >> 1;

If my eyes aren't tricking me, this sets the value and then immediately
replaces it with something else. Is that the intent?

> +
> +	/* XXX: skip scanning to avoid duplicates until all migrations done? */
>  	kmigrated_mm_slot = kmigrated_get_mm_slot(mm, false);
>  
>  	for_each_vma(vmi, vma) {
>  		kscand_walk_page_vma(vma, &kscand_scanctrl);
>  		vma_scanned_size += vma->vm_end - vma->vm_start;
>  
> -		if (vma_scanned_size >= kscand_scan_size) {
> +		if (vma_scanned_size >= mm_slot_scan_size ||
> +					!kscand_scanctrl.nr_to_scan) {
>  			next_mm = true;
>  
>  			if (!list_empty(&kscand_scanctrl.scan_list)) {
>  				if (!kmigrated_mm_slot)
>  					kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
> +				/* Add scanned folios to migration list */
>  				spin_lock(&kmigrated_mm_slot->migrate_lock);
> +
>  				list_splice_tail_init(&kscand_scanctrl.scan_list,
>  						&kmigrated_mm_slot->migrate_head);
>  				spin_unlock(&kmigrated_mm_slot->migrate_lock);
> +				break;
>  			}
> -			break;
> +		}
> +		if (!list_empty(&kscand_scanctrl.scan_list)) {
> +			if (!kmigrated_mm_slot)
> +				kmigrated_mm_slot = kmigrated_get_mm_slot(mm, true);
> +			spin_lock(&kmigrated_mm_slot->migrate_lock);

Use of guard() in these might be a useful readability improvement.

> +			list_splice_tail_init(&kscand_scanctrl.scan_list,
> +					&kmigrated_mm_slot->migrate_head);
> +			spin_unlock(&kmigrated_mm_slot->migrate_lock);

This code block is identical to the one just above and that breaks out to run this.
Do we need them both?  Or is there some subtle difference my eyes are jumping over?
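
On the guard() suggestion: in the kernel I'd expect this to be something like
guard(spinlock)(&kmigrated_mm_slot->migrate_lock); with the explicit
spin_unlock() dropped. For anyone unfamiliar with the mechanism, below is a
userspace sketch of the scope-based cleanup that guard() builds on, using the
GCC/Clang cleanup attribute. Everything in it (struct toy_lock, TOY_GUARD,
splice_under_lock(), the integer "lists") is a made-up stand-in for
illustration, not a kernel API.

```c
/* Toy lock: assumption, stands in for spinlock_t in this userspace sketch. */
struct toy_lock {
	int locked;
	int acquisitions;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->locked = 1;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->locked = 0;
}

/* Take the lock and hand back the pointer the guard variable will hold. */
static struct toy_lock *toy_guard_enter(struct toy_lock *l)
{
	toy_lock_acquire(l);
	return l;
}

/* Runs automatically when the guard variable goes out of scope. */
static void toy_guard_exit(struct toy_lock **l)
{
	toy_lock_release(*l);
}

#define TOY_GUARD(lock) \
	struct toy_lock *toy_guard_var __attribute__((cleanup(toy_guard_exit))) = \
		toy_guard_enter(lock)

/* Move everything from *src to *dst under the lock; returns 1 if anything moved. */
static int splice_under_lock(struct toy_lock *l, int *dst, int *src)
{
	if (!*src)
		return 0;	/* nothing to move, lock never taken */

	TOY_GUARD(l);
	*dst += *src;
	*src = 0;
	return 1;		/* lock released here by the cleanup handler */
}
```

The point is that every exit path after the guard releases the lock without
an explicit unlock call, which is what makes early returns and breaks easier
to read.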


>  		}
>  	}
>