linux-kernel - Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to CPU-less node


Message-ID: <877dakti0n.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date:   Fri, 28 Jan 2022 15:51:36 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org,
        Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...riel.com>
Subject: Re: [RFC PATCH 2/2] NUMA balancing: avoid to migrate task to
 CPU-less node

Srikar Dronamraju <srikar@...ux.vnet.ibm.com> writes:

> * Huang Ying <ying.huang@...el.com> [2022-01-28 10:38:42]:
>
>> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
>> nodes.  But if the number of hint page faults on a PMEM node is
>> the max for a task, the current NUMA balancing policy may try to place
>> the task on the PMEM node instead of a DRAM node.  This is unreasonable,
>> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
>> nodes are ignored when searching the migration target node for a task
>> in this patch.
>> 
>> To test the patch, we run a workload that accesses more memory on the
>> PMEM node than on the DRAM node.  Without the patch, the PMEM node is
>> chosen as the preferred node in task_numa_placement().  With the patch,
>> the DRAM node is chosen instead.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Mel Gorman <mgorman@...e.de>
>> Cc: Rik van Riel <riel@...riel.com>
>> Cc: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
>> ---
>>  kernel/sched/fair.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 54e1aad1c5d7..e462ac5c1e48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2393,6 +2393,10 @@ static void task_numa_placement(struct task_struct *p)
>>  			}
>>  		}
>> 
>> +		/* Cannot migrate task to CPU-less node */
>> +		if (!node_state(nid, N_CPU))
>> +			continue;
>> +
>
> Let's take the example that you quoted: a 2-socket machine with 1 DRAM node
> and 1 PMEM node per socket.  Now let's say the task is placed on a CPU in
> node 1 but most of its memory faults are coming from node 2, which is the
> PMEM node attached to node 0.  Without the hunk, there is a chance that the
> task gets moved to node 0.  With the change, however, are we inhibiting
> such a move?

This sounds reasonable.  How about the following solution?  If a
CPU-less node is selected as the migration target, we select the nearest
node with CPUs instead.  That is, something like the patch below.

Best Regards,
Huang, Ying

------------------------------8<---------------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..52d926d8cbdb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2401,6 +2401,23 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_online_node(nid) {
+			if (!node_state(nid, N_CPU))
+				continue;
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
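
For illustration only, the fallback in the hunk above can be sketched as a
small userspace function.  This is not kernel code: the node_has_cpu[] and
node_dist[] tables are hypothetical stand-ins for node_state(nid, N_CPU)
and node_distance(), and the distance values are made up to match the
2-socket example discussed above (nodes 0 and 1 are DRAM with CPUs, node 2
is the PMEM node attached to node 0, node 3 is the PMEM node attached to
node 1).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical topology: only the DRAM nodes 0 and 1 have CPUs. */
static const bool node_has_cpu[NR_NODES] = { true, true, false, false };

/* Made-up SLIT-style distances; node_dist[a][b] is the distance a -> b. */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/*
 * Return max_nid if it has CPUs, otherwise the nearest node that has
 * CPUs.  If no node has CPUs, max_nid is returned unchanged.
 */
static int nearest_cpu_node(int max_nid)
{
	int near_nid = max_nid;
	int near_distance = INT_MAX;
	int nid;

	/* The sketch assumes a valid node id is passed in. */
	assert(max_nid >= 0 && max_nid < NR_NODES);

	if (node_has_cpu[max_nid])
		return max_nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!node_has_cpu[nid])
			continue;
		if (node_dist[max_nid][nid] < near_distance) {
			near_nid = nid;
			near_distance = node_dist[max_nid][nid];
		}
	}

	return near_nid;
}
```

With these assumed tables, a task whose faults are mostly on node 2 gets
node 0 as its preferred node, and one whose faults are mostly on node 3
gets node 1, so the move to the socket that owns the PMEM is not inhibited.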