Message-Id: <eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com>
Date: Fri, 23 May 2025 20:51:01 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: peterz@...radead.org,
	akpm@...ux-foundation.org
Cc: mkoutny@...e.com,
	mingo@...hat.com,
	tj@...nel.org,
	hannes@...xchg.org,
	corbet@....net,
	mgorman@...e.de,
	mhocko@...nel.org,
	muchun.song@...ux.dev,
	roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev,
	tim.c.chen@...el.com,
	aubrey.li@...el.com,
	libo.chen@...cle.com,
	kprateek.nayak@....com,
	vineethr@...ux.ibm.com,
	venkat88@...ux.ibm.com,
	ayushjai@....com,
	cgroups@...r.kernel.org,
	linux-doc@...r.kernel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	yu.chen.surf@...mail.com,
	Ayush Jain <Ayush.jain3@....com>,
	Chen Yu <yu.c.chen@...el.com>
Subject: [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads

From: Libo Chen <libo.chen@...cle.com>

Task swapping is triggered when there are no idle CPUs in
task A's preferred node. In this case, the NUMA load balancer
chooses a task B on A's preferred node and swaps B with A. This
helps improve NUMA locality without introducing load imbalance
between nodes. In the current implementation, B is not required
to have a NUMA node preference of its own, so a kernel thread
might be incorrectly chosen as B. However, kernel threads and
user space threads that do not have an mm are not supposed to be
covered by NUMA balancing, because NUMA balancing only considers
user pages via VMAs.

Following Peter's suggestion for fixing this issue, use
PF_KTHREAD to skip kernel threads. cur->mm is also checked
because user_mode_thread() might create a user thread without
an mm. As per Prateek's analysis, after adding the PF_KTHREAD
check, there is no need to further check the PF_IDLE flag:
"
- play_idle_precise() already ensures PF_KTHREAD is set before adding
  PF_IDLE

- cpu_startup_entry() is only called from the startup thread which
  should be marked with PF_KTHREAD (based on my understanding looking at
  commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
  setup"))
"

In summary, the check in task_numa_compare() now aligns with
task_tick_numa().
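
For illustration only, the new filter can be modeled in user space
roughly as below. This is a sketch, not kernel code: the struct
layouts are mock stand-ins, the flag values are copied here only
for the example (the authoritative definitions are in
include/linux/sched.h), and numa_swap_candidate() is a hypothetical
helper name that does not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; see include/linux/sched.h for the real ones. */
#define PF_IDLE     0x00000002u
#define PF_EXITING  0x00000004u
#define PF_KTHREAD  0x00200000u

/* Mock stand-ins for the kernel structures (assumption, not the real layout). */
struct mm_struct { int unused; };

struct task_struct {
	unsigned int flags;
	struct mm_struct *mm;
};

/*
 * Hypothetical helper mirroring the condition in the patch:
 * a task may be considered as swap target B only if it is not
 * exiting, is not a kernel thread, and has an mm.
 */
static inline bool numa_swap_candidate(const struct task_struct *cur)
{
	if (!cur)
		return false;
	if (cur->flags & (PF_EXITING | PF_KTHREAD))
		return false;
	if (!cur->mm)
		return false;
	return true;
}
```

Under this model an idle task is rejected without testing PF_IDLE,
since it also carries PF_KTHREAD, and a user thread created without
an mm is rejected by the mm check alone.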

Suggested-by: Michal Koutny <mkoutny@...e.com>
Tested-by: Ayush Jain <Ayush.jain3@....com>
Signed-off-by: Libo Chen <libo.chen@...cle.com>
Tested-by: Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
v4->v5:
Add PF_KTHREAD check, and remove PF_IDLE check.
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..03d9a49a68b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2273,7 +2273,8 @@ static bool task_numa_compare(struct task_numa_env *env,
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
-	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+	if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) ||
+		    !cur->mm))
 		cur = NULL;
 
 	/*
-- 
2.25.1

