Message-Id: <eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com>
Date: Fri, 23 May 2025 20:51:01 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: peterz@...radead.org,
akpm@...ux-foundation.org
Cc: mkoutny@...e.com,
mingo@...hat.com,
tj@...nel.org,
hannes@...xchg.org,
corbet@....net,
mgorman@...e.de,
mhocko@...nel.org,
muchun.song@...ux.dev,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
tim.c.chen@...el.com,
aubrey.li@...el.com,
libo.chen@...cle.com,
kprateek.nayak@....com,
vineethr@...ux.ibm.com,
venkat88@...ux.ibm.com,
ayushjai@....com,
cgroups@...r.kernel.org,
linux-doc@...r.kernel.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
yu.chen.surf@...mail.com,
Ayush Jain <Ayush.jain3@....com>,
Chen Yu <yu.c.chen@...el.com>
Subject: [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads
From: Libo Chen <libo.chen@...cle.com>
Task swapping is triggered when there are no idle CPUs in
task A's preferred node. In this case, the NUMA load balancer
chooses a task B on A's preferred node and swaps B with A. This
helps improve NUMA locality without introducing load imbalance
between nodes. In the current implementation, B is not required
to have a NUMA node preference of its own, so a kernel thread
might be incorrectly chosen as B. However, kernel threads and
user-space threads without an mm are not supposed to be
covered by NUMA balancing, because NUMA balancing only considers
user pages via VMAs.
Following Peter's suggestion for fixing this issue, we use
PF_KTHREAD to skip kernel threads. cur->mm is also checked
because user_mode_thread() might create a user thread without
an mm. As per Prateek's analysis, once the PF_KTHREAD check is
added, there is no need to check the PF_IDLE flag as well:
"
- play_idle_precise() already ensures PF_KTHREAD is set before adding
PF_IDLE
- cpu_startup_entry() is only called from the startup thread which
should be marked with PF_KTHREAD (based on my understanding looking at
commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
setup"))
"
In summary, the check in task_numa_compare() now aligns with
task_tick_numa().
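The resulting filter can be modelled outside the kernel as a small
predicate. This is only a user-space sketch of the condition in the
diff below, not kernel code: struct task_model, struct mm_model and
numa_swap_candidate() are hypothetical names made up for illustration,
and the two flag values are assumed to match include/linux/sched.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror the kernel's flag values; check include/linux/sched.h. */
#define PF_EXITING 0x00000004u
#define PF_KTHREAD 0x00200000u

/* Hypothetical stand-in for struct mm_struct. */
struct mm_model {
	int unused;
};

/* Hypothetical stand-in for the two task_struct fields the check reads. */
struct task_model {
	unsigned int flags;
	struct mm_model *mm;
};

/*
 * Return true if cur may be considered as swap partner B.
 * Exiting tasks, kernel threads, and tasks without an mm are
 * rejected, which corresponds to "cur = NULL" in the patch.
 */
static bool numa_swap_candidate(const struct task_model *cur)
{
	if (!cur)
		return false;
	if (cur->flags & (PF_EXITING | PF_KTHREAD))
		return false;
	if (!cur->mm)
		return false;
	return true;
}
```

In this model an ordinary user task with an mm is accepted, while a
task with PF_KTHREAD set, a task with PF_EXITING set, or a user thread
whose mm is NULL is rejected.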
Suggested-by: Michal Koutny <mkoutny@...e.com>
Tested-by: Ayush Jain <Ayush.jain3@....com>
Signed-off-by: Libo Chen <libo.chen@...cle.com>
Tested-by: Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
v4->v5:
Add PF_KTHREAD check, and remove PF_IDLE check.
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..03d9a49a68b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2273,7 +2273,8 @@ static bool task_numa_compare(struct task_numa_env *env,
rcu_read_lock();
cur = rcu_dereference(dst_rq->curr);
- if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+ if (cur && ((cur->flags & (PF_EXITING | PF_KTHREAD)) ||
+ !cur->mm))
cur = NULL;
/*
--
2.25.1