Message-ID: <20241226053441.1110-1-kprateek.nayak@amd.com>
Date: Thu, 26 Dec 2024 05:34:41 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Johannes Weiner <hannes@...xchg.org>, Suren Baghdasaryan
<surenb@...gle.com>, Ingo Molnar <mingo@...hat.com>, Peter Zijlstra
<peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, Chengming Zhou
<zhouchengming@...edance.com>, Muchun Song <muchun.song@...ux.dev>, "Gautham
R. Shenoy" <gautham.shenoy@....com>, K Prateek Nayak <kprateek.nayak@....com>
Subject: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags

When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the
following sequence was observed:

[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
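
For reference, the splat is emitted by the state-consistency check in
psi_flags_change() (quoted below in abridged form from
kernel/sched/psi.c): it fires when a flag that is being set is already
present, or a flag that is being cleared is absent. In the splat above,
TSK_RUNNING (set=4) is being set while psi_flags=14 already contains it:

    static void psi_flags_change(struct task_struct *task, int clear, int set)
    {
            /* Setting an already-set flag, or clearing one that is not set? */
            if (((task->psi_flags & set) ||
                 (task->psi_flags & clear) != clear) &&
                !psi_bug) {
                    /* ... "psi: inconsistent task state!" splat ... */
                    psi_bug = 1;
            }

            task->psi_flags &= ~clear;
            task->psi_flags |= set;
    }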

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():

    __schedule()
      rq_lock(rq)
      try_to_block_task(p)
        psi_dequeue()
        [ psi_task_switch() is responsible
          for adjusting the PSI flags ]
      put_prev_entity(&p->se)          try_to_wake_up(p)
      # no runnable task on rq->cfs    ...
      sched_balance_newidle()
        raw_spin_rq_unlock(rq)           __task_rq_lock(p)
        ...                              psi_enqueue()/psi_ttwu_dequeue() [Woops!]
                                         __task_rq_unlock(p)
        raw_spin_rq_lock(rq)
        ...
        [ p was re-enqueued or has migrated away ]
        ...
      psi_task_switch() [Too late!]
      raw_spin_rq_unlock(rq)

The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked. Similarly, a migration
context in the wakeup path can clear the flags that psi_sched_switch()
assumes will be set (TSK_ONCPU / TSK_RUNNING).

Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
prevent the race. Specifically:

o psi_enqueue() will clear the TSK_ONCPU flag when it finds it set.
  psi_enqueue() will only be called with TSK_ONCPU set when the task is
  being requeued on the same CPU. If the task was migrated,
  psi_ttwu_dequeue() would have already cleared the PSI flags.
  psi_enqueue() cannot guarantee that this same task will be picked
  again when the scheduling CPU returns from newidle balance, which is
  why it clears TSK_ONCPU to mimic the net result of a sleep + wakeup
  without migration.

o When psi_sched_switch() observes that prev's task_cpu() has changed or
  the TSK_ONCPU flag is not set, a wakeup has raced with
  psi_sched_switch() trying to adjust the dequeue flags. If next is the
  same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
  again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
  already adjusted the PSI flags and no further changes are required to
  prev's PSI flags.
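
In short (a condensed sketch of the checks described above; the hunks
below are authoritative and additionally handle the common-ancestor
search and the sleep accounting for prev):

    /* psi_task_switch(): called with this CPU's rq lock held */
    if (prev == next && (prev->psi_flags & TSK_ONCPU))
            return;         /* no wakeup raced with psi_dequeue() */

    /* ... set TSK_ONCPU for next, skip common ancestor if prev migrated ... */

    if (prev == next || task_cpu(prev) != smp_processor_id() ||
        !(prev->psi_flags & TSK_ONCPU))
            return;         /* the racing wakeup already fixed up prev's flags */

    /* ... clear TSK_ONCPU (and TSK_RUNNING on sleep) for prev as before ... */

    /* psi_enqueue(): called with the task's rq lock held */
    if (p->psi_flags & TSK_ONCPU)
            clear = TSK_ONCPU;      /* mimic sleep + wakeup without migration */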

With the introduction of DELAY_DEQUEUE, the requeue path is considerably
shortened, and with the addition of bandwidth throttling in the
__schedule() path, the race window is large enough to observe this
issue.

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

    mkdir /sys/fs/cgroup/test
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    # Ridiculous limit on SMP to throttle multiple rqs at once
    echo "50000 100000" > /sys/fs/cgroup/test/cpu.max

    perf bench sched messaging -t -p -l 100000 -g 16

This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T)
and also on a 32 vCPU VM.
---
kernel/sched/core.c | 7 ++++-
kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/stats.h | 16 ++++++++++-
3 files changed, 83 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
rq->last_seen_need_resched_ns = 0;
#endif
+ /*
+ * PSI might have to deal with the consequences of newidle balance
+ * possibly dropping the rq lock and prev being requeued and selected.
+ */
+ psi_sched_switch(prev, next, block);
+
if (likely(prev != next)) {
rq->nr_switches++;
/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep)
{
struct psi_group *group, *common = NULL;
- int cpu = task_cpu(prev);
+ int prev_cpu, cpu;
+
+ /* No race between psi_dequeue() and now */
+ if (prev == next && (prev->psi_flags & TSK_ONCPU))
+ return;
+
+ prev_cpu = task_cpu(prev);
+ cpu = smp_processor_id();
if (next->pid) {
+ /*
+ * If next == prev but TSK_ONCPU is cleared, the task was
+ * requeued when newidle balance dropped the rq lock and
+ * psi_enqueue() cleared the TSK_ONCPU flag.
+ */
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
group = task_psi_group(next);
do {
- if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
- PSI_ONCPU) {
+ /*
+ * Since newidle balance can drop the rq lock (see the next comment),
+ * there is a possibility of try_to_wake_up() migrating prev away before
+ * reaching here. Skip the common-ancestor search if prev has migrated.
+ */
+ if (prev_cpu == cpu &&
+ (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
common = group;
break;
}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
} while ((group = group->parent));
}
+ /*
+ * When a task is blocked, psi_dequeue() leaves the PSI flag adjustments
+ * to psi_task_switch(); however, there is a possibility of the rq lock
+ * being dropped in the interim and the task being woken up again before
+ * psi_task_switch() is called, leading to psi_enqueue() seeing the flags
+ * for a running task. Specifically, the following
+ * scenario is possible:
+ *
+ *    __schedule()
+ *      rq_lock(rq)
+ *      try_to_block_task(p)
+ *        psi_dequeue()
+ *        [ psi_task_switch() is responsible
+ *          for adjusting the PSI flags ]
+ *      put_prev_entity(&p->se)          try_to_wake_up(p)
+ *      # no runnable task on rq->cfs    ...
+ *      sched_balance_newidle()
+ *        raw_spin_rq_unlock(rq)           __task_rq_lock(p)
+ *        ...                              psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+ *                                         __task_rq_unlock(p)
+ *        raw_spin_rq_lock(rq)
+ *        ...
+ *        [ p was re-enqueued or has migrated away ]
+ *        ...
+ *      psi_task_switch() [Too late!]
+ *      raw_spin_rq_unlock(rq)
+ *
+ * In the above case, psi_enqueue() can see the p->psi_flags state
+ * before it is adjusted to account for dequeue in psi_task_switch(),
+ * or psi_ttwu_dequeue() can clear the p->psi_flags which
+ * psi_task_switch() tries to adjust assuming that the entity has just
+ * finished running.
+ *
+ * Since TSK_ONCPU has to be adjusted holding the task CPU's rq lock,
+ * use the combination of TSK_ONCPU and task_cpu(p) to catch the race
+ * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
+ * Since psi_enqueue() / psi_ttwu_dequeue() would have already set the
+ * correct flags for prev on this CPU, skip adjusting the flags.
+ */
+ if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+ return;
+
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
if (flags & ENQUEUE_RESTORE)
return;
- if (p->se.sched_delayed) {
+ if (p->psi_flags & TSK_ONCPU) {
+ /*
+ * psi_enqueue() can race with psi_task_switch() where
+ * TSK_ONCPU will still be set for the task (see the
+ * comment in psi_task_switch()).
+ *
+ * Reaching here with TSK_ONCPU is only possible when
+ * the task is being enqueued on the same CPU. Since
+ * psi_task_switch() has not had the chance to adjust
+ * the flags yet, just clear TSK_ONCPU, which yields
+ * the same result as sleep + wakeup without migration.
+ */
+ SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+ clear = TSK_ONCPU;
+ } else if (p->se.sched_delayed) {
/* CPU migration of "sleeping" task */
SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
if (p->in_memstall)
base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
--
2.34.1