Message-ID: <20df37b9-c653-49d6-83e7-da4f21d5b848@linux.dev>
Date: Thu, 26 Dec 2024 18:43:05 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: K Prateek Nayak <kprateek.nayak@....com>,
Johannes Weiner <hannes@...xchg.org>, Suren Baghdasaryan
<surenb@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Chengming Zhou <zhouchengming@...edance.com>,
Muchun Song <muchun.song@...ux.dev>,
"Gautham R. Shenoy" <gautham.shenoy@....com>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [PATCH] psi: Fix race when task wakes up before
psi_sched_switch() adjusts flags
Hi,
On 2024/12/26 13:34, K Prateek Nayak wrote:
> When running hackbench in a cgroup with bandwidth throttling enabled, the
> following PSI splat was observed:
>
> psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>
> When investigating the series of events leading up to the splat, the
> following sequence was observed:
> [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
> ...
> [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
> # CPU8 goes into newidle balance and releases the rq lock
> ...
> # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
> [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
I have a question here: why is TSK_ONCPU not set in psi_flags if the
task hasn't arrived at psi_sched_switch() yet?
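(For reference, decoding the splat with the TSK_* bits from
include/linux/psi_types.h; assuming the usual bit order there:

	TSK_IOWAIT   = 1 << 0	/* 0x1 */
	TSK_MEMSTALL = 1 << 1	/* 0x2 */
	TSK_RUNNING  = 1 << 2	/* 0x4 */
	TSK_ONCPU    = 1 << 3	/* 0x8 */

psi_flags=14 would then be TSK_MEMSTALL | TSK_RUNNING | TSK_ONCPU, and
set=4 is TSK_RUNNING.)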
> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>
> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
> for the blocked entity. However, the following race is possible with
> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
> psi_sched_switch():
Yeah, this race is introduced by the delayed dequeue changes.
In the past, a sleeping task couldn't be migrated or enqueued before it
was done in __schedule(). (finish_task(prev) clears prev->on_cpu.)
Now, ttwu_runnable() can call enqueue_task() on a delayed-dequeue task
to make it schedulable again.
But migration is still impossible, since the task is still running on
this cpu, so no psi_ttwu_dequeue() can happen, only psi_enqueue(), right?
(Actually, we could enqueue_task() there for any sleeping task, not just
delayed-dequeue ones, whenever select_task_rq() returns the same cpu as
task_cpu(p), to optimize wakeup latency; maybe I'll submit a patch for
that later.)
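To make the window concrete, the wakeup side now looks roughly like
this (a sketch abridged from ttwu_runnable(); the preemption check and
stats handling are elided, so don't take it as the exact mainline code):

	static int ttwu_runnable(struct task_struct *p, int wake_flags)
	{
		struct rq_flags rf;
		struct rq *rq;
		int ret = 0;

		rq = __task_rq_lock(p, &rf);
		if (task_on_rq_queued(p)) {
			update_rq_clock(rq);
			/*
			 * A delayed-dequeue task is re-enqueued right here,
			 * possibly while its CPU sits in newidle balance
			 * with the rq lock dropped, so psi_enqueue() can
			 * run before psi_sched_switch() fixes up the flags.
			 */
			if (p->se.sched_delayed)
				enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
			ttwu_do_wakeup(p);
			ret = 1;
		}
		__task_rq_unlock(rq, &rf);

		return ret;
	}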
>
> __schedule()
> rq_lock(rq)
> try_to_block_task(p)
> psi_dequeue()
> [ psi_task_switch() is responsible
> for adjusting the PSI flags ]
> put_prev_entity(&p->se) try_to_wake_up(p)
> # no runnable task on rq->cfs ...
> sched_balance_newidle()
> raw_spin_rq_unlock(rq) __task_rq_lock(p)
> ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> __task_rq_unlock(p)
> raw_spin_rq_lock(rq)
> ...
> [ p was re-enqueued or has migrated away ]
Here ttwu_runnable() calls enqueue_task() for the delayed-dequeue task;
migration can't happen since p->on_cpu is still true.
> ...
> psi_task_switch() [Too late!]
> raw_spin_rq_unlock(rq)
>
> The wakeup context will see the flags for a running task when the flags
> should have reflected the task being blocked. Similarly, a migration
> context in the wakeup path can clear the flags that psi_sched_switch()
> assumes will be set (TSK_ONCPU / TSK_RUNNING).
In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
should do nothing at all.
Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
the PSI POV this task hasn't gone to sleep at all, and psi_enqueue()
should NOT change any state either. (It's not a wakeup or a migration
from the PSI POV.)
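Something like this untested sketch in kernel/sched/stats.h is what I'd
expect (detecting the deferred dequeue via TSK_ONCPU still being set,
per the reasoning above):

	static inline void psi_enqueue(struct task_struct *p, int flags)
	{
		...
		/*
		 * psi_dequeue() deferred its work to psi_sched_switch():
		 * from the PSI POV this task never went to sleep, so this
		 * is neither a wakeup nor a migration; change nothing.
		 */
		if (p->psi_flags & TSK_ONCPU)
			return;
		...
	}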
And the current code, "psi_sched_switch(prev, next, block);", looks
buggy to me too! The "block" value comes from try_to_block_task(), and
pick_next_task() may then drop and regain the rq lock, so we can't pass
that stale value to psi_sched_switch().
Before, we used "task_on_rq_queued(prev)"; now we also have to consider
the delayed dequeue case, so it should be:
"!task_on_rq_queued(prev) || prev->se.sched_delayed"
Thanks!
>
> Since the TSK_ONCPU flag has to be modified with the rq lock of
> task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
> prevent the race. Specifically:
>
> o psi_enqueue() will clear the TSK_ONCPU flag when it finds it set.
> psi_enqueue() will only be called with TSK_ONCPU set when the task is
> being requeued on the same CPU. If the task was migrated,
> psi_ttwu_dequeue() would have already cleared the PSI flags.
>
> psi_enqueue() cannot guarantee that this same task will be picked
> again when the scheduling CPU returns from newidle balance, which is
> why it clears TSK_ONCPU to mimic the net result of sleep + wakeup
> without migration.
>
> o When psi_sched_switch() observes that prev's task_cpu() has changed or
> the TSK_ONCPU flag is not set, a wakeup has raced with
> psi_sched_switch() trying to adjust the dequeue flag. If next is the
> same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
> again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() has already
> adjusted the PSI flags and no further changes are required to prev's
> PSI flags.
>
> With the introduction of DELAY_DEQUEUE, the requeue path is considerably
> shortened, and with the addition of bandwidth throttling in the
> __schedule() path, the race window is large enough to observe this
> issue.
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> This patch is based on tip:sched/core at commit af98d8a36a96
> ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
>
> Reproducer for the PSI splat:
>
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> # Ridiculous limit on SMP to throttle multiple rqs at once
> echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
> perf bench sched messaging -t -p -l 100000 -g 16
>
> This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) and
> also on a 32 vCPU VM.
> ---
> kernel/sched/core.c | 7 ++++-
> kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/stats.h | 16 ++++++++++-
> 3 files changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..9bbe51e44e98 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
> rq->last_seen_need_resched_ns = 0;
> #endif
>
> + /*
> + * PSI might have to deal with the consequences of newidle balance
> + * possibly dropping the rq lock and prev being requeued and selected.
> + */
> + psi_sched_switch(prev, next, block);
> +
> if (likely(prev != next)) {
> rq->nr_switches++;
> /*
> @@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
>
> migrate_disable_switch(rq, prev);
> psi_account_irqtime(rq, prev, next);
> - psi_sched_switch(prev, next, block);
>
> trace_sched_switch(preempt, prev, next, prev_state);
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84dad1511d1e..c355a6189595 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> bool sleep)
> {
> struct psi_group *group, *common = NULL;
> - int cpu = task_cpu(prev);
> + int prev_cpu, cpu;
> +
> + /* No race between psi_dequeue() and now */
> + if (prev == next && (prev->psi_flags & TSK_ONCPU))
> + return;
> +
> + prev_cpu = task_cpu(prev);
> + cpu = smp_processor_id();
>
> if (next->pid) {
> + /*
> + * If next == prev but TSK_ONCPU is cleared, the task was
> + * requeued when newidle balance dropped the rq lock and
> + * psi_enqueue() cleared the TSK_ONCPU flag.
> + */
> psi_flags_change(next, 0, TSK_ONCPU);
> /*
> * Set TSK_ONCPU on @next's cgroups. If @next shares any
> @@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> */
> group = task_psi_group(next);
> do {
> - if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> - PSI_ONCPU) {
> + /*
> + * Since newidle balance can drop the rq lock (see the next comment)
> + * there is a possibility of try_to_wake_up() migrating prev away
> + * before reaching here. Do not find common if task has migrated.
> + */
> + if (prev_cpu == cpu &&
> + (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
> common = group;
> break;
> }
> @@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> } while ((group = group->parent));
> }
>
> + /*
> + * When a task is blocked, psi_dequeue() leaves the PSI flag
> + * adjustments to psi_task_switch() however, there is a possibility of
> + * rq lock being dropped in the interim and the task being woken up
> + * again before psi_task_switch() is called leading to psi_enqueue()
> + * seeing the flags for a running task. Specifically, the following
> + * scenario is possible:
> + *
> + * __schedule()
> + * rq_lock(rq)
> + * try_to_block_task(p)
> + * psi_dequeue()
> + * [ psi_task_switch() is responsible
> + * for adjusting the PSI flags ]
> + * put_prev_entity(&p->se) try_to_wake_up(p)
> + * # no runnable task on rq->cfs ...
> + * sched_balance_newidle()
> + * raw_spin_rq_unlock(rq) __task_rq_lock(p)
> + * ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> + * __task_rq_unlock(p)
> + * raw_spin_rq_lock(rq)
> + * ...
> + * [ p was re-enqueued or has migrated away ]
> + * ...
> + * psi_task_switch() [Too late!]
> + * raw_spin_rq_unlock(rq)
> + *
> + * In the above case, psi_enqueue() can see the p->psi_flags state
> + * before it is adjusted to account for dequeue in psi_task_switch(),
> + * or psi_ttwu_dequeue() can clear the p->psi_flags which
> + * psi_task_switch() tries to adjust assuming that the entity has just
> + * finished running.
> + *
> + * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
> + * the combination of TSK_ONCPU and task_cpu(p) to catch the race
> + * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
> + * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
> + * flags already for prev on this CPU, skip adjusting flags.
> + */
> + if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
> + return;
> +
> if (prev->pid) {
> int clear = TSK_ONCPU, set = 0;
> bool wake_clock = true;
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..f09903165456 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
> if (flags & ENQUEUE_RESTORE)
> return;
>
> - if (p->se.sched_delayed) {
> + if (p->psi_flags & TSK_ONCPU) {
> + /*
> + * psi_enqueue() can race with psi_task_switch() where
> + * TSK_ONCPU will be still set for the task (see the
> + * comment in psi_task_switch())
> + *
> + * Reaching here with TSK_ONCPU is only possible when
> + * the task is being enqueued on the same CPU. Since
> + * psi_task_switch() has not had the chance to adjust
> + * the flags yet, just clear the TSK_ONCPU which yields
> + * the same result as sleep + wakeup without migration.
> + */
> + SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
> + clear = TSK_ONCPU;
> + } else if (p->se.sched_delayed) {
> /* CPU migration of "sleeping" task */
> SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
> if (p->in_memstall)
>
> base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb