Message-ID: <f18a3086-79d6-4466-a72a-29a10bc91b01@linux.dev>
Date: Fri, 27 Dec 2024 14:41:05 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: K Prateek Nayak <kprateek.nayak@....com>,
Johannes Weiner <hannes@...xchg.org>, Suren Baghdasaryan
<surenb@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
"Gautham R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [PATCH v2] psi: Fix race when task wakes up before
psi_sched_switch() adjusts flags

On 2024/12/27 14:19, K Prateek Nayak wrote:
> From: Chengming Zhou <chengming.zhou@...ux.dev>
>
> When running hackbench in a cgroup with bandwidth throttling enabled,
> the following PSI splat was observed:
>
> psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>
> When investigating the series of events leading up to the splat,
> the following sequence was observed:
>
> [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
> ...
> [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
> # CPU8 goes into newidle balance and releases the rq lock
> ...
> # CPU15 in the same LLC domain is trying to wake up hackbench(pid=1831)
> [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>
> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
> for the blocked entity. However, with the introduction of DELAY_DEQUEUE,
> the blocked task can wake up when newidle balance drops the runqueue lock
> during __schedule().
>
> If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
> any modifications in psi_enqueue(), which would otherwise still see the
> flags of a running task and not a blocked one. Instead, rely on
> psi_sched_switch() to do the right thing.
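
To spell out the psi_enqueue() side (this is just my reading of the
stats.h hunk below, with the reasoning written out as comments):

	/*
	 * p raced with __schedule(): it is being woken while it is still
	 * rq->curr on its CPU (the rq lock was dropped for newidle balance
	 * before the context switch), so its PSI flags still describe a
	 * running task. Leave them untouched; psi_sched_switch() will
	 * sort them out.
	 */
	if (task_on_cpu(task_rq(p), p))
		return;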
>
> Since the status returned by try_to_block_task() may no longer be true
> by the time __schedule() reaches psi_sched_switch(), check if the task is
> blocked or not using a combination of task_on_rq_queued() and
> p->se.sched_delayed checks.
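
And at the psi_sched_switch() call site the blocked state is then
re-derived at switch time instead of using the earlier
try_to_block_task() return value; roughly:

	/*
	 * prev is only truly blocked if it is no longer on the runqueue,
	 * or is still queued but in the delayed-dequeue state.
	 */
	psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
				     prev->se.sched_delayed);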
>
> [ prateek: Commit message, testing, early bailout in psi_enqueue() ]
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5
> Link: https://lore.kernel.org/all/409b4a72-483e-467b-8d00-9a8dae48bdc9@linux.dev/
> Signed-off-by: Chengming Zhou <chengming.zhou@...ux.dev>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>

I just noticed that the return value of try_to_block_task() is not
used anymore, not a problem though.

Reviewed-by: Chengming Zhou <chengming.zhou@...ux.dev>

Thanks!
> ---
> v1..v2:
>
> o Removed any considerations of psi_ttwu_dequeue() racing with
> psi_sched_switch() and used the solution from Chengming to only
> consider a requeue of a delayed task.
>
> o Reworded the commit message to only highlight the relevant bits and
> corrected the Fixes tag.
>
> Thank you Chengming for patiently explaining all the nuances that led to
> the splat :)
>
> This patch is based on tip:sched/core at commit af98d8a36a96
> ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
>
> Reproducer for the PSI splat:
>
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> # Ridiculous limit on SMP to throttle multiple rqs at once
> echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
> perf bench sched messaging -t -p -l 100000 -g 16
>
> The reproducer worked reliably on my 3rd Generation EPYC system
> (2 x 64C/128T) and also on a 32 vCPU VM.
> ---
>  kernel/sched/core.c  | 6 +++---
>  kernel/sched/stats.h | 4 ++++
>  2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..3d2ab0ad80c9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6643,7 +6643,6 @@ static void __sched notrace __schedule(int sched_mode)
>  	 * as a preemption by schedule_debug() and RCU.
>  	 */
>  	bool preempt = sched_mode > SM_NONE;
> -	bool block = false;
>  	unsigned long *switch_count;
>  	unsigned long prev_state;
>  	struct rq_flags rf;
> @@ -6704,7 +6703,7 @@ static void __sched notrace __schedule(int sched_mode)
>  			goto picked;
>  		}
>  	} else if (!preempt && prev_state) {
> -		block = try_to_block_task(rq, prev, prev_state);
> +		try_to_block_task(rq, prev, prev_state);
>  		switch_count = &prev->nvcsw;
>  	}
>
> @@ -6750,7 +6749,8 @@ static void __sched notrace __schedule(int sched_mode)
>
>  		migrate_disable_switch(rq, prev);
>  		psi_account_irqtime(rq, prev, next);
> -		psi_sched_switch(prev, next, block);
> +		psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
> +					     prev->se.sched_delayed);
>
>  		trace_sched_switch(preempt, prev, next, prev_state);
>
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..6ade91bce63e 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -138,6 +138,10 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
>  	if (flags & ENQUEUE_RESTORE)
>  		return;
>
> +	/* psi_sched_switch() will handle the flags */
> +	if (task_on_cpu(task_rq(p), p))
> +		return;
> +
>  	if (p->se.sched_delayed) {
>  		/* CPU migration of "sleeping" task */
>  		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
>
> base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb