Message-ID: <20df37b9-c653-49d6-83e7-da4f21d5b848@linux.dev>
Date: Thu, 26 Dec 2024 18:43:05 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: K Prateek Nayak <kprateek.nayak@....com>,
Johannes Weiner <hannes@...xchg.org>, Suren Baghdasaryan
<surenb@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Chengming Zhou <zhouchengming@...edance.com>,
Muchun Song <muchun.song@...ux.dev>,
"Gautham R. Shenoy" <gautham.shenoy@....com>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [PATCH] psi: Fix race when task wakes up before
psi_sched_switch() adjusts flags
Hi,
On 2024/12/26 13:34, K Prateek Nayak wrote:
> When running hackbench in a cgroup with bandwidth throttling enabled, the
> following PSI splat was observed:
>
> psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>
> When investigating the series of events leading up to the splat, the
> following sequence was observed:
> [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
> ...
> [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
> # CPU8 goes into newidle balance and releases the rq lock
> ...
> # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
> [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
I have a question here: why is TSK_ONCPU not set in psi_flags if the
task hasn't arrived at psi_sched_switch() yet?
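(For reference, decoding the splat with the TSK_* bits from
include/linux/psi_types.h; assuming the usual bit order there:

	TSK_IOWAIT   = 1 << 0	/* 0x1 */
	TSK_MEMSTALL = 1 << 1	/* 0x2 */
	TSK_RUNNING  = 1 << 2	/* 0x4 */
	TSK_ONCPU    = 1 << 3	/* 0x8 */

psi_flags=14 would then be TSK_MEMSTALL | TSK_RUNNING | TSK_ONCPU, and
set=4 is TSK_RUNNING.)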
> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>
> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
> for the blocked entity. However, the following race is possible with
> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
> psi_sched_switch():
Yeah, this race is introduced by the delayed dequeue changes.
In the past, a sleeping task couldn't be migrated or enqueued before it
was done in __schedule(). (finish_task(prev) clears prev->on_cpu.)
Now, ttwu_runnable() can call enqueue_task() on a delayed-dequeue task
to make it schedulable again.
But migration is still impossible, since the task is still running on
this cpu, so no psi_ttwu_dequeue() can happen, only psi_enqueue(), right?
(Actually, we could enqueue_task() there for any sleeping task, not just
delayed-dequeue ones, whenever select_task_rq() returns the same cpu as
task_cpu(p), to optimize wakeup latency; maybe I'll submit a patch for
that later.)
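To make the window concrete, the wakeup side now looks roughly like
this (a sketch abridged from ttwu_runnable(); the preemption check and
stats handling are elided, so don't take it as the exact mainline code):

	static int ttwu_runnable(struct task_struct *p, int wake_flags)
	{
		struct rq_flags rf;
		struct rq *rq;
		int ret = 0;

		rq = __task_rq_lock(p, &rf);
		if (task_on_rq_queued(p)) {
			update_rq_clock(rq);
			/*
			 * A delayed-dequeue task is re-enqueued right here,
			 * possibly while its CPU sits in newidle balance
			 * with the rq lock dropped, so psi_enqueue() can
			 * run before psi_sched_switch() fixes up the flags.
			 */
			if (p->se.sched_delayed)
				enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
			ttwu_do_wakeup(p);
			ret = 1;
		}
		__task_rq_unlock(rq, &rf);

		return ret;
	}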
>
> __schedule()
> rq_lock(rq)
> try_to_block_task(p)
> psi_dequeue()
> [ psi_task_switch() is responsible
> for adjusting the PSI flags ]
> put_prev_entity(&p->se) try_to_wake_up(p)
> # no runnable task on rq->cfs ...
> sched_balance_newidle()
> raw_spin_rq_unlock(rq) __task_rq_lock(p)
> ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> __task_rq_unlock(p)
> raw_spin_rq_lock(rq)
> ...
> [ p was re-enqueued or has migrated away ]
Here ttwu_runnable() calls enqueue_task() for the delayed-dequeue task;
migration can't happen since p->on_cpu is still true.
> ...
> psi_task_switch() [Too late!]
> raw_spin_rq_unlock(rq)
>
> The wakeup context will see the flags for a running task when the flags
> should have reflected the task being blocked. Similarly, a migration
> context in the wakeup path can clear the flags that psi_sched_switch()
> assumes will be set (TSK_ONCPU / TSK_RUNNING).
In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
should do nothing at all.
Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
the PSI POV this task hasn't gone to sleep at all, and psi_enqueue()
should NOT change any state either. (It's not a wakeup or a migration
from the PSI POV.)
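Something like this untested sketch in kernel/sched/stats.h is what I'd
expect (detecting the deferred dequeue via TSK_ONCPU still being set,
per the reasoning above):

	static inline void psi_enqueue(struct task_struct *p, int flags)
	{
		...
		/*
		 * psi_dequeue() deferred its work to psi_sched_switch():
		 * from the PSI POV this task never went to sleep, so this
		 * is neither a wakeup nor a migration; change nothing.
		 */
		if (p->psi_flags & TSK_ONCPU)
			return;
		...
	}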
And the current code, "psi_sched_switch(prev, next, block);", looks
buggy to me too! The "block" value comes from try_to_block_task(), and
pick_next_task() may then drop and regain the rq lock, so we can't pass
that stale value to psi_sched_switch().
Before, we used "task_on_rq_queued(prev)"; now we also have to consider
the delayed dequeue case, so it should be:
"!task_on_rq_queued(prev) || prev->se.sched_delayed"
Thanks!
>
> Since the TSK_ONCPU flag has to be modified with the rq lock of
> task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
> prevent the race. Specifically:
>
> o psi_enqueue() will clear the TSK_ONCPU flag when it finds it set.
> psi_enqueue() will only be called with TSK_ONCPU set when the task is
> being requeued on the same CPU. If the task was migrated,
> psi_ttwu_dequeue() would have already cleared the PSI flags.
>
> psi_enqueue() cannot guarantee that this same task will be picked
> again when the scheduling CPU returns from newidle balance, which is
> why it clears TSK_ONCPU to mimic the net result of sleep + wakeup
> without migration.
>
> o When psi_sched_switch() observes that prev's task_cpu() has changed or
> the TSK_ONCPU flag is not set, a wakeup has raced with
> psi_sched_switch() trying to adjust the dequeue flag. If next is the
> same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
> again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() has already
> adjusted the PSI flags and no further changes are required to prev's
> PSI flags.
>
> With the introduction of DELAY_DEQUEUE, the requeue path is considerably
> shortened, and with the addition of bandwidth throttling in the
> __schedule() path, the race window is large enough to observe this
> issue.
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> This patch is based on tip:sched/core at commit af98d8a36a96
> ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
>
> Reproducer for the PSI splat:
>
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> # Ridiculous limit on SMP to throttle multiple rqs at once
> echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
> perf bench sched messaging -t -p -l 100000 -g 16
>
> This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) and
> also on a 32 vCPU VM.
> ---
> kernel/sched/core.c | 7 ++++-
> kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/stats.h | 16 ++++++++++-
> 3 files changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..9bbe51e44e98 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
> rq->last_seen_need_resched_ns = 0;
> #endif
>
> + /*
> + * PSI might have to deal with the consequences of newidle balance
> + * possibly dropping the rq lock and prev being requeued and selected.
> + */
> + psi_sched_switch(prev, next, block);
> +
> if (likely(prev != next)) {
> rq->nr_switches++;
> /*
> @@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
>
> migrate_disable_switch(rq, prev);
> psi_account_irqtime(rq, prev, next);
> - psi_sched_switch(prev, next, block);
>
> trace_sched_switch(preempt, prev, next, prev_state);
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84dad1511d1e..c355a6189595 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> bool sleep)
> {
> struct psi_group *group, *common = NULL;
> - int cpu = task_cpu(prev);
> + int prev_cpu, cpu;
> +
> + /* No race between psi_dequeue() and now */
> + if (prev == next && (prev->psi_flags & TSK_ONCPU))
> + return;
> +
> + prev_cpu = task_cpu(prev);
> + cpu = smp_processor_id();
>
> if (next->pid) {
> + /*
> + * If next == prev but TSK_ONCPU is cleared, the task was
> + * requeued when newidle balance dropped the rq lock and
> + * psi_enqueue() cleared the TSK_ONCPU flag.
> + */
> psi_flags_change(next, 0, TSK_ONCPU);
> /*
> * Set TSK_ONCPU on @next's cgroups. If @next shares any
> @@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> */
> group = task_psi_group(next);
> do {
> - if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> - PSI_ONCPU) {
> + /*
> + * Since newidle balance can drop the rq lock (see the next comment)
> + * there is a possibility of try_to_wake_up() migrating prev away
> + * before reaching here. Do not find common if task has migrated.
> + */
> + if (prev_cpu == cpu &&
> + (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
> common = group;
> break;
> }
> @@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> } while ((group = group->parent));
> }
>
> + /*
> + * When a task is blocked, psi_dequeue() leaves the PSI flag
> + * adjustments to psi_task_switch() however, there is a possibility of
> + * rq lock being dropped in the interim and the task being woken up
> + * again before psi_task_switch() is called leading to psi_enqueue()
> + * seeing the flags for a running task. Specifically, the following
> + * scenario is possible:
> + *
> + * __schedule()
> + * rq_lock(rq)
> + * try_to_block_task(p)
> + * psi_dequeue()
> + * [ psi_task_switch() is responsible
> + * for adjusting the PSI flags ]
> + * put_prev_entity(&p->se) try_to_wake_up(p)
> + * # no runnable task on rq->cfs ...
> + * sched_balance_newidle()
> + * raw_spin_rq_unlock(rq) __task_rq_lock(p)
> + * ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> + * __task_rq_unlock(p)
> + * raw_spin_rq_lock(rq)
> + * ...
> + * [ p was re-enqueued or has migrated away ]
> + * ...
> + * psi_task_switch() [Too late!]
> + * raw_spin_rq_unlock(rq)
> + *
> + * In the above case, psi_enqueue() can see the p->psi_flags state
> + * before it is adjusted to account for dequeue in psi_task_switch(),
> + * or psi_ttwu_dequeue() can clear the p->psi_flags which
> + * psi_task_switch() tries to adjust assuming that the entity has just
> + * finished running.
> + *
> + * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
> + * the combination of TSK_ONCPU and task_cpu(p) to catch the race
> + * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
> + * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
> + * flags already for prev on this CPU, skip adjusting flags.
> + */
> + if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
> + return;
> +
> if (prev->pid) {
> int clear = TSK_ONCPU, set = 0;
> bool wake_clock = true;
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..f09903165456 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
> if (flags & ENQUEUE_RESTORE)
> return;
>
> - if (p->se.sched_delayed) {
> + if (p->psi_flags & TSK_ONCPU) {
> + /*
> + * psi_enqueue() can race with psi_task_switch() where
> + * TSK_ONCPU will be still set for the task (see the
> + * comment in psi_task_switch())
> + *
> + * Reaching here with TSK_ONCPU is only possible when
> + * the task is being enqueued on the same CPU. Since
> + * psi_task_switch() has not had the chance to adjust
> + * the flags yet, just clear the TSK_ONCPU which yields
> + * the same result as sleep + wakeup without migration.
> + */
> + SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
> + clear = TSK_ONCPU;
> + } else if (p->se.sched_delayed) {
> /* CPU migration of "sleeping" task */
> SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
> if (p->in_memstall)
>
> base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb