Message-ID: <8001a9bc-3f87-403c-a091-215c56c9ebf0@arm.com>
Date: Fri, 22 Dec 2023 14:51:17 +0000
From: Metin Kaya <metin.kaya@....com>
To: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>
Cc: Joel Fernandes <joelaf@...gle.com>, Qais Yousef <qyousef@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Youssef Esmat
<youssefesmat@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak
<kprateek.nayak@....com>, Thomas Gleixner <tglx@...utronix.de>,
kernel-team@...roid.com
Subject: Re: [PATCH v7 23/23] sched: Fix rt/dl load balancing via chain level
balance
On 20/12/2023 12:18 am, John Stultz wrote:
> From: Connor O'Brien <connoro@...gle.com>
>
> RT/DL balancing is supposed to guarantee that with N cpus available &
> CPU affinity permitting, the top N RT/DL tasks will get spread across
> the CPUs and all get to run. Proxy exec greatly complicates this as
> blocked tasks remain on the rq but cannot be usefully migrated away
> from their lock owning tasks. This has two major consequences:
> 1. In order to get the desired properties we need to migrate a blocked
> task, its would-be proxy, and everything in between, all together -
> i.e., we need to push/pull "blocked chains" rather than individual
> tasks.
> 2. Tasks that are part of rq->curr's "blocked tree" therefore should
> not be pushed or pulled. Options for enforcing this seem to include
> a) create some sort of complex data structure for tracking
> pushability, updating it whenever the blocked tree for rq->curr
> changes (e.g. on mutex handoffs, migrations, etc.) as well as on
> context switches.
> b) give up on O(1) pushability checks, and search through the pushable
> list every push/pull until we find a pushable "chain"
> c) Extend option "b" with some sort of caching to avoid repeated work.
>
> For the sake of simplicity & separating the "chain level balancing"
> concerns from complicated optimizations, this patch focuses on trying
> to implement option "b" correctly. This can then hopefully provide a
> baseline for "correct load balancing behavior" that optimizations can
> try to implement more efficiently.
>
> Note:
> The inability to atomically check "is task enqueued on a specific rq"
> creates 2 possible races when following a blocked chain:
> - If we check task_rq() first on a task that is dequeued from its rq,
> it can be woken and enqueued on another rq before the call to
> task_on_rq_queued()
> - If we call task_on_rq_queued() first on a task that is on another
> rq, it can be dequeued (since we don't hold its rq's lock) and then
> be set to the current rq before we check task_rq().
>
> Maybe there's a more elegant solution that would work, but for now,
> just sandwich the task_rq() check between two task_on_rq_queued()
> checks, all separated by smp_rmb() calls. Since we hold rq's lock,
> task can't be enqueued or dequeued from rq, so neither race should be
> possible.
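Not a change request, just to confirm I read the sandwich above correctly: is it
roughly the below (illustrative only, the exact return values obviously depend on
the caller)?

	/* Queued check / rq check / queued check, separated by smp_rmb(). */
	if (!task_on_rq_queued(task))
		return false;
	smp_rmb();
	if (task_rq(task) != rq)
		return false;
	smp_rmb();
	if (!task_on_rq_queued(task))
		return false;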
>
> extensive comments on various pitfalls, races, etc. included inline.
>
> This patch was broken out from a larger chain migration
> patch originally by Connor O'Brien.
>
> Cc: Joel Fernandes <joelaf@...gle.com>
> Cc: Qais Yousef <qyousef@...gle.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
> Cc: Youssef Esmat <youssefesmat@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Waiman Long <longman@...hat.com>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: "Paul E. McKenney" <paulmck@...nel.org>
> Cc: Metin Kaya <Metin.Kaya@....com>
> Cc: Xuewen Yan <xuewen.yan94@...il.com>
> Cc: K Prateek Nayak <kprateek.nayak@....com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: kernel-team@...roid.com
> Signed-off-by: Connor O'Brien <connoro@...gle.com>
> [jstultz: split out from larger chain migration patch,
> majorly refactored for runtime conditionalization]
> Signed-off-by: John Stultz <jstultz@...gle.com>
> ---
> v7:
> * Split out from larger chain-migration patch in earlier
> versions of this series
> * Larger rework to allow proper conditionalization of the
> logic when running with CONFIG_SCHED_PROXY_EXEC
> ---
> kernel/sched/core.c | 77 +++++++++++++++++++++++-
> kernel/sched/deadline.c | 98 +++++++++++++++++++++++-------
> kernel/sched/rt.c | 130 ++++++++++++++++++++++++++++++++--------
> kernel/sched/sched.h | 18 +++++-
> 4 files changed, 273 insertions(+), 50 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 77a79d5f829a..30dfb6f14f2b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3923,7 +3923,6 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> return p;
>
> lockdep_assert_rq_held(rq);
> -
> for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx);
> exec_ctx = owner) {
> mutex = exec_ctx->blocked_on;
> @@ -3938,6 +3937,82 @@ struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> }
> return exec_ctx;
> }
> +
> +#ifdef CONFIG_SMP
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + struct task_struct *owner;
> +
> + if (!sched_proxy_exec()) {
> + __push_task_chain(rq, dst_rq, task);
> + return;
> + }
> +
> + lockdep_assert_rq_held(rq);
> + lockdep_assert_rq_held(dst_rq);
> +
> + BUG_ON(!task_queued_on_rq(rq, task));
> + BUG_ON(task_current_selected(rq, task));
> +
> + while (task) {
> + if (!task_queued_on_rq(rq, task) || task_current_selected(rq, task))
> + break;
> +
> + if (task_is_blocked(task))
> + owner = __mutex_owner(task->blocked_on);
> + else
> + owner = NULL;
> + __push_task_chain(rq, dst_rq, task);
> + if (task == owner)
> + break;
> + task = owner;
> + }
> +}
> +
> +/*
> + * Returns:
> + * 1 if chain is pushable and affinity does not prevent pushing to cpu
> + * 0 if chain is unpushable
> + * -1 if chain is pushable but affinity blocks running on cpu.
> + */
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +{
> + struct task_struct *exec_ctx;
> +
> + if (!sched_proxy_exec())
> + return __task_is_pushable(rq, p, cpu);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (task_rq(p) != rq || !task_on_rq_queued(p))
> + return 0;
> +
> + exec_ctx = find_exec_ctx(rq, p);
> + /*
> + * Chain leads off the rq, we're free to push it anywhere.
> + *
> + * One wrinkle with relying on find_exec_ctx is that when the chain
> + * leads to a task currently migrating to rq, we see the chain as
> + * pushable & push everything prior to the migrating task. Even if
> + * we checked explicitly for this case, we could still race with a
> + * migration after the check.
> + * This shouldn't permanently produce a bad state though, as proxy()
Nit: s/proxy()/find_proxy_task()/ in the comment above, to match the current
function name.
> + * will send the chain back to rq and by that point the migration
> + * should be complete & a proper push can occur.
> + */
> + if (!exec_ctx)
> + return 1;
> +
> + if (task_on_cpu(rq, exec_ctx) || exec_ctx->nr_cpus_allowed <= 1)
> + return 0;
> +
> + return cpumask_test_cpu(cpu, &exec_ctx->cpus_mask) ? 1 : -1;
> +}
> +#else /* !CONFIG_SMP */
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> +}
> +#endif /* CONFIG_SMP */
> #else /* !CONFIG_SCHED_PROXY_EXEC */
> static inline void do_activate_task(struct rq *rq, struct task_struct *p,
> int en_flags)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8b5701727342..b7be888c1635 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2172,8 +2172,77 @@ static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec
> return -1;
> }
>
> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> +{
> + struct task_struct *p = NULL;
> + struct rb_node *next_node;
> +
> + if (!has_pushable_dl_tasks(rq))
> + return NULL;
> +
> + next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> +
> +next_node:
> + if (next_node) {
> + p = __node_2_pdl(next_node);
> +
> + /*
> + * cpu argument doesn't matter because we treat a -1 result
> + * (pushable but can't go to cpu0) the same as a 1 result
> + * (pushable to cpu0). All we care about here is general
> + * pushability.
> + */
> + if (task_is_pushable(rq, p, 0))
> + return p;
> +
> + next_node = rb_next(next_node);
> + goto next_node;
> + }
> +
> + if (!p)
> + return NULL;
> +
> + WARN_ON_ONCE(rq->cpu != task_cpu(p));
> + WARN_ON_ONCE(task_current(rq, p));
> + WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> +
> + WARN_ON_ONCE(!task_on_rq_queued(p));
> + WARN_ON_ONCE(!dl_task(p));
> +
> + return p;
> +}
> +
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> - struct rq *later)
> + struct rq *later, bool *retry)
> +{
> + if (!dl_task(task) || is_migration_disabled(task))
> + return false;
> +
> + if (rq != this_rq()) {
> + struct task_struct *next_task = pick_next_pushable_dl_task(rq);
> +
> + if (next_task == task) {
Nit: we could `return false;` when next_task != task and save one level of
indentation.
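Something like this (completely untested), with the local-rq else branch left
as it is:

	if (rq != this_rq()) {
		struct task_struct *next_task = pick_next_pushable_dl_task(rq);
		struct task_struct *exec_ctx;

		/* Bail out early instead of nesting the common path. */
		if (next_task != task)
			return false;

		exec_ctx = find_exec_ctx(rq, next_task);
		*retry = (exec_ctx &&
			  !cpumask_test_cpu(later->cpu, &exec_ctx->cpus_mask));
	}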
> + struct task_struct *exec_ctx;
> +
> + exec_ctx = find_exec_ctx(rq, next_task);
> + *retry = (exec_ctx && !cpumask_test_cpu(later->cpu,
> + &exec_ctx->cpus_mask));
> + } else {
> + return false;
> + }
> + } else {
> + int pushable = task_is_pushable(rq, task, later->cpu);
> +
> + *retry = pushable == -1;
> + if (!pushable)
> + return false;
`return pushable;` can replace the above 2 lines (the implicit conversion to
bool maps both 1 and -1 to true). The same applies to rt_revalidate_rq_state().
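In the dl case that would be roughly:

		int pushable = task_is_pushable(rq, task, later->cpu);

		*retry = pushable == -1;
		return pushable;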
> + }
> + return true;
> +}
> +#else
> +static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *later, bool *retry)
> {
> if (task_rq(task) != rq)
> return false;
> @@ -2195,16 +2264,18 @@ static inline bool dl_revalidate_rq_state(struct task_struct *task, struct rq *r
>
> return true;
> }
> -
> +#endif
> /* Locks the rq it finds */
> static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> {
> struct task_struct *exec_ctx;
> struct rq *later_rq = NULL;
> + bool retry;
> int tries;
> int cpu;
>
> for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> + retry = false;
> exec_ctx = find_exec_ctx(rq, task);
> if (!exec_ctx)
> break;
> @@ -2228,7 +2299,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>
> /* Retry if something changed. */
> if (double_lock_balance(rq, later_rq)) {
> - if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq))) {
> + if (unlikely(!dl_revalidate_rq_state(task, rq, later_rq, &retry))) {
> double_unlock_balance(rq, later_rq);
> later_rq = NULL;
> break;
> @@ -2240,7 +2311,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> * its earliest one has a later deadline than our
> * task, the rq is a good one.
> */
> - if (dl_task_is_earliest_deadline(task, later_rq))
> + if (!retry && dl_task_is_earliest_deadline(task, later_rq))
> break;
>
> /* Otherwise we try again. */
> @@ -2251,25 +2322,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> return later_rq;
> }
>
> -static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> -{
> - struct task_struct *p;
> -
> - if (!has_pushable_dl_tasks(rq))
> - return NULL;
> -
> - p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> -
> - WARN_ON_ONCE(rq->cpu != task_cpu(p));
> - WARN_ON_ONCE(task_current(rq, p));
> - WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
> -
> - WARN_ON_ONCE(!task_on_rq_queued(p));
> - WARN_ON_ONCE(!dl_task(p));
> -
> - return p;
> -}
> -
> /*
> * See if the non running -deadline tasks on this rq
> * can be sent to some other CPU where they can preempt
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index fabb19891e95..d5ce95dc5c09 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1935,8 +1935,108 @@ static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exe
> return -1;
> }
>
> +static struct task_struct *pick_next_pushable_task(struct rq *rq)
> +{
> + struct plist_head *head = &rq->rt.pushable_tasks;
> + struct task_struct *p, *push_task = NULL;
> +
> + if (!has_pushable_tasks(rq))
> + return NULL;
> +
> + plist_for_each_entry(p, head, pushable_tasks) {
> + if (task_is_pushable(rq, p, 0)) {
> + push_task = p;
> + break;
> + }
> + }
> +
> + if (!push_task)
> + return NULL;
> +
> + BUG_ON(rq->cpu != task_cpu(push_task));
> + BUG_ON(task_current(rq, push_task) || task_current_selected(rq, push_task));
> + BUG_ON(!task_on_rq_queued(push_task));
> + BUG_ON(!rt_task(push_task));
> +
> + return p;
> +}
> +
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> + struct rq *lowest, bool *retry)
This function could be consolidated with dl_revalidate_rq_state(), as you noted
in the previous patch, although rt_revalidate_rq_state() carries a few extra
comments that would need to be preserved.
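Rough, untested sketch of what a shared helper could look like (naming made up),
with the dl_task()/rt_task() check left in thin per-class wrappers and the
pushable-list lookup passed in:

static inline bool revalidate_rq_state(struct task_struct *task, struct rq *rq,
				       struct rq *target, bool *retry,
				       struct task_struct *(*pick_pushable)(struct rq *))
{
	int pushable;

	if (is_migration_disabled(task))
		return false;

	if (rq != this_rq()) {
		/*
		 * Remote rq: re-check via the pushable list, since the task
		 * may have run and been dequeued after we dropped the lock.
		 */
		struct task_struct *next_task = pick_pushable(rq);
		struct task_struct *exec_ctx;

		if (next_task != task)
			return false;

		exec_ctx = find_exec_ctx(rq, next_task);
		*retry = (exec_ctx &&
			  !cpumask_test_cpu(target->cpu, &exec_ctx->cpus_mask));
		return true;
	}

	/* Local rq: -1 ("pushable, but not to target") asks the caller to retry. */
	pushable = task_is_pushable(rq, task, target->cpu);
	*retry = pushable == -1;
	return pushable;
}

dl_revalidate_rq_state() would then reduce to something like
`return dl_task(task) && revalidate_rq_state(task, rq, later, retry, pick_next_pushable_dl_task);`
(and similarly for RT), with the extra comments from the RT version moving into
the shared copy.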
> +{
> + /*
> + * Releasing the rq lock means we need to re-check pushability.
> + * Some scenarios:
> + * 1) If a migration from another CPU sent a task/chain to rq
> + * that made task newly unpushable by completing a chain
> + * from task to rq->curr, then we need to bail out and push something
> + * else.
> + * 2) If our chain led off this CPU or to a dequeued task, the last waiter
> + * on this CPU might have acquired the lock and woken (or even migrated
> + * & run, handed off the lock it held, etc...). This can invalidate the
> + * result of find_lowest_rq() if our chain previously ended in a blocked
> + * task whose affinity we could ignore, but now ends in an unblocked
> + * task that can't run on lowest_rq.
> + * 3) Race described at https://lore.kernel.org/all/1523536384-26781-2-git-send-email-huawei.libin@huawei.com/
> + *
> + * Notes on these:
> + * - Scenario #2 is properly handled by rerunning find_lowest_rq
> + * - Scenario #1 requires that we fail
> + * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the
> + * suggested fix is not universally correct now that push_cpu_stop() can
> + * call this function.
> + */
> + if (!rt_task(task) || is_migration_disabled(task)) {
> + return false;
> + } else if (rq != this_rq()) {
Nit: the `else` can be dropped here since the previous branch returns, matching
dl_revalidate_rq_state().
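i.e.:

	if (!rt_task(task) || is_migration_disabled(task))
		return false;

	if (rq != this_rq()) {
		...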
> + /*
> + * If we are dealing with a remote rq, then all bets are off
> + * because task might have run & then been dequeued since we
> + * released the lock, at which point our normal checks can race
> + * with migration, as described in
> + * https://lore.kernel.org/all/1523536384-26781-2-git-send-email-huawei.libin@huawei.com/
> + * Need to repick to ensure we avoid a race.
> + * But re-picking would be unnecessary & incorrect in the
> + * push_cpu_stop() path.
> + */
> + struct task_struct *next_task = pick_next_pushable_task(rq);
> +
> + if (next_task == task) {
> + struct task_struct *exec_ctx;
> +
> + exec_ctx = find_exec_ctx(rq, next_task);
> + *retry = (exec_ctx &&
> + !cpumask_test_cpu(lowest->cpu,
> + &exec_ctx->cpus_mask));
> + } else {
> + return false;
> + }
> + } else {
> + /*
> + * Chain level balancing introduces new ways for our choice of
> + * task & rq to become invalid when we release the rq lock, e.g.:
> + * 1) Migration to rq from another CPU makes task newly unpushable
> + * by completing a "blocked chain" from task to rq->curr.
> + * Fail so a different task can be chosen for push.
> + * 2) In cases where task's blocked chain led to a dequeued task
> + * or one on another rq, the last waiter in the chain on this
> + * rq might have acquired the lock and woken, meaning we must
> + * pick a different rq if its affinity prevents running on
> + * lowest rq.
> + */
> + int pushable = task_is_pushable(rq, task, lowest->cpu);
> +
> + *retry = pushable == -1;
> + if (!pushable)
> + return false;
> + }
> +
> + return true;
> +}
> +#else /* !CONFIG_SCHED_PROXY_EXEC */
> static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *rq,
> - struct rq *lowest)
> + struct rq *lowest, bool *retry)
> {
> /*
> * We had to unlock the run queue. In
> @@ -1967,16 +2067,19 @@ static inline bool rt_revalidate_rq_state(struct task_struct *task, struct rq *r
>
> return true;
> }
> +#endif
>
> /* Will lock the rq it finds */
> static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> {
> struct task_struct *exec_ctx;
> struct rq *lowest_rq = NULL;
> + bool retry;
> int tries;
> int cpu;
>
> for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> + retry = false;
> exec_ctx = find_exec_ctx(rq, task);
> cpu = find_lowest_rq(task, exec_ctx);
>
> @@ -1997,7 +2100,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>
> /* if the prio of this runqueue changed, try again */
> if (double_lock_balance(rq, lowest_rq)) {
> - if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq))) {
> + if (unlikely(!rt_revalidate_rq_state(task, rq, lowest_rq, &retry))) {
> double_unlock_balance(rq, lowest_rq);
> lowest_rq = NULL;
> break;
> @@ -2005,7 +2108,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> }
>
> /* If this rq is still suitable use it. */
> - if (lowest_rq->rt.highest_prio.curr > task->prio)
> + if (lowest_rq->rt.highest_prio.curr > task->prio && !retry)
> break;
>
> /* try again */
> @@ -2016,27 +2119,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
> return lowest_rq;
> }
>
> -static struct task_struct *pick_next_pushable_task(struct rq *rq)
> -{
> - struct task_struct *p;
> -
> - if (!has_pushable_tasks(rq))
> - return NULL;
> -
> - p = plist_first_entry(&rq->rt.pushable_tasks,
> - struct task_struct, pushable_tasks);
> -
> - BUG_ON(rq->cpu != task_cpu(p));
> - BUG_ON(task_current(rq, p));
> - BUG_ON(task_current_selected(rq, p));
> - BUG_ON(p->nr_cpus_allowed <= 1);
> -
> - BUG_ON(!task_on_rq_queued(p));
> - BUG_ON(!rt_task(p));
> -
> - return p;
> -}
> -
> /*
> * If the current CPU has more than one RT task, see if the non
> * running task can migrate over to a CPU that is running a task
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6cd473224cfe..4b97b36be691 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3548,7 +3548,7 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> #ifdef CONFIG_SMP
> static inline
> -void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +void __push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> {
> deactivate_task(rq, task, 0);
> set_task_cpu(task, dst_rq->cpu);
> @@ -3556,7 +3556,7 @@ void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> }
>
> static inline
> -int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +int __task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> {
> if (!task_on_cpu(rq, p) &&
> cpumask_test_cpu(cpu, &p->cpus_mask))
> @@ -3566,8 +3566,22 @@ int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> }
>
> #ifdef CONFIG_SCHED_PROXY_EXEC
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task);
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu);
> struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p);
> #else /* !CONFIG_SCHED_PROXY_EXEC */
> +static inline
> +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task)
> +{
> + __push_task_chain(rq, dst_rq, task);
> +}
> +
> +static inline
> +int task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
> +{
> + return __task_is_pushable(rq, p, cpu);
> +}
> +
> static inline
> struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p)
> {