[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8df808ca-186d-41f8-845c-c42fd2fd4d45@amd.com>
Date: Tue, 26 Nov 2024 11:02:51 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Mike Galbraith <efault@....de>, Phil Auld <pauld@...hat.com>
CC: Peter Zijlstra <peterz@...radead.org>, <mingo@...hat.com>,
<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<wuyun.abel@...edance.com>, <youssefesmat@...omium.org>, <tglx@...utronix.de>
Subject: Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to
a busy CPU
Hello Mike,
On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> [..snip..]
>
> Question: did wiping off the evil leave any meaningful goodness behind?
>
> ---
>
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> V2: do not rip buddies apart, convenient on/off switch
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@....de>
> ---
> kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
> kernel/sched/features.h | 5 ++++
> kernel/sched/sched.h | 5 ++++
> 3 files changed, 42 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> + int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along in ttwu() path.
> + * Note: Do not rip buddies apart else chaos follows.
> + */
> + if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
encounter a delayed task?
> + !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
Technically, we are still looking at the last wakeup here since
record_wakee() is only called later. If we care about 1:1 buddies,
should we just see "current == p->last_wakee", otherwise, there is a
good chance "p" has a m:n waker-wakee relationship in which case
perhaps a "want_affine" like heuristic can help?
For science, I was wondering if the decision to dequeue + migrate or
requeue the delayed task can be put off until after the whole
select_task_rq() target selection (note: without the h_nr_delayed
stuff, some of that wake_affine_idle() logic falls apart). Hackbench
(which saw some regression with EEVDF Complete) seem to like it
somewhat, but it still falls behind NO_DELAY_DEQUEUE.
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
NO_DELAY_DEQUEUE Mike's v2 Full ttwu + requeue/migrate
5.76 5.72 ( 1% ) 5.82 ( -1% )
6.53 6.56 ( 0% ) 6.65 ( -2% )
6.79 7.04 ( -4% ) 7.02 ( -3% )
6.91 7.04 ( -2% ) 7.03 ( -2% )
7.63 8.05 ( -6% ) 7.88 ( -3% )
Only subtle changes in IBS profiles; there aren't any obvious shift
in hotspots with hackbench at least. Not sure if it is just the act of
needing to do a dequeue + enqueue from the wakeup context that adds to
the overall regression.
--
Thanks and Regards,
Prateek
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> [..snip..]
Powered by blists - more mailing lists