linux-kernel - Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8df808ca-186d-41f8-845c-c42fd2fd4d45@amd.com>
Date: Tue, 26 Nov 2024 11:02:51 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Mike Galbraith <efault@....de>, Phil Auld <pauld@...hat.com>
CC: Peter Zijlstra <peterz@...radead.org>, <mingo@...hat.com>,
	<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<wuyun.abel@...edance.com>, <youssefesmat@...omium.org>, <tglx@...utronix.de>
Subject: Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to
 a busy CPU

Hello Mike,

On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> [..snip..]
> 
> Question: did wiping off the evil leave any meaningful goodness behind?
> 
> ---
> 
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> 
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
> 
> V2: do not rip buddies apart, convenient on/off switch
> 
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@....de>
> ---
>   kernel/sched/core.c     |   51 ++++++++++++++++++++++++++++++------------------
>   kernel/sched/features.h |    5 ++++
>   kernel/sched/sched.h    |    5 ++++
>   3 files changed, 42 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
>    */
>   static int ttwu_runnable(struct task_struct *p, int wake_flags)
>   {
> -	struct rq_flags rf;
> -	struct rq *rq;
> -	int ret = 0;
> -
> -	rq = __task_rq_lock(p, &rf);
> -	if (task_on_rq_queued(p)) {
> -		update_rq_clock(rq);
> -		if (p->se.sched_delayed)
> -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -		if (!task_on_cpu(rq, p)) {
> -			/*
> -			 * When on_rq && !on_cpu the task is preempted, see if
> -			 * it should preempt the task that is current now.
> -			 */
> -			wakeup_preempt(rq, p, wake_flags);
> +	CLASS(__task_rq_lock, rq_guard)(p);
> +	struct rq *rq = rq_guard.rq;
> +
> +	if (!task_on_rq_queued(p))
> +		return 0;
> +
> +	update_rq_clock(rq);
> +	if (p->se.sched_delayed) {
> +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +		int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> +		/*
> +		 * Since sched_delayed means we cannot be current anywhere,
> +		 * dequeue it here and have it fall through to the
> +		 * select_task_rq() case further along in ttwu() path.
> +		 * Note: Do not rip buddies apart else chaos follows.
> +		 */
> +		if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&

Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
encounter a delayed task?

> +		    !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {

Technically, we are still looking at the last wakeup here since
record_wakee() is only called later. If we care about 1:1 buddies,
should we just see "current == p->last_wakee", otherwise, there is a
good chance "p" has a m:n waker-wakee relationship in which case
perhaps a "want_affine" like heuristic can help?

For science, I was wondering if the decision to dequeue + migrate or
requeue the delayed task can be put off until after the whole
select_task_rq() target selection (note: without the h_nr_delayed
stuff, some of that wake_affine_idle() logic falls apart). Hackbench
(which saw some regression with EEVDF Complete) seem to like it
somewhat, but it still falls behind NO_DELAY_DEQUEUE.

    ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     NO_DELAY_DEQUEUE	     Mike's v2	  Full ttwu + requeue/migrate
     5.76	           5.72  (  1% )        5.82  ( -1% )
     6.53	           6.56  (  0% )        6.65  ( -2% )
     6.79	           7.04  ( -4% )        7.02  ( -3% )
     6.91	           7.04  ( -2% )        7.03  ( -2% )
     7.63	           8.05  ( -6% )        7.88  ( -3% )

Only subtle changes in IBS profiles; there aren't any obvious shift
in hotspots with hackbench at least. Not sure if it is just the act of
needing to do a dequeue + enqueue from the wakeup context that adds to
the overall regression.

-- 
Thanks and Regards,
Prateek

> +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +			return 0;
>   		}
> -		ttwu_do_wakeup(p);
> -		ret = 1;
> +
> +		enqueue_task(rq, p, queue_flags);
> +	}
> +	if (!task_on_cpu(rq, p)) {
> +		/*
> +		 * When on_rq && !on_cpu the task is preempted, see if
> +		 * it should preempt the task that is current now.
> +		 */
> +		wakeup_preempt(rq, p, wake_flags);
>   	}
> -	__task_rq_unlock(rq, &rf);
> +	ttwu_do_wakeup(p);
> 
> -	return ret;
> +	return 1;
>   }
> 
>   #ifdef CONFIG_SMP
> [..snip..]