Message-ID: <3c969a0b-812e-dedd-b9ed-6378f61d5735@amd.com>
Date: Tue, 24 Sep 2024 15:57:51 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Valentin Schneider <vschneid@...hat.com>, Chunxin Zang
<zangchunxin@...iang.com>, <linux-kernel@...r.kernel.org>, Oliver Sang
<oliver.sang@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Ingo Molnar
<mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking
cfs_rq->nr_running
Hello Chenyu,
On 9/23/2024 12:51 PM, Chen Yu wrote:
> Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> introduced a mechanism by which a wakee with a shorter slice can preempt
> the currently running task. It also lowers the bar for the current task
> to be preempted, by checking rq->nr_running instead of cfs_rq->nr_running
> when the current task has run out of its time slice. For example, with 1 cfs
> task and 1 rt task on the rq, before 85e511df3cec, update_deadline() would
> not trigger a reschedule, while after 85e511df3cec, since rq->nr_running
> is 2 and resched is true, resched_curr() is called.
>
> Some workloads (like the hackbench run reported by lkp) do not like
> over-scheduling. The involuntary context switch rate increased
> by 2.2%:
>
> 1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
>
> Restore its previous check criterion.
>
> Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> Reported-by: kernel test robot <oliver.sang@...el.com>
> Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
Gave it a spin on my dual socket 3rd Generation EPYC system and I do not
see as big a jump in hackbench numbers as Oliver reported, most likely
because I couldn't emulate the exact scenario where a fair task is
running while an RT task is queued. Following are numbers from my
testing:
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) preempt-fix[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.60) 1.00 [ 0.17]( 2.12)
2-groups 1.00 [ -0.00]( 1.21) 0.98 [ 2.05]( 0.95)
4-groups 1.00 [ -0.00]( 1.63) 0.97 [ 2.65]( 1.53)
8-groups 1.00 [ -0.00]( 1.34) 0.99 [ 0.81]( 1.33)
16-groups 1.00 [ -0.00]( 2.07) 0.98 [ 2.31]( 1.09)
--
Feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 225b31aaee55..2859fc7e2da2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> /*
> * The task has consumed its request, reschedule.
> */
> - return true;
> + return (cfs_rq->nr_running > 1);
Was there a strong reason why Peter decided to use "rq->nr_running"
instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
I wonder if it was to force a pick_next_task() cycle to dequeue a
possibly delayed entity, but AFAICT "cfs_rq->nr_running" should
still account for a delayed entity on the cfs_rq, so perhaps the
early return in update_curr() can just be changed to use
"cfs_rq->nr_running" as well. Not sure if I'm missing something trivial.
> }
>
> #include "pelt.h"
--
Thanks and Regards,
Prateek