Message-ID: <ZNWgAeN/EVS/vOLi@chenyu5-mobl2.bbrouter>
Date:   Fri, 11 Aug 2023 10:42:09 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     kernel test robot <oliver.sang@...el.com>,
        <oe-lkp@...ts.linux.dev>, <lkp@...el.com>,
        <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        "Ingo Molnar" <mingo@...nel.org>
Subject: Re: [tip:sched/eevdf] [sched/fair]  e0c2ff903c:
 phoronix-test-suite.blogbench.Write.final_score -34.8% regression

Hi Peter,

On 2023-08-11 at 09:11:21 +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> > 
> > 
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> > 
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> > 
> > 	test: blogbench-1.1.0
> > 	option_a: Write
> > 	cpufreq_governor: performance
> > 
> 
> It seems that commit e0c2ff903c32 removed the sched_feat(START_DEBIT) for the initial
> task, but it also increases the vruntime of non-initial tasks:
> Before e0c2ff903c32, the vruntime for an enqueued task is:
> cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime for an enqueued task is:
> avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
>                      = \Sum v_i / nr_tasks   (when all weights are equal)
> which is usually higher than cfs_rq->min_vruntime, so we give less sleep bonus to
> the wakee, which could impact different workloads to different degrees.
> But since we later switched to lag-based placement, the lag is subtracted from this
> new vruntime, which could mitigate the problem.
> 
>

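As a concrete illustration (numbers made up for the example): with three queued
entities of equal weight whose vruntimes are 100, 200 and 300, cfs_rq->min_vruntime
is 100 while the average is (100 + 200 + 300) / 3 = 200. A wakee placed at the
average therefore starts with a vruntime 100 higher than it would get from
min_vruntime placement, i.e. it is scheduled correspondingly later, which is the
reduced sleep bonus mentioned above.
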
Since lkp previously reported a hackbench regression with the eevdf policy enabled,
I did some experiments and found that, with eevdf enabled, there are more preemptions,
and this preemption could slow down the waker (each waker wakes up 20 wakees in
hackbench). The reason might be that check_preempt_wakeup() preempts the current
task more easily under eevdf:

baseline(no eevdf): sched/fair: Add cfs_rq::avg_vruntime
compare(eevdf): sched/smp: Use lag to simplify cross-runqueue placement

hackbench -g 112 -f 20 --threads -l 30000 -s 100
on a system with 1 socket 56C/112T online

1. Use the following bpf script to track the preemption count
   within 10 seconds:

/* A task switched out while still in TASK_RUNNING was preempted. */
tracepoint:sched:sched_switch
{
        if (args->prev_state == TASK_RUNNING) {
                @preempt = count();
        }
}

baseline:
bpftrace preemption.bt 
Attaching 4 probes...
10:54:45 Preemption count: 
@preempt: 2409638


compare:
bpftrace preemption.bt 
Attaching 4 probes...
10:02:21 Preemption count: 
@preempt: 12147709

There are many more preemptions with eevdf enabled.


2. Add the following schedstat counters to track the ratio of successful preemptions
in check_preempt_wakeup():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..dfd4a6ebdf23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8294,6 +8294,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
+	schedstat_inc(rq->check_preempt_count);
+
 	if (unlikely(se == pse))
 		return;
 
@@ -8358,8 +8360,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		/*
 		 * XXX pick_eevdf(cfs_rq) != se ?
 		 */
-		if (pick_eevdf(cfs_rq) == pse)
+		if (pick_eevdf(cfs_rq) == pse) {
+			if (se->vruntime <= pse->vruntime + sysctl_sched_wakeup_granularity)
+				schedstat_inc(rq->low_gran_preempt_count);
+
 			goto preempt;
+		}
 
 		return;
 	}
@@ -8377,6 +8383,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..58abd3d53f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,9 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+	unsigned int		check_preempt_count;
+	unsigned int		need_preempt_count;
+	unsigned int		low_gran_preempt_count;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..99392cad0c07 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count, rq->need_preempt_count,
+		    rq->low_gran_preempt_count);
 
 		seq_printf(seq, "\n");
 
-- 
2.25.1
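
The 5-second deltas below come from sampling /proc/schedstat twice and subtracting.
A minimal sketch of such a sampler is shown here (illustrative only, not the tool
actually used; it assumes the three new counters are the last three fields of the
cpuN line, as in the seq_printf() change above, and hard-codes CPU8):

/* schedstat_delta.c -- illustrative sketch, build with: gcc -o schedstat_delta schedstat_delta.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void read_cpu8(unsigned long long v[3])
{
	FILE *f = fopen("/proc/schedstat", "r");
	char line[1024];

	v[0] = v[1] = v[2] = 0;
	while (f && fgets(line, sizeof(line), f)) {
		if (strncmp(line, "cpu8 ", 5))
			continue;
		/* skip "cpu8" plus the 9 pre-existing fields, keep the 3 new ones */
		sscanf(line,
		       "%*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %llu %llu %llu",
		       &v[0], &v[1], &v[2]);
		break;
	}
	if (f)
		fclose(f);
}

int main(void)
{
	unsigned long long a[3], b[3];

	read_cpu8(a);
	sleep(5);
	read_cpu8(b);

	printf("check_preempt_count    %llu\n", b[0] - a[0]);
	printf("need_preempt_count     %llu\n", b[1] - a[1]);
	printf("low_gran_preempt_count %llu\n", b[2] - a[2]);
	if (b[0] > a[0])
		printf("preempt success ratio  %.2f%%\n",
		       100.0 * (b[1] - a[1]) / (b[0] - a[0]));
	return 0;
}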

Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 11:02:02 2023              cpu8
.stats.check_preempt_count            51973       <-----
.stats.need_preempt_count             10514       <-----
.stats.rq_cpu_time                   5004068598
.stats.rq_sched_info.pcount           60374
.stats.rq_sched_info.run_delay       80405664582
.stats.sched_count                    60609
.stats.sched_goidle                    227
.stats.ttwu_count                     56250
.stats.ttwu_local                     14619

The preemption success ratio is 10514 / 51973 = 20.23%
-----------------------------------------------------------------------------

With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 10:22:55 2023              cpu8
.stats.check_preempt_count            71673      <----
.stats.low_gran_preempt_count         57410
.stats.need_preempt_count             57413      <----
.stats.rq_cpu_time                   5007778990
.stats.rq_sched_info.pcount          129233
.stats.rq_sched_info.run_delay       164830921362
.stats.sched_count                   129233
.stats.ttwu_count                     70222
.stats.ttwu_local                     66847

The preemption success ratio is 57413 / 71673 = 80.10%

According to low_gran_preempt_count, most of the successful preemptions happen
when current->vruntime is smaller than wakee->vruntime + sysctl_sched_wakeup_granularity,
a case in which the current CFS wakeup_preempt_entity() would not preempt.
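
For comparison, this is roughly the pre-eevdf check in kernel/sched/fair.c
(simplified, comments added): CFS only preempts when current's vruntime is ahead of
the wakee's by more than the load-scaled wakeup granularity, so the cases counted
by low_gran_preempt_count above would not preempt under CFS:

/*
 * Pre-eevdf CFS check (simplified): 'curr' is the running task's se,
 * 'se' is the freshly woken task's se.
 */
static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;	/* curr's vruntime does not exceed the wakee's: no preemption */

	gran = wakeup_gran(se);	/* sysctl_sched_wakeup_granularity scaled by se's weight */
	if (vdiff > gran)
		return 1;	/* curr is far enough behind in vruntime: preempt */

	return 0;
}

pick_eevdf() has no such granularity cut-off: it preempts whenever the woken entity
is the eligible one with the earliest virtual deadline.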

It seems that eevdf does not inhibit wakeup preemption as much as CFS does, and
maybe that is because eevdf needs to consider fairness more?

thanks,
Chenyu
