Date:   Wed, 16 Aug 2023 11:40:33 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     kernel test robot <oliver.sang@...el.com>,
        <oe-lkp@...ts.linux.dev>, <lkp@...el.com>,
        <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        "Ingo Molnar" <mingo@...nel.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>
Subject: Re: [tip:sched/eevdf] [sched/fair]  e0c2ff903c:
 phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On 2023-08-14 at 15:29:35 +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2023 at 10:42:09AM +0800, Chen Yu wrote:
> 
> > Since lkp previously reported that, with the eevdf policy enabled, there was
> > a regression in hackbench, I did some experiments and found that with eevdf
> > enabled there are more preemptions, and this preemption could slow down
> > the waker (each waker wakes up 20 wakees in hackbench). The reason might
> > be that it is easier for check_preempt_wakeup() to preempt the current task in eevdf:
> 
> This is true.
> 
> > Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> > Thu Aug 10 11:02:02 2023              cpu8
> > .stats.check_preempt_count            51973       <-----
> > .stats.need_preempt_count             10514       <-----
> > .stats.rq_cpu_time                   5004068598
> > .stats.rq_sched_info.pcount           60374
> > .stats.rq_sched_info.run_delay       80405664582
> > .stats.sched_count                    60609
> > .stats.sched_goidle                    227
> > .stats.ttwu_count                     56250
> > .stats.ttwu_local                     14619
> > 
> > The preemption success ratio is 10514 / 51973 = 20.23%
> > -----------------------------------------------------------------------------
> > 
> > With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> > Thu Aug 10 10:22:55 2023              cpu8
> > .stats.check_preempt_count            71673      <----
> > .stats.low_gran_preempt_count         57410
> > .stats.need_preempt_count             57413      <----
> > .stats.rq_cpu_time                   5007778990
> > .stats.rq_sched_info.pcount          129233
> > .stats.rq_sched_info.run_delay       164830921362
> > .stats.sched_count                   129233
> > .stats.ttwu_count                     70222
> > .stats.ttwu_local                     66847
> > 
> > The preemption success ratio is 57413 / 71673 = 80.10%
> 
> note: wakeup-preemption
> 
> > According to the low_gran_preempt_count, most successful preemptions happen
> > when current->vruntime is smaller than wakee->vruntime + sysctl_sched_wakeup_granularity,
> > a case in which the current cfs's wakeup_preempt_entity() would not preempt.
> > 
> > It seems that eevdf does not inhibit the wakeup preemption as much as cfs does;
> > maybe that is because eevdf needs to consider fairness more?
> 
> Not fairness, latency. Because it wants to honour the virtual deadline.
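
For reference, the cfs-side check I referred to above is roughly the
following (a simplified sketch of the granularity test, not the exact
wakeup_preempt_entity() code):

/*
 * Simplified sketch of the cfs wakeup granularity test: the waking task
 * only preempts when it is ahead of the current task's vruntime by more
 * than the wakeup granularity.
 */
static inline int cfs_gran_preempt(long long curr_vruntime,
				   long long wakee_vruntime,
				   long long wakeup_gran)
{
	long long vdiff = curr_vruntime - wakee_vruntime;

	if (vdiff <= 0)
		return 0;	/* wakee is not ahead of curr: never preempt */

	return vdiff > wakeup_gran;	/* preempt only when ahead by more than gran */
}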
> 
> 
> Are these wakeup preemptions typically on runqueues that have only a
> single other task?
>

There are 112 groups of hackbench wakers/wakees on a 112-CPU system.
Each group has 1 waker and 20 wakees. It seems that there is more than one
other task on each runqueue. I collected the statistics below.
 
> That is, consider a single task running; then avg_vruntime will be its
> vruntime, because the average of one variable must be the value of that
> one variable.
> 
> Then the moment a second task joins, we get two options:
> 
>  - positive lag
>  - negative lag
> 
> When the new task has negative lag, it gets placed to the right of the
> currently running task (and avg_vruntime has a forward discontinuity).
> At this point the new task is not eligible and does not get to run.
> 
> When the new task has positive lag, it gets placed to the left of the
> currently running task (and avg_vruntime has a backward discontinuity).
> At this point the currently running task is no longer eligible, and the
> new task must be selected -- irrespective of its deadline.
> 
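
To make sure I follow the eligibility rule described above, here is a
minimal sketch of my understanding (the simple_* names are only for
illustration; this is not the kernel's avg_vruntime()/entity_eligible()
code):

/*
 * Minimal sketch of EEVDF eligibility: a task is eligible iff its lag
 * (V - vruntime) is non-negative, where V is the load-weighted average
 * vruntime of the queued entities.  A newly placed task with positive
 * lag pulls V backwards and can make the current task ineligible; the
 * virtual deadline (roughly vruntime plus the request scaled by the
 * inverse weight) only orders the eligible tasks.
 */
struct simple_entity {
	unsigned long	weight;
	long long	vruntime;
};

/* Load-weighted average vruntime V over all queued entities. */
static long long simple_avg_vruntime(const struct simple_entity *e, int n)
{
	long long sum = 0, wsum = 0;
	int i;

	for (i = 0; i < n; i++) {
		sum  += (long long)e[i].weight * e[i].vruntime;
		wsum += e[i].weight;
	}
	return wsum ? sum / wsum : 0;
}

/* Eligible iff vruntime is not past the weighted average. */
static int simple_eligible(const struct simple_entity *se,
			   const struct simple_entity *all, int n)
{
	return se->vruntime <= simple_avg_vruntime(all, n);
}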
> The paper doesn't (AFAIR) consider the case of wake-up-preemption
> explicitly. It only considers task selection and vruntime placement.
> 
> One option I suppose would be to gate the wakeup preemption by virtual
> deadline, only allowing it when the new task has an earlier deadline than the
> currently running one, and otherwise relying on tick preemption.
> 
> NOTE: poking at wakeup preemption is a double-edged sword: some
> workloads love it, some hate it. Touching it is bound to upset the
> balance -- again.
> 
> (also, did I get that the right way around? -- I've got a Monday brain
> that isn't willing to boot properly)
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe5be91c71c7..16d24e5dda8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8047,6 +8047,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  	cfs_rq = cfs_rq_of(se);
>  	update_curr(cfs_rq);
>  
> +	if (sched_feat(WAKEUP_DEADLINE)) {
> +		/*
> +		 * Only allow preemption if the virtual deadline of the new
> +		 * task is before the virtual deadline of the existing task.
> +		 */
> +		if (deadline_gt(deadline, pse, se))
> +			return;
> +	}
> +
>  	/*
>  	 * XXX pick_eevdf(cfs_rq) != se ?
>  	 */
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 61bcbf5e46a4..e733981b32aa 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   * Allow wakeup-time preemption of the current task:
>   */
>  SCHED_FEAT(WAKEUP_PREEMPTION, true)
> +SCHED_FEAT(WAKEUP_DEADLINE, true)
>  
>  SCHED_FEAT(HRTICK, false)
>  SCHED_FEAT(HRTICK_DL, false)

Added more schedstats to track how often, when the wakee succeeds in
preempting the current task:

1. curr's deadline is smaller than the wakee's deadline
2. curr is not eligible
3. the cfs_rq->nr_running <= 2

Also applied Mike's RUN_TO_PARITY patch, with a minor change to only
apply RUN_TO_PARITY for wakeup:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..da7260ddd7e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -931,14 +931,20 @@ static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, int wake)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
 
-	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) {
 		curr = NULL;
+		if (wake)
+			schedstat_inc(cfs_rq->rq->curr_noeli_preempt_count);
+	}
+
+	if (sched_feat(RUN_TO_PARITY) && wake && curr)
+		return curr;
 
 	while (node) {
 		struct sched_entity *se = __node_2_se(node);
@@ -5441,7 +5447,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
 			return cfs_rq->next;
 
-		return pick_eevdf(cfs_rq);
+		return pick_eevdf(cfs_rq, 0);
 	}
 
 	se = left = pick_cfs(cfs_rq, curr);
@@ -8294,6 +8300,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
+	schedstat_inc(rq->check_preempt_count);
+
 	if (unlikely(se == pse))
 		return;
 
@@ -8358,8 +8366,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		/*
 		 * XXX pick_eevdf(cfs_rq) != se ?
 		 */
-		if (pick_eevdf(cfs_rq) == pse)
+		if (pick_eevdf(cfs_rq, 1) == pse) {
+			if (cfs_rq->curr->vruntime <
+			    pse->vruntime + sysctl_sched_wakeup_granularity) {
+				schedstat_inc(rq->curr_lowgran_preempt_count);
+			}
+			if (cfs_rq->curr->deadline < pse->deadline) {
+				schedstat_inc(rq->curr_lowdl_preempt_count);
+			}
+			if (cfs_rq->nr_running <= 2) {
+				schedstat_inc(rq->curr_lownr_preempt_count);
+			}
 			goto preempt;
+		}
 
 		return;
 	}
@@ -8377,6 +8396,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 2a830eccda3e..d021dd0e0226 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,7 @@
  */
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+SCHED_FEAT(RUN_TO_PARITY, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..b6e245d2bba5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+	unsigned int		check_preempt_count;
+	unsigned int		need_preempt_count;
+	unsigned int		curr_lowgran_preempt_count;
+	unsigned int		curr_lowdl_preempt_count;
+	unsigned int		curr_noeli_preempt_count;
+	unsigned int		curr_lownr_preempt_count;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..e7f02e7bfff2 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,16 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count, rq->need_preempt_count, rq->curr_lowgran_preempt_count,
+		    rq->curr_lowdl_preempt_count,
+		    rq->curr_noeli_preempt_count,
+		    rq->curr_lownr_preempt_count);
 
 		seq_printf(seq, "\n");
 
-- 
2.25.1




NO_RUN_TO_PARITY:

Tue Aug 15 19:16:30 2023              cpu8  
.stats.check_preempt_count            85754 
.stats.curr_lowdl_preempt_count       5189  
.stats.curr_lowgran_preempt_count     70063
.stats.curr_lownr_preempt_count        31   
.stats.curr_noeli_preempt_count       30252 
.stats.need_preempt_count             70066 
.stats.rq_cpu_time                   5032932195
.stats.rq_sched_info.pcount          155748 
.stats.rq_sched_info.run_delay       159730175645
.stats.sched_count                   155748 
.stats.ttwu_count                     84568 
.stats.ttwu_local                     81029 


When the wakee succeeds in preempting the current task:

1. curr's deadline is smaller than wakee's deadline:
   5189 / 70066 = 7.41%

2. curr is not eligible
   30252 / 70066 = 43.18%

3. the cfs_rq->nr_running <= 2
   31 / 70066 = 0.04%

According to the above data, comparing the deadlines may not have much
effect in this case (per 1). The scenario of only one other task on the
runqueue seldom happens (per 3). Considering the current task's eligibility
without checking the deadline might help (per 2: in 70066 - 30252 = 39814
cases the current task is still eligible but gets preempted anyway, and
Mike's patch addresses exactly that).


RUN_TO_PARITY:

Tue Aug 15 19:19:07 2023           cpu8  
.stats.check_preempt_count         20147 
.stats.curr_lowdl_preempt_count     334  
.stats.curr_lowgran_preempt_count  4637  
.stats.curr_lownr_preempt_count      8   
.stats.curr_noeli_preempt_count    5451  
.stats.need_preempt_count          4645  
.stats.rq_cpu_time                5052091287
.stats.rq_sched_info.pcount        25486 
.stats.rq_sched_info.run_delay    166261574881
.stats.sched_count                 25486 
.stats.ttwu_count                  20146 
.stats.ttwu_local                  20146 

With Mike's change, the wakeup preemption success ratio drops
to 4645 / 20147 = 23.06%, which is very close to the cfs wakeup preemption
ratio of 20.23%. And the throughput has increased (the hackbench run time
drops from 117.487 to 101.285 seconds, about 13.8% less):

echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features 
hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 117.487

echo RUN_TO_PARITY > /sys/kernel/debug/sched/features 
hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 101.285  <--- lower is better

Also tested schbench with and without RUN_TO_PARITY; not much difference
(especially in wakeup latency) is observed:

echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
./schbench -m 14 -t 16 -r 100
Wakeup Latencies percentiles (usec) runtime 100 (s) (941339 total samples)
	  50.0th: 4          (53382 samples)
	  90.0th: 3772       (353480 samples)
	* 99.0th: 7448       (84115 samples)
	  99.9th: 7976       (8244 samples)
	  min=1, max=15186
Request Latencies percentiles (usec) runtime 100 (s) (941598 total samples)
	  50.0th: 23264      (283891 samples)
	  90.0th: 32672      (375363 samples)
	* 99.0th: 52288      (83963 samples)
	  99.9th: 95360      (8449 samples)
	  min=4487, max=241201
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 9264       (22 samples)
	* 50.0th: 9392       (33 samples)
	  90.0th: 9584       (36 samples)
	  min=9055, max=10176
average rps: 9415.98

echo RUN_TO_PARITY > /sys/kernel/debug/sched/features
./schbench -m 14 -t 16 -r 100
Wakeup Latencies percentiles (usec) runtime 100 (s) (943005 total samples)
	  50.0th: 4          (53961 samples)
	  90.0th: 3772       (366473 samples)
	* 99.0th: 7496       (84902 samples)
	  99.9th: 9296       (8441 samples)
	  min=1, max=20467
Request Latencies percentiles (usec) runtime 100 (s) (943206 total samples)
	  50.0th: 23072      (283181 samples)
	  90.0th: 32096      (376473 samples)
	* 99.0th: 53568      (84586 samples)
	  99.9th: 99456      (8443 samples)
	  min=4078, max=219330
RPS percentiles (requests) runtime 100 (s) (101 total samples)
	  20.0th: 9296       (22 samples)
	* 50.0th: 9392       (34 samples)
	  90.0th: 9584       (38 samples)
	  min=9065, max=10188
average rps: 9432.06


thanks,
Chenyu
