Date:   Mon, 14 Aug 2023 15:29:35 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
        lkp@...el.com, linux-kernel@...r.kernel.org, x86@...nel.org,
        Ingo Molnar <mingo@...nel.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>
Subject: Re: [tip:sched/eevdf] [sched/fair]  e0c2ff903c:
 phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On Fri, Aug 11, 2023 at 10:42:09AM +0800, Chen Yu wrote:

> Since lkp previously reported a hackbench regression with the eevdf policy
> enabled, I ran some experiments and found that with eevdf enabled there are
> more preemptions, and that these preemptions can slow down the waker (each
> waker can wake up 20 wakees in hackbench). The reason might be that
> check_preempt_wakeup() preempts the current task more readily under eevdf:

This is true.

> Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 11:02:02 2023              cpu8
> .stats.check_preempt_count            51973       <-----
> .stats.need_preempt_count             10514       <-----
> .stats.rq_cpu_time                   5004068598
> .stats.rq_sched_info.pcount           60374
> .stats.rq_sched_info.run_delay       80405664582
> .stats.sched_count                    60609
> .stats.sched_goidle                    227
> .stats.ttwu_count                     56250
> .stats.ttwu_local                     14619
> 
> The preemption success ratio is 10514 / 51973 = 20.23%
> -----------------------------------------------------------------------------
> 
> With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 10:22:55 2023              cpu8
> .stats.check_preempt_count            71673      <----
> .stats.low_gran_preempt_count         57410
> .stats.need_preempt_count             57413      <----
> .stats.rq_cpu_time                   5007778990
> .stats.rq_sched_info.pcount          129233
> .stats.rq_sched_info.run_delay       164830921362
> .stats.sched_count                   129233
> .stats.ttwu_count                     70222
> .stats.ttwu_local                     66847
> 
> The preemption success ratio is 57413 / 71673 = 80.10%

Note: that is the wakeup-preemption success ratio.

> According to low_gran_preempt_count, most successful preemptions happen
> when current->vruntime is smaller than wakee->vruntime + sysctl_sched_wakeup_granularity,
> which would not happen in the current cfs wakeup_preempt_entity().
> 
> It seems that eevdf does not inhibit wakeup preemption as much as cfs does;
> maybe that is because eevdf needs to consider fairness more?

Not fairness, latency. Because it wants to honour the virtual deadline.


Are these wakeup preemptions typically on runqueues that have only a
single other task?

That is, consider a single task running; then avg_vruntime will be its
vruntime, because the average of one variable must be the value of that
one variable.

Then the moment a second task joins, we get two options:

 - positive lag
 - negative lag

When the new task has negative lag, it gets placed to the right of the
currently running task (and avg_vruntime has a forward discontinuity).
At this point the new task is not eligible and does not get to run.

When the new task has positive lag, it gets placed to the left of the
currently running task (and avg_vruntime has a backward discontinuity).
At this point the currently running task is no longer eligible, and the
new task must be selected -- irrespective of its deadline.
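
To make the two cases concrete, here is a minimal user-space sketch, not
the kernel code. It assumes both tasks have equal weight, so avg_vruntime
reduces to a plain mean, and it treats a task as eligible when its lag
(avg_vruntime minus its own vruntime) is non-negative. The names toy_se,
toy_avg_vruntime() and toy_eligible() are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy entity; real sched_entity carries weight, deadline, etc. */
struct toy_se {
	int64_t vruntime;
};

/*
 * Plain mean of the vruntimes of n entities.
 * Assumption: all entities have equal weight, so no load weighting is needed.
 */
static int64_t toy_avg_vruntime(const struct toy_se *se, int n)
{
	int64_t sum = 0;

	assert(n > 0);
	for (int i = 0; i < n; i++)
		sum += se[i].vruntime;
	return sum / n;
}

/* Eligible when lag = avg - vruntime is zero or positive. */
static int toy_eligible(const struct toy_se *se, int64_t avg)
{
	return avg - se->vruntime >= 0;
}
```

With the running task at vruntime 1000 and a newcomer placed at 900
(positive lag), the mean drops to 950: the running task now has lag -50
and is not eligible, while the newcomer has lag +50 and is. With the
newcomer placed at 1100 (negative lag) the mean rises to 1050 and the
roles are reversed.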

The paper doesn't (AFAIR) consider the case of wake-up-preemption
explicitly. It only considers task selection and vruntime placement.

One option I suppose would be to gate the wakeup preemption by virtual
deadline, only allow when the new task has an earlier deadline than the
currently running one, and otherwise rely on tick preemption.
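
As a sanity check on the direction of that comparison, a stand-alone
sketch of the gate might look like the following. It is an illustration
only: toy_task, toy_deadline_after() and toy_wakeup_may_preempt() are
made-up names, and the signed-difference comparison is an assumption
about how one would keep the test safe across counter wrap-around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduling entity; only the deadline matters. */
struct toy_task {
	uint64_t deadline;	/* virtual deadline, may wrap */
};

/* True when a's deadline is strictly later than b's (wrap-safe). */
static bool toy_deadline_after(const struct toy_task *a,
			       const struct toy_task *b)
{
	return (int64_t)(a->deadline - b->deadline) > 0;
}

/*
 * Gate for wakeup preemption: refuse when the wakee's deadline is later
 * than the current task's. Equal deadlines pass the gate, since only a
 * strictly later deadline is rejected.
 */
static bool toy_wakeup_may_preempt(const struct toy_task *curr,
				   const struct toy_task *wakee)
{
	return !toy_deadline_after(wakee, curr);
}
```

A wakee with deadline 1500 passes the gate against a current task with
deadline 2000; a wakee with deadline 2500 does not and would have to
wait for tick preemption.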

NOTE: poking at wakeup preemption is a double-edged sword; some
workloads love it, some hate it. Touching it is bound to upset the
balance -- again.

(also, did I get that the right way around? -- I've got a Monday brain
that isn't willing to boot properly)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe5be91c71c7..16d24e5dda8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8047,6 +8047,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);
 
+	if (sched_feat(WAKEUP_DEADLINE)) {
+		/*
+		 * Only allow preemption if the virtual deadline of the new
+		 * task is before the virtual deadline of the existing task.
+		 */
+		if (deadline_gt(deadline, pse, se))
+			return;
+	}
+
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 61bcbf5e46a4..e733981b32aa 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  * Allow wakeup-time preemption of the current task:
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
+SCHED_FEAT(WAKEUP_DEADLINE, true)
 
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
