Message-ID: <20250714134429.19624-3-mgorman@techsingularity.net>
Date: Mon, 14 Jul 2025 14:44:29 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: linux-kernel@...r.kernel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Mel Gorman <mgorman@...hsingularity.net>
Subject: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
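
The selection rule itself is simple: an existing ->next buddy is kept
unless the new wakee's virtual deadline is sooner. The following is a
minimal user-space sketch of that rule only, not the kernel code;
struct entity, entity_before() and pick_buddy() here are simplified
stand-ins for the sched_entity helpers:

  #include <stdbool.h>
  #include <stdio.h>

  struct entity {
          long long deadline;     /* virtual deadline; smaller expires sooner */
  };

  /* Simplified entity_before(): true if a's deadline expires before b's. */
  static bool entity_before(const struct entity *a, const struct entity *b)
  {
          return a->deadline < b->deadline;
  }

  /* Keep the current buddy unless the new wakee's deadline is sooner. */
  static const struct entity *pick_buddy(const struct entity *buddy,
                                         const struct entity *wakee)
  {
          if (buddy && entity_before(buddy, wakee))
                  return buddy;
          return wakee;
  }

  int main(void)
  {
          struct entity cur = { .deadline = 100 }, wakee = { .deadline = 50 };

          /* Prints 50: the wakee's deadline is the earliest. */
          printf("%lld\n", pick_buddy(&cur, &wakee)->deadline);
          return 0;
  }
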
Sync wakeups are treated differently from every other type of wakeup.
WF_SYNC is a promise that the waker will sleep in the very near future.
That promise is violated in enough cases that WF_SYNC should be treated
as a mild suggestion instead of a hard rule. If a waker does go to sleep
almost immediately then the delay in wakeup is negligible. In all other
cases, preemption is throttled based on the accumulated runtime of the
waker so there is a chance that some batched wakeups have been issued
before preemption.
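
To make the throttle concrete, the check reduces to comparing the
waker's accumulated runtime against a fraction of
sysctl_sched_migration_cost. Below is a compilable sketch assuming the
threshold used in this patch (sysctl_sched_migration_cost >> 6) and the
default 0.5ms sysctl value; the standalone types and the helper name
are illustrative, not kernel API:

  #include <stdbool.h>
  #include <stdio.h>

  typedef long long s64;

  /* Default sysctl_sched_migration_cost is 500000ns (0.5ms). */
  static const s64 sched_migration_cost_ns = 500000;

  /*
   * A sync wakee with the sooner deadline preempts only once the waker
   * has run long enough that some batched wakeups have plausibly been
   * issued; otherwise the wakeup is delayed until the waker sleeps.
   */
  static bool sync_preempt_allowed(s64 waker_runtime_ns)
  {
          return waker_runtime_ns >= (sched_migration_cost_ns >> 6);
  }

  int main(void)
  {
          /* Threshold is ~7.8us; 1us of waker runtime is not yet enough. */
          printf("%d\n", sync_preempt_allowed(1000));
          return 0;
  }
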
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
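
Expressed as a standalone decision helper (the enum mirrors a subset of
the one introduced in the diff below; nonsync_action() and its boolean
parameters are simplified for illustration):

  #include <stdbool.h>
  #include <stdio.h>

  enum preempt_buddy_action {
          PREEMPT_BUDDY_NONE,             /* no action on the buddy */
          PREEMPT_BUDDY_NEXT,             /* mark ->next, recheck at pick time */
          PREEMPT_BUDDY_IMMEDIATE,        /* drop slice protection and resched */
  };

  /* Non-sync wakeup: preempt only if the wakee is both sooner and eligible. */
  static enum preempt_buddy_action
  nonsync_action(bool wakee_deadline_sooner, bool wakee_eligible)
  {
          if (wakee_deadline_sooner && wakee_eligible)
                  return PREEMPT_BUDDY_IMMEDIATE;
          return PREEMPT_BUDDY_NEXT;
  }

  int main(void)
  {
          /* Prints 2 (PREEMPT_BUDDY_IMMEDIATE). */
          printf("%d\n", nonsync_action(true, true));
          return 0;
  }
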
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are at opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data, even though it is a terrible metric
for dbench, because it is the default one reported. The test machine is
a 2-socket Cascade Lake machine and the backing filesystem is XFS, as a
lot of the IO work is dispatched to kernel threads. It is important to
note that these results are not representative across all machines,
especially Zen machines, as different bottlenecks are exposed on
different machines and filesystems.

dbench4 Throughput (misleading but traditional)
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Hmean 1 1286.83 ( 0.00%) 1281.73 ( -0.40%)
Hmean 4 4017.50 ( 0.00%) 3934.85 * -2.06%*
Hmean 7 5536.45 ( 0.00%) 5453.55 * -1.50%*
Hmean 12 7251.59 ( 0.00%) 7217.25 ( -0.47%)
Hmean 21 8957.92 ( 0.00%) 9188.07 ( 2.57%)
Hmean 30 9403.41 ( 0.00%) 10523.72 * 11.91%*
Hmean 48 9320.12 ( 0.00%) 11496.27 * 23.35%*
Hmean 79 8962.30 ( 0.00%) 11555.71 * 28.94%*
Hmean 110 8066.52 ( 0.00%) 11307.26 * 40.18%*
Hmean 141 7605.20 ( 0.00%) 10622.52 * 39.67%*
Hmean 160 7422.56 ( 0.00%) 10250.78 * 38.10%*

As throughput is misleading, the benchmark was modified to use a short
loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 14.35 ( 0.00%) 14.27 ( 0.57%)
Amean 4 18.58 ( 0.00%) 19.01 ( -2.35%)
Amean 7 23.83 ( 0.00%) 24.18 ( -1.48%)
Amean 12 31.59 ( 0.00%) 31.77 ( -0.55%)
Amean 21 44.65 ( 0.00%) 43.44 ( 2.71%)
Amean 30 60.73 ( 0.00%) 54.21 ( 10.74%)
Amean 48 98.25 ( 0.00%) 79.41 ( 19.17%)
Amean 79 168.34 ( 0.00%) 130.06 ( 22.74%)
Amean 110 261.03 ( 0.00%) 185.04 ( 29.11%)
Amean 141 353.98 ( 0.00%) 251.55 ( 28.94%)
Amean 160 410.66 ( 0.00%) 296.87 ( 27.71%)
Stddev 1 0.51 ( 0.00%) 0.48 ( 6.67%)
Stddev 4 1.14 ( 0.00%) 1.21 ( -6.78%)
Stddev 7 1.63 ( 0.00%) 1.58 ( 3.12%)
Stddev 12 2.62 ( 0.00%) 2.38 ( 9.05%)
Stddev 21 5.21 ( 0.00%) 3.87 ( 25.70%)
Stddev 30 10.03 ( 0.00%) 6.65 ( 33.65%)
Stddev 48 22.31 ( 0.00%) 12.26 ( 45.05%)
Stddev 79 41.14 ( 0.00%) 29.11 ( 29.25%)
Stddev 110 70.55 ( 0.00%) 47.71 ( 32.38%)
Stddev 141 98.12 ( 0.00%) 66.83 ( 31.89%)
Stddev 160 139.37 ( 0.00%) 67.73 ( 51.40%)

That still looks good and the variance is reduced quite a bit.

Finally, fairness is a concern, so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This one
is tricky because dbench makes no effort to synchronise clients, so the
durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes, but it is a matter of opinion whether that
counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%)
Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%)
Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%)
Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%)
Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%)
Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%)
Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%)
Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%)
Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%)
Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%)
Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%)
Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%)
Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%)
Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%)
Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%)
Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%)
Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%)
Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%)
Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%)
Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%)
Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%)
Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%)

This is more of a mixed bag but it at least shows that fairness is not
crippled.

The hackbench results are more neutral, but they still matter because
it is possible to boost the dbench figures by a large amount, only at
the cost of crippling the performance of a workload like hackbench.

hackbench-process-pipes
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 0.2183 ( 0.00%) 0.2223 ( -1.83%)
Amean 4 0.5780 ( 0.00%) 0.5413 ( 6.34%)
Amean 7 0.7727 ( 0.00%) 0.7093 ( 8.20%)
Amean 12 1.1220 ( 0.00%) 1.1170 ( 0.45%)
Amean 21 1.7470 ( 0.00%) 1.7713 ( -1.39%)
Amean 30 2.2940 ( 0.00%) 2.6957 * -17.51%*
Amean 48 3.7337 ( 0.00%) 4.1003 * -9.82%*
Amean 79 4.9310 ( 0.00%) 5.1417 * -4.27%*
Amean 110 6.1800 ( 0.00%) 6.5370 * -5.78%*
Amean 141 7.5737 ( 0.00%) 8.0060 * -5.71%*
Amean 172 9.0820 ( 0.00%) 9.4767 * -4.35%*
Amean 203 10.6053 ( 0.00%) 10.8870 ( -2.66%)
Amean 234 12.3380 ( 0.00%) 13.1290 * -6.41%*
Amean 265 14.5900 ( 0.00%) 15.3547 * -5.24%*
Amean 296 16.1937 ( 0.00%) 17.1533 * -5.93%*

Processes using pipes are impacted, and it is outside the noise as the
coefficient of variance is roughly 3%. These results are not always
reproducible; if executed across multiple reboots, the comparison may
show neutral results or small gains, so the worst measured results are
presented.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms differ between sockets and pipes.

hackbench-process-sockets
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 0.3217 ( 0.00%) 0.3053 ( 5.08%)
Amean 4 0.8967 ( 0.00%) 0.9007 ( -0.45%)
Amean 7 1.4780 ( 0.00%) 1.5067 ( -1.94%)
Amean 12 2.1977 ( 0.00%) 2.2693 ( -3.26%)
Amean 21 3.4983 ( 0.00%) 3.6667 * -4.81%*
Amean 30 4.9270 ( 0.00%) 5.1207 * -3.93%*
Amean 48 7.6250 ( 0.00%) 7.9667 * -4.48%*
Amean 79 15.7477 ( 0.00%) 15.4177 ( 2.10%)
Amean 110 21.8070 ( 0.00%) 21.9563 ( -0.68%)
Amean 141 29.4813 ( 0.00%) 29.2327 ( 0.84%)
Amean 172 36.7433 ( 0.00%) 35.9043 ( 2.28%)
Amean 203 40.8823 ( 0.00%) 40.3467 ( 1.31%)
Amean 234 43.1627 ( 0.00%) 43.0343 ( 0.30%)
Amean 265 49.6417 ( 0.00%) 49.9030 ( -0.53%)
Amean 296 51.3137 ( 0.00%) 51.9310 ( -1.20%)

At the time of writing, other tests are still running, but most are
either neutral or show relatively small gains. In general, the other
workloads are less wakeup-intensive than dbench or hackbench.

Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
---
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..62fa036b0c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -936,6 +936,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1205,6 +1215,83 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
return false;
}
+enum preempt_buddy_action {
+ PREEMPT_BUDDY_NONE, /* No action on the buddy */
+ PREEMPT_BUDDY_NEXT, /* Check next is most eligible
+ * before rescheduling.
+ */
+ PREEMPT_BUDDY_RESCHED, /* Plain reschedule */
+ PREEMPT_BUDDY_IMMEDIATE /* Remove slice protection
+ * and reschedule
+ */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_buddy_action
+do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se,
+ s64 delta, bool did_short)
+{
+ bool pse_before, pse_eligible;
+
+ if (!sched_feat(NEXT_BUDDY) ||
+ (wake_flags & WF_FORK) ||
+ (pse->sched_delayed)) {
+ BUILD_BUG_ON(PREEMPT_BUDDY_NONE + 1 != PREEMPT_BUDDY_NEXT);
+ return PREEMPT_BUDDY_NONE + did_short;
+ }
+
+ /* Reschedule if waker is no longer eligible */
+ if (!entity_eligible(cfs_rq, se))
+ return PREEMPT_BUDDY_RESCHED;
+
+ /* Keep existing buddy if the deadline is sooner than pse */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return PREEMPT_BUDDY_NONE;
+
+ set_next_buddy(pse);
+ pse_before = entity_before(pse, se);
+ pse_eligible = entity_eligible(cfs_rq, pse);
+
+ /*
+ * WF_SYNC implies that waker will sleep soon but it is not enforced
+ * because the hint is often abused or misunderstood.
+ */
+ if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU. Only consider reschedule if pse deadline expires
+ * before se.
+ */
+ if ((wake_flags & WF_RQ_SELECTED) &&
+ delta < sysctl_sched_migration_cost) {
+
+ if (!pse_before)
+ return PREEMPT_BUDDY_NONE;
+
+ /* Fall through to pse deadline. */
+ }
+
+ /*
+ * Reschedule if pse's deadline is sooner and there is a chance
+ * that some wakeup batching has completed.
+ */
+ if (pse_before &&
+ delta >= (sysctl_sched_migration_cost >> 6)) {
+ return PREEMPT_BUDDY_IMMEDIATE;
+ }
+
+ return PREEMPT_BUDDY_NONE;
+ }
+
+ /* Check eligibility of buddy to start now. */
+ if (pse_before && pse_eligible)
+ return PREEMPT_BUDDY_IMMEDIATE;
+
+ return PREEMPT_BUDDY_NEXT;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -5589,16 +5676,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7056,8 +7133,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8767,6 +8842,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ bool did_short;
+ s64 delta;
if (unlikely(se == pse))
return;
@@ -8780,10 +8857,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8829,6 +8902,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;
cfs_rq = cfs_rq_of(se);
+ delta = rq_clock_task(rq) - se->exec_start;
update_curr(cfs_rq);
/*
* If @p has a shorter slice than current and @p is eligible, override
@@ -8837,9 +8911,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* Note that even if @p does not turn out to be the most eligible
* task at this moment, current's slice protection will be lost.
*/
- if (do_preempt_short(cfs_rq, pse, se))
+ did_short = do_preempt_short(cfs_rq, pse, se);
+ if (did_short)
cancel_protect_slice(se);
+ switch (do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se, delta, did_short)) {
+ case PREEMPT_BUDDY_NONE:
+ return;
+ case PREEMPT_BUDDY_IMMEDIATE:
+ cancel_protect_slice(se);
+ fallthrough;
+ case PREEMPT_BUDDY_RESCHED:
+ goto preempt;
+ case PREEMPT_BUDDY_NEXT:
+ break;
+ }
+
/*
* If @p has become the most eligible task, force preemption.
*/
--
2.43.0