Message-ID: <20190612163345.GB26997@sinkpad>
Date: Wed, 12 Jun 2019 12:33:45 -0400
From: Julien Desfossez <jdesfossez@...italocean.com>
To: Aaron Lu <aaron.lu@...ux.alibaba.com>
Cc: Aubrey Li <aubrey.intel@...il.com>,
Vineeth Remanan Pillai <vpillai@...italocean.com>,
Nishanth Aravamudan <naravamudan@...italocean.com>,
Peter Zijlstra <peterz@...radead.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Paul Turner <pjt@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Subhra Mazumdar <subhra.mazumdar@...cle.com>,
Frédéric Weisbecker <fweisbec@...il.com>,
Kees Cook <keescook@...omium.org>,
Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
Valentin Schneider <valentin.schneider@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3
After reading more traces and trying to understand why only untagged
tasks starve when cpu-intensive tasks are running on the same set of
CPUs, we noticed a difference in behavior in ‘pick_task’. When
‘core_cookie’ is 0, we are supposed to prefer the tagged task only if
its priority is higher, but when the priorities are equal we prefer it
as well, which causes the starvation. ‘pick_task’ is biased toward
selecting its first parameter in case of equality, which in this case
was ‘class_pick’ instead of ‘max’. Reversing the order of the
parameters solves this issue and matches the expected behavior.
So we can get rid of this vruntime_boost concept.
We have tested the fix below and it seems to work well with
tagged/untagged tasks.
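To illustrate the tie-break described above, here is a minimal user-space
sketch. It is not kernel code: ‘struct toy_task’, ‘toy_prio_less’ and the
two ‘reject_*’ helpers are hypothetical stand-ins, and the "lower value
means higher priority" rule is an assumption made only for this example.
It models the condition under which the tagged ‘class_pick’ is dropped in
favor of idle, before and with the fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's task_struct. */
struct toy_task {
	int prio;                  /* toy rule: lower value = higher priority */
	unsigned long core_cookie; /* 0 means the task is untagged */
};

/* Toy model of prio_less(a, b): true only if a is strictly lower priority. */
static bool toy_prio_less(const struct toy_task *a, const struct toy_task *b)
{
	return a->prio > b->prio;
}

/*
 * Condition before the fix: the tagged class_pick is rejected only when it
 * is strictly lower priority than max, so it wins on equal priority.
 */
static bool reject_before_fix(const struct toy_task *class_pick,
			      const struct toy_task *max)
{
	return max && class_pick->core_cookie &&
	       toy_prio_less(class_pick, max);
}

/*
 * Condition with the fix: the tagged class_pick is rejected unless max is
 * strictly lower priority, so max wins on equal priority.
 */
static bool reject_with_fix(const struct toy_task *class_pick,
			    const struct toy_task *max)
{
	return max && class_pick->core_cookie &&
	       !toy_prio_less(max, class_pick);
}
```

With a tagged ‘class_pick’ and an untagged ‘max’ of equal priority, the
first helper keeps ‘class_pick’ while the second one rejects it; for
strictly different priorities both helpers agree.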
Here are our initial test results. When core scheduling is enabled,
each VM (and its associated vhost threads) is in its own cgroup/tag.
One 12-vcpu VM running a MySQL TPC-C benchmark (IO + CPU) with 96
mostly-idle 1-vcpu VMs on each NUMA node (72 logical CPUs total with
SMT on):
+-------------+----------+--------------+------------+--------+
|             | baseline | coresched    | coresched  | nosmt  |
|             | no tag   | VMs tagged   | VMs tagged | no tag |
|             | v5.1.5   | no stall fix | stall fix  |        |
+-------------+----------+--------------+------------+--------+
|average TPS  | 1474     | 1289         | 1264       | 1339   |
|stdev        | 48       | 12           | 17         | 24     |
|overhead     | N/A      | -12%         | -14%       | -9%    |
+-------------+----------+--------------+------------+--------+
Three 12-vcpu VMs running linpack (cpu-intensive), all pinned on the
same NUMA node (36 logical CPUs with SMT enabled on that NUMA node):
+---------------+----------+--------------+-----------+--------+
|               | baseline | coresched    | coresched | nosmt  |
|               | no tag   | VMs tagged   | VMs tagged| no tag |
|               | v5.1.5   | no stall fix | stall fix |        |
+---------------+----------+--------------+-----------+--------+
|average gflops | 177.9    | 171.3        | 172.7     | 81.9   |
|stdev          | 2.6      | 10.6         | 6.4       | 8.1    |
|overhead       | N/A      | -3.7%        | -2.9%     | -53.9% |
+---------------+----------+--------------+-----------+--------+
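For anyone reproducing the "overhead" rows, a small sketch of the
arithmetic follows. It assumes the overhead is the relative change of
each average versus the baseline column; that formula is not spelled out
in this message, and ‘overhead_pct’ is a hypothetical helper name.
Depending on rounding, the result may differ from the table in the last
digit.

```c
/*
 * Relative change of a measured average versus the baseline, in percent.
 * A negative result means the measured configuration is slower.
 * Assumption: this is how the "overhead" rows were derived.
 */
static double overhead_pct(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}
```

For example, overhead_pct(177.9, 171.3) corresponds to the "no stall
fix" linpack column and overhead_pct(177.9, 172.7) to the "stall fix"
column.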
This fix can be toggled dynamically with the ‘CORESCHED_STALL_FIX’
sched_feature so it’s easy to test before/after (it is disabled by
default).
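As a rough sketch of how the toggle could be driven from a test harness:
scheduler features on kernels of this vintage are normally switched by
writing the feature name, or the name prefixed with "NO_", to the
sched_features file in debugfs. The helpers below are hypothetical and
assume debugfs is mounted at /sys/kernel/debug, the kernel is built with
CONFIG_SCHED_DEBUG, and the caller runs as root.

```c
#include <stdio.h>

/*
 * Build the token to write: "NAME" enables a feature, "NO_NAME" disables
 * it. Returns 0 on success, -1 if the buffer is too small.
 */
static int feature_token(char *buf, size_t len, const char *name, int enable)
{
	int n = snprintf(buf, len, "%s%s", enable ? "" : "NO_", name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the token to the given sched_features file. The path is passed in
 * by the caller; /sys/kernel/debug/sched_features is the assumed location.
 * Returns 0 on success, -1 on any error.
 */
static int set_sched_feature(const char *path, const char *name, int enable)
{
	char token[64];
	FILE *f;
	int rc;

	if (feature_token(token, sizeof(token), name, enable) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	rc = (fputs(token, f) == EOF) ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;

	return rc;
}
```

A before/after run would then call
set_sched_feature("/sys/kernel/debug/sched_features",
"CORESCHED_STALL_FIX", 1) to enable the fix and pass 0 to disable it
again.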
The up-to-date git tree can also be found here in case it’s easier to
follow:
https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test
Feedback welcome!
Thanks,
Julien
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e79421..26fea68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	 * If class_pick is tagged, return it only if it has
 	 * higher priority than max.
 	 */
-	if (max && class_pick->core_cookie &&
-	    prio_less(class_pick, max))
+	bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
+		max && !prio_less(max, class_pick) :
+		max && prio_less(class_pick, max);
+	if (class_pick->core_cookie && max_is_higher)
 		return idle_sched_class.pick_task(rq);
 	return class_pick;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
  * UtilEstimation. Use estimated CPU utilization.
  */
 SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)