Message-ID: <CAERHkrtvLKxrpvfw04urAuougsYOWnNw4-H1vUDFx27Dvy0=Ww@mail.gmail.com>
Date: Mon, 22 Jul 2019 18:26:46 +0800
From: Aubrey Li <aubrey.intel@...il.com>
To: Aaron Lu <aaron.lu@...ux.alibaba.com>
Cc: Julien Desfossez <jdesfossez@...italocean.com>,
Subhra Mazumdar <subhra.mazumdar@...cle.com>,
Vineeth Remanan Pillai <vpillai@...italocean.com>,
Nishanth Aravamudan <naravamudan@...italocean.com>,
Peter Zijlstra <peterz@...radead.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Paul Turner <pjt@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Frédéric Weisbecker <fweisbec@...il.com>,
Kees Cook <keescook@...omium.org>,
Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
Valentin Schneider <valentin.schneider@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Jul 18, 2019 at 6:07 PM Aaron Lu <aaron.lu@...ux.alibaba.com> wrote:
>
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> > On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > > The result still looks unfair, and in particular, the variance is too high,
> >
> > I just want to confirm that I am also seeing the same issue with a
> > similar setup. I also tried with the priority boost fix we previously
> > posted, the results are slightly better, but we are still seeing a very
> > high variance.
> >
> > On average, the results I get for 10 30-second runs are still much
> > better than nosmt (both sysbench instances pinned on the same sibling) for the
> > memory benchmark, and pretty similar for the CPU benchmark, but the high
> > variance between runs is indeed concerning.
>
> I was thinking of using the util_avg signal to decide which task wins
> in __prio_less() in the cross-cpu case. util_avg is chosen because it
> represents how cpu intensive a task is, so the end result is that the
> less cpu intensive task will preempt the more cpu intensive one.
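As a quick illustration of that idea (a user-space toy with made-up
numbers, not the actual kernel code below):

#include <stdbool.h>
#include <stdio.h>

struct toy_task {
	const char *comm;
	unsigned long util_avg;	/* 0..1024, like se->avg.util_avg */
};

/* "Less cpu intensive wins": a loses if it is busier than b. */
static bool toy_prio_less(const struct toy_task *a, const struct toy_task *b)
{
	return a->util_avg > b->util_avg;
}

int main(void)
{
	struct toy_task hog   = { "page_fault1", 980 };	/* cpu bound */
	struct toy_task shell = { "bash",         15 };	/* mostly idle */

	/* Prints 1: the cpu hog is considered lower priority. */
	printf("%s loses to %s: %d\n", hog.comm, shell.comm,
	       toy_prio_less(&hog, &shell));
	return 0;
}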
>
> Here is the test I did to see how util_avg works
> (on a single-node, 16-core, 32-cpu VM):
> 1 Start tmux and then start 3 windows with each running bash;
> 2 Place two shells into two different cgroups and both have cpu.tag set;
> 3 Switch to the 1st tmux window, start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the first tagged shell;
> 4 Switch to the 2nd tmux window;
> 5 Start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the 2nd tagged shell;
> 6 Switch to the 3rd tmux window;
> 7 Do some simple things, like ls, in the 3rd untagged shell to see if
>   the untagged task is able to proceed;
> 8 Wait for the two page_fault workloads to finish.
>
> With v3 here, I cannot do step 4 and the later steps, i.e. the 16
> page_fault1 processes started in step 3 occupy all 16 cores and
> other tasks, including tmux, do not get a chance to run, which makes
> switching tmux windows impossible.
>
> With the below patch on top of v3, which makes use of util_avg to decide
> which task wins, I can do all 8 steps and the final scores of the 2
> workloads are: 1796191 and 2199586. The score numbers are not close,
> suggesting some unfairness, but I can finish the test now...
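(For reference, 2199586 / 1796191 is roughly 1.22, so the faster tagged
workload got about 22% more throughput than the slower one.)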
>
> Here is the diff (consider it a POC):
>
> ---
> kernel/sched/core.c | 35 ++---------------------------------
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 40 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..7557a7bbb481 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
> if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> - if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> - u64 a_vruntime = a->se.vruntime;
> - u64 b_vruntime = b->se.vruntime;
> -
> - /*
> - * Normalize the vruntime if tasks are in different cpus.
> - */
> - if (task_cpu(a) != task_cpu(b)) {
> - b_vruntime -= task_cfs_rq(b)->min_vruntime;
> - b_vruntime += task_cfs_rq(a)->min_vruntime;
> -
> - trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
> - a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
> - b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
> -
> - }
> -
> - return !((s64)(a_vruntime - b_vruntime) <= 0);
> - }
> + if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> + return cfs_prio_less(a, b);
>
> return false;
> }
> @@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
> if (!class_pick)
> return NULL;
>
> - if (!cookie) {
> - /*
> - * If class_pick is tagged, return it only if it has
> - * higher priority than max.
> - */
> - bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
> - max && !prio_less(max, class_pick) :
> - max && prio_less(class_pick, max);
> - if (class_pick->core_cookie && max_is_higher)
> - return idle_sched_class.pick_task(rq);
> -
> - return class_pick;
> - }
> -
> /*
> * If class_pick is idle or matches cookie, return early.
> */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26d29126d6a5..06fb00689db1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
> #endif /* SMP */
>
> }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> + struct sched_entity *sea = &a->se;
> + struct sched_entity *seb = &b->se;
> + bool samecore = task_cpu(a) == task_cpu(b);
> + struct task_struct *p;
> + s64 delta;
> +
> + if (samecore) {
> + /* vruntime is per cfs_rq */
> + while (!is_same_group(sea, seb)) {
> + int sea_depth = sea->depth;
> + int seb_depth = seb->depth;
> +
> + if (sea_depth >= seb_depth)
> + sea = parent_entity(sea);
> + if (sea_depth <= seb_depth)
> + seb = parent_entity(seb);
> + }
> +
> + delta = (s64)(sea->vruntime - seb->vruntime);
> + }
> +
> + /* across cpu: use util_avg to decide which task to run */
> + delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);

The update granularity of util_avg seems too coarse to decide task
priority during pick_task(), at least in my case: cfs_prio_less()
always picked the core max task, so pick_task() eventually picked
idle, which makes this change not very helpful for my case.

<idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578 ffff889050f68600
<idle>-0 [057] dN.. 83.716974: __schedule: (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
<idle>-0 [057] dN.. 83.716975: __schedule: (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
<idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked sysbench/2578 util_avg: 20 527 -507 <======= here===
<idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie pick swapper/5/0 ffff889050f68600
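
To put a rough number on that granularity (a back-of-the-envelope
sketch of my own, assuming the usual PELT constants of ~1024us periods
and a 32ms half-life, and ignoring the exact sum/divisor bookkeeping;
compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-period decay, y^32 = 0.5 */
	double util = 1024.0;			/* saturated util_avg */
	int ms;

	/* Approximate decay of a fully busy task that suddenly goes idle. */
	for (ms = 1; ms <= 8; ms++) {
		util *= y;			/* one ~1ms PELT period of idling */
		printf("~%dms after going idle: util_avg ~ %.0f\n", ms, util);
	}
	return 0;
}

util_avg only moves by a couple of percent per millisecond, so between
two nearby pick_task() invocations the comparison essentially never
flips, which matches the trace above where the same task keeps winning.
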
Thanks,
-Aubrey