Date:   Fri, 8 May 2020 16:44:19 +0800
From:   Aaron Lu <aaron.lwe@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        "Li, Aubrey" <aubrey.li@...ux.intel.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Joel Fernandes <joel@...lfernandes.org>
Subject: Re: [PATCH updated v2] sched/fair: core wide cfs task priority
 comparison

On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> Sorry for being verbose; I've been procrastinating replying, and in
> doing so the things I wanted to say kept growing.
> 
> On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:
> 
> > To make this work, the root level sched entities' vruntime of the two
> > threads must be directly comparable. So one of the hyperthread's root
> > cfs_rq's min_vruntime is chosen as the core wide one and all root level
> > sched entities' vruntime is normalized against it.
> 
> > +/*
> > + * This is called in stop machine context so no need to take the rq lock.
> > + *
> > + * Core scheduling is going to be enabled and the root level sched entities
> > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> > + * min_vruntime, so it's necessary to normalize vruntime of existing root
> > + * level sched entities in sibling_cfs_rq.
> > + *
> > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> > + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> > + */
> > +void sched_core_normalize_se_vruntime(int cpu)
> > +{
> > +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> > +	int i;
> > +
> > +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> > +		struct sched_entity *se, *next;
> > +		struct cfs_rq *sibling_cfs_rq;
> > +		s64 delta;
> > +
> > +		if (i == cpu)
> > +			continue;
> > +
> > +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> > +		if (!sibling_cfs_rq->nr_running)
> > +			continue;
> > +
> > +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> > +		rbtree_postorder_for_each_entry_safe(se, next,
> > +				&sibling_cfs_rq->tasks_timeline.rb_root,
> > +				run_node) {
> > +			se->vruntime += delta;
> > +		}
> > +	}
> > +}
> 
> Aside from this being way too complicated for what it does -- you
> could've saved the min_vruntime for each rq and compared them with
> subtraction -- it is also terminally broken afaict.
>
> Consider any infeasible weight scenario. Take for instance two tasks,
> each bound to their respective sibling, one with weight 1 and one with
> weight 2. Then the lower weight task will run ahead of the higher weight
> task without bound.
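
On the simpler subtraction approach: if I read you right, it would be
something roughly like the below (core_min_vruntime is a hypothetical
per-cfs_rq snapshot of min_vruntime taken when core scheduling is
enabled, not a field from this patch):

/*
 * Hypothetical sketch only: snapshot each rq's min_vruntime once when
 * core scheduling is enabled, then compare tasks core wide by their
 * offset from that snapshot instead of rewriting every entity's
 * vruntime.
 */
static inline bool core_vruntime_less(struct task_struct *a,
				      struct task_struct *b)
{
	s64 va = a->se.vruntime - task_cfs_rq(a)->core_min_vruntime;
	s64 vb = b->se.vruntime - task_cfs_rq(b)->core_min_vruntime;

	return va < vb; /* smaller normalized vruntime runs first */
}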

I don't follow how the lower weight task could run ahead without bound.
Even if the lower weight task runs first, the higher weight task will
get its turn after some time, and from then on the higher weight task
will get more chances to run (due to its higher weight and thus slower
accumulation of vruntime).
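
To spell out the arithmetic: vruntime advances by roughly
delta_exec * NICE_0_LOAD / weight (see calc_delta_fair()), so for the
same wall-clock runtime the weight-2 task accrues vruntime at half the
rate of the weight-1 task. A toy calculation (taking NICE_0_LOAD as
1024 for simplicity, ignoring load tracking effects):

#include <stdio.h>

#define NICE_0_LOAD 1024ULL

int main(void)
{
	unsigned long long delta_exec = 1000000; /* 1ms of runtime, in ns */

	/* higher weight => smaller vruntime increment => runs sooner again */
	printf("weight 1: +%llu ns vruntime\n", delta_exec * NICE_0_LOAD / 1);
	printf("weight 2: +%llu ns vruntime\n", delta_exec * NICE_0_LOAD / 2);
	return 0;
}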

We used to have the following patch as a standalone one in v4:
sched/fair: Wake up forced idle siblings if needed
https://lore.kernel.org/lkml/cover.1572437285.git.vpillai@digitalocean.com/T/#md22d25d0e2932d059013e9b56600d8a847b02a13
which originates from:
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And in this series, it seems to be merged in:
[RFC PATCH 07/13] sched: Add core wide task selection and scheduling
https://lore.kernel.org/lkml/e942da7fd881977923463f19648085c1bfaa37f8.1583332765.git.vpillai@digitalocean.com/

My local test shows that when two cgroups' shares are both set to 1024
and each is bound to one sibling of a core, a cpu intensive task started
in each cgroup consumes 50% cpu. When one cgroup's share is set to 512,
its task consumes about 33% cpu while the other consumes 67%
(512 : 1024 = 1 : 2), as expected.
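
The setup is roughly the below (cgroup v1 cpu controller, hypothetical
group names g1/g2, error handling trimmed), with one cpu hog per group
and each hog bound to one sibling:

#include <stdio.h>

static void set_share(const char *grp, int share)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/cpu.shares", grp);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "%d\n", share);
		fclose(f);
	}
}

int main(void)
{
	set_share("g1", 1024);
	set_share("g2", 512); /* 1024:512 = 2:1, i.e. ~67% vs ~33% */
	/*
	 * Then write each hog's pid to the group's tasks file, pin the
	 * hogs to the two siblings (sched_setaffinity()/taskset), and
	 * watch the split in top(1).
	 */
	return 0;
}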

I think the current patch works fine when two differently tagged tasks
are competing for CPU, but when there are three or more tasks, things
can get less fair.
