Date:   Fri, 15 May 2020 12:38:44 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vineeth Remanan Pillai <vpillai@...italocean.com>
Cc:     Aaron Lu <aaron.lwe@...il.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        "Li, Aubrey" <aubrey.li@...ux.intel.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Joel Fernandes <joel@...lfernandes.org>
Subject: Re: [PATCH updated v2] sched/fair: core wide cfs task priority
 comparison

On Thu, May 14, 2020 at 06:51:27PM -0400, Vineeth Remanan Pillai wrote:
> On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > A little something like so; this syncs min_vruntime when we switch to
> > single queue mode. This is very much SMT2 only, I got my head in a twist
> > when thinking about more siblings, I'll have to try again later.
> >
> Thanks for the quick patch! :-)
> 
> For SMT-n, would it work if we sync vruntime when at least one sibling is
> forced idle? Since force_idle is for all the rqs, I think it would
> be correct to sync the vruntime if at least one cpu is forced idle.

It's complicated ;-)

So this sync is basically a relative reset of S to 0.

So with 2 queues, when one goes idle, we drop them both to 0; the one
that's still running then advances, and the idle one builds up lag to
get re-elected. So far so simple, right?

When there are 3, we can have the situation where 2 run and one is idle:
we sync to 0 and let the idle one build up lag to get re-elected. Now
suppose another one also goes idle. At this point dropping everything to 0
again would destroy the built-up lag of the queue that was already idle,
which is not good.
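
(Toy numbers: all three lags are reset to 0 when C goes idle; A and B keep
running, so their s grow to, say, +6 while C stays at 0, i.e. C now trails
by 6 and wins the next election. If B also goes idle and we reset everything
to 0 again, the 6 that C had built up against A is wiped out and C starts
over on equal footing with B, even though C has been starved the longest.)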

So instead of syncing everything, we can:

  less := !((s64)(s_a - s_b) <= 0)

  (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
                            == v_a - (v_b + S_a - S_b)

IOW, we can recast the (lag) comparison as a one-sided difference.
So then, instead of syncing all the queues, we can sync just the idle
queue against the active queue, using the S_a and S_b values at the point
where we sync.

(XXX consider the implication of living in a cyclic group: Z / 2^n Z)

This gives us a means of syncing single queues against the active queue,
and lets already idle queues preserve their built-up lag.
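
Purely to illustrate that identity (throw-away helper, not proposed code;
the s64 cast keeps the comparison sane under u64 wrap-around as long as the
real difference fits in 63 bits):

/*
 * Illustration only: the lag comparison depends solely on the
 * difference S_a - S_b, so re-basing a single queue's S against the
 * active queue is enough; the other (already idle) queues keep their
 * built-up lag untouched. Mirrors the 'less' expression above.
 */
static inline bool lag_prio_less(u64 v_a, u64 S_a, u64 v_b, u64 S_b)
{
	/* (v_a - S_a) - (v_b - S_b) == v_a - (v_b + S_a - S_b) */
	return (s64)(v_a - (v_b + S_a - S_b)) > 0;
}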

Of course, then we get the situation where there are 2 active and one
going idle: which do we pick to sync against? Theory would have us sync
against the combined S, but as we've already demonstrated, there is no
such thing in infeasible-weight scenarios.

One thing I've considered, and this is where that core_active rudiment
came from, is having the active queues sync up between themselves every
tick. That limits the observed divergence due to work conservation.
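
Very roughly, something of this shape (entirely made up; nothing like it
is in the patches here):

static void core_active_sync(struct rq *rq)
{
	const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
	int cpu;

	/*
	 * Every tick, each active sibling re-bases its S on its own
	 * min_vruntime, so the relative lag between active queues can
	 * only drift by what accumulates within a single tick.
	 */
	for_each_cpu(cpu, smt_mask) {
		struct rq *rq_i = cpu_rq(cpu);

		if (!is_idle_task(rq_i->curr))
			rq_i->cfs.core_vruntime = rq_i->cfs.min_vruntime;
	}
}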

On top of that, we can improve things further by moving away from our
horrible (10) hack to (9), and employing (13) here.

Anyway, I got partway through that in the past days, but then my head
hurt. I'll consider it some more :-)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > -               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > -                       rq_i->core_forceidle = true;
> > +               if (is_idle_task(rq_i->core_pick)) {
> > +                       if (rq_i->nr_running)
> > +                               rq_i->core_forceidle = true;
> > +               } else {
> > +                       new_active++;
> I think we need to reset new_active on restarting the selection.

But this loop is after selection has been done; we don't modify
new_active during selection.

> > +               /*
> > +                * We just dropped into single-rq mode, increment the sequence
> > +                * count to trigger the vruntime sync.
> > +                */
> > +               rq->core->core_sync_seq++;
> > +       }
> > +       rq->core->core_active = new_active;
> core_active seems to be unused.

Correct; those are rudiments from an SMT-n attempt.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> > +       struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> > +       u64 vruntime_a, vruntime_b;
> > +
> > +       while (!is_same_tg(se_a, se_b)) {
> > +               int se_a_depth = se_a->depth;
> > +               int se_b_depth = se_b->depth;
> > +
> > +               if (se_a_depth <= se_b_depth)
> > +                       se_b = parent_entity(se_b);
> > +               if (se_a_depth >= se_b_depth)
> > +                       se_a = parent_entity(se_a);
> > +       }
> > +
> > +       cfs_rq_a = cfs_rq_of(se_a);
> > +       cfs_rq_b = cfs_rq_of(se_b);
> > +
> > +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> Should we be using core_vruntime conditionally? Should it be min_vruntime for
> default comparisons and core_vruntime during force_idle?

At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
ie. when we're on the same CPU.

For the other case I was considering that tick-based active sync, but I
never got it finished and admittedly it all looks a bit weird. But I
figured I'd send it out so we can at least advance the discussion.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
 {
 	struct sched_entity *se_a = &a->se, *se_b = &b->se;
 	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
-	u64 vruntime_a, vruntime_b;
+	u64 s_a, s_b, S_a, S_b;
 
 	while (!is_same_tg(se_a, se_b)) {
 		int se_a_depth = se_a->depth;
@@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
 	cfs_rq_a = cfs_rq_of(se_a);
 	cfs_rq_b = cfs_rq_of(se_b);
 
-	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
-	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+	S_a = cfs_rq_a->core_vruntime;
+	S_b = cfs_rq_b->core_vruntime;
 
-	return !((s64)(vruntime_a - vruntime_b) <= 0);
+	if (cfs_rq_a == cfs_rq_b)
+		S_a = S_b = cfs_rq_a->min_vruntime;
+
+	s_a = se_a->vruntime - S_a;
+	s_b = se_b->vruntime - S_b;
+
+	return !((s64)(s_a - s_b) <= 0);
 }
 
 static __always_inline
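
For readability, cfs_prio_less() with the above delta folded in would read
roughly as below (reconstructed from the two hunks in this thread, comments
added; not a separate patch):

bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
	struct sched_entity *se_a = &a->se, *se_b = &b->se;
	struct cfs_rq *cfs_rq_a, *cfs_rq_b;
	u64 s_a, s_b, S_a, S_b;

	/* Walk both entities up to a common task_group level. */
	while (!is_same_tg(se_a, se_b)) {
		int se_a_depth = se_a->depth;
		int se_b_depth = se_b->depth;

		if (se_a_depth <= se_b_depth)
			se_b = parent_entity(se_b);
		if (se_a_depth >= se_b_depth)
			se_a = parent_entity(se_a);
	}

	cfs_rq_a = cfs_rq_of(se_a);
	cfs_rq_b = cfs_rq_of(se_b);

	/* Cross-CPU: measure lag against the synced core_vruntime. */
	S_a = cfs_rq_a->core_vruntime;
	S_b = cfs_rq_b->core_vruntime;

	/* Same CPU: min_vruntime is the natural reference. */
	if (cfs_rq_a == cfs_rq_b)
		S_a = S_b = cfs_rq_a->min_vruntime;

	s_a = se_a->vruntime - S_a;
	s_b = se_b->vruntime - S_b;

	return !((s64)(s_a - s_b) <= 0);
}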
