Date:   Thu, 5 Nov 2020 17:07:43 -0500
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Vineeth Pillai <viremana@...ux.microsoft.com>,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>, tglx@...utronix.de,
        linux-kernel@...r.kernel.org, mingo@...nel.org,
        torvalds@...ux-foundation.org, fweisbec@...il.com,
        keescook@...omium.org, kerrnel@...gle.com,
        Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>, vineeth@...byteword.org,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Agata Gruza <agata.gruza@...el.com>,
        Antonio Gomez Iglesias <antonio.gomez.iglesias@...el.com>,
        graf@...zon.com, konrad.wilk@...cle.com, dfaggioli@...e.com,
        pjt@...gle.com, rostedt@...dmis.org, derkling@...gle.com,
        benbjiang@...cent.com,
        Alexandre Chartre <alexandre.chartre@...cle.com>,
        James.Bottomley@...senpartnership.com, OWeisse@...ch.edu,
        Dhaval Giani <dhaval.giani@...cle.com>,
        Junaid Shahid <junaids@...gle.com>, jsbarnes@...gle.com,
        chris.hyser@...cle.com,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        Tim Chen <tim.c.chen@...el.com>
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and
 scheduling.

On Thu, Nov 05, 2020 at 01:50:19PM -0500, Joel Fernandes wrote:
> On Mon, Oct 26, 2020 at 10:31:31AM +0100, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > > On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> > 
> > > > How about this then?
> > > 
> > > This does look better. It makes sense and I think it will work. I will look
> > > more into it and also test it.
> > 
> > Hummm... Looking at it again I wonder if I can make something like the
> > below work.
> > 
> > (depends on the next patch that pulls core_forceidle into core-wide
> > state)
> > 
> > That would retain the CFS-cgroup optimization as well, for as long as
> > there are no cookies around.
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
> >  		return next;
> >  	}
> >  
> > -	put_prev_task_balance(rq, prev, rf);
> > -
> >  	smt_mask = cpu_smt_mask(cpu);
> >  
> >  	/*
> > @@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
> >  	 */
> >  	rq->core->core_task_seq++;
> >  	need_sync = !!rq->core->core_cookie;
> > -
> > -	/* reset state */
> > -reset:
> > -	rq->core->core_cookie = 0UL;
> >  	if (rq->core->core_forceidle) {
> >  		need_sync = true;
> >  		rq->core->core_forceidle = false;
> >  	}
> > +
> > +	if (!need_sync) {
> > +		next = __pick_next_task(rq, prev, rf);
> 
> This could end up triggering pick_next_task_fair's newidle balancing;
> 
> > +		if (!next->core_cookie) {
> > +			rq->core_pick = NULL;
> > +			return next;
> > +		}
> 
> .. only to realize here, after pick_next_task_fair() has run, that we have
> to put_prev the task back as it has a cookie, but the effect of newidle
> balancing cannot be reverted.
> 
> Would that be a problem? The newly pulled task might be incompatible, and it
> would have been better to leave it alone.
> 
> TBH, this is a drastic change, and we've done a lot of testing with the
> current code; it's looking good. I'm a little scared of changing it right
> now and introducing a regression. Can we maybe do this after the existing
> patches are upstream?

After sleeping on it, I am trying something like the following. Thoughts?

Basically, I call pick_task() in advance. That's mostly all that's different
from your patch (a toy model of the resulting fast path follows the diff):

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ce17aa72694..366e5ed84a63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5000,28 +5000,34 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
-
-	/*
-	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
-	 *
-	 * @task_seq guards the task state ({en,de}queues)
-	 * @pick_seq is the @task_seq we did a selection on
-	 * @sched_seq is the @pick_seq we scheduled
-	 *
-	 * However, preemptions can cause multiple picks on the same task set.
-	 * 'Fix' this by also increasing @task_seq for every pick.
-	 */
-	rq->core->core_task_seq++;
 	need_sync = !!rq->core->core_cookie;
 
 	/* reset state */
-reset:
 	rq->core->core_cookie = 0UL;
 	if (rq->core->core_forceidle) {
 		need_sync = true;
 		fi_before = true;
 		rq->core->core_forceidle = false;
 	}
+
+	/*
+	 * Optimize for the common case where this CPU has no cookies
+	 * and there are no cookied tasks running on siblings.
+	 */
+	if (!need_sync) {
+		for_each_class(class) {
+			next = class->pick_task(rq);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			goto done;
+		}
+		need_sync = true;
+	}
+
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -5039,6 +5045,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+
 	/*
 	 * Try and select tasks for each sibling in descending sched_class
 	 * order.
@@ -5059,40 +5077,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 */
-				if (i == cpu && !need_sync)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync) {
-				if (p->core_cookie) {
-					/*
-					 * This optimization is only valid as
-					 * long as there are no cookies
-					 * involved. We may have skipped
-					 * non-empty higher priority classes on
-					 * siblings, which are empty on this
-					 * CPU, so start over.
-					 */
-					need_sync = true;
-					goto reset;
-				}
-
-				next = p;
-				trace_printk("unconstrained pick: %s/%d %lx\n",
-					     next->comm, next->pid, next->core_cookie);
-				goto done;
-			}
 
 			if (!is_task_rq_idle(p))
 				occ++;

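For completeness, here is a toy user-space model of the fast path above (all
names like toy_rq/toy_pick_next are made up for illustration; the real logic
is in the diff). The point it demonstrates: because the local pick comes from
a side-effect-free pick_task()-style hook, we can still fall back to the
core-wide slow path when the picked task carries a cookie, unlike
__pick_next_task(), whose newidle balancing cannot be undone.

#include <stdio.h>

struct toy_task {
	const char *comm;
	unsigned long core_cookie;	/* 0 == untagged */
};

struct toy_rq {
	struct toy_task *queued;	/* highest-priority runnable task */
	unsigned long core_cookie;	/* core-wide cookie state */
	int core_forceidle;
};

/* Stand-in for class->pick_task(): looks at the queue, no side effects. */
static struct toy_task *toy_pick_task(struct toy_rq *rq)
{
	return rq->queued;
}

static struct toy_task *toy_pick_next(struct toy_rq *rq)
{
	int need_sync = rq->core_cookie != 0;

	if (rq->core_forceidle) {
		need_sync = 1;
		rq->core_forceidle = 0;
	}

	if (!need_sync) {
		struct toy_task *next = toy_pick_task(rq);

		/* Unconstrained fast path: no cookies anywhere. */
		if (next && !next->core_cookie)
			return next;

		/* Cookied pick: safe to fall through, nothing was pulled. */
	}

	printf("slow path: core-wide selection across SMT siblings\n");
	return toy_pick_task(rq);	/* placeholder for the real loop */
}

int main(void)
{
	struct toy_task plain = { "plain", 0 };
	struct toy_task tagged = { "tagged", 0xbeef };
	struct toy_rq rq = { &plain, 0, 0 };

	printf("picked %s\n", toy_pick_next(&rq)->comm);	/* fast path */

	rq.queued = &tagged;
	printf("picked %s\n", toy_pick_next(&rq)->comm);	/* slow path */
	return 0;
}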