linux-kernel - Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20211028121530.GA19512@vingu-book>
Date:   Thu, 28 Oct 2021 14:15:30 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tim Chen <tim.c.chen@...ux.intel.com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort

Le mercredi 27 oct. 2021 à 13:53:32 (-0700), Tim Chen a écrit :
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> > 
> > > Looking at the profile on update_blocked_averages a bit more,
> > > the majority of the call to update_blocked_averages
> > > happens in run_rebalance_domain.  And we are not
> > > including that cost of update_blocked_averages for
> > > run_rebalance_domains in our current patch set. I think
> > > the patch set should account for that too.
> > 
> > nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> > trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> > This would probably benefit to take into account the cost of
> > update_blocked_averages instead
> > 
> 
> For the case where
> 
> 	this_rq->avg_idle < sysctl_sched_migration_cost
> 
> in newidle_balance(), we skip to the out: label
> 
> out:
>         /* Move the next balance forward */
>         if (time_after(this_rq->next_balance, next_balance))
>                 this_rq->next_balance = next_balance;
> 
>         if (pulled_task)
>                 this_rq->idle_stamp = 0;
>         else
>                 nohz_newidle_balance(this_rq);
> 
> and we call nohz_newidle_balance as we don't have a pulled_task.
> 
> It seems to make sense to skip the call
> to nohz_newidle_balance() for this case? 

nohz_newidle_balance() also tests this condition :
(this_rq->avg_idle < sysctl_sched_migration_cost)
and doesn't set NOHZ_NEWILB_KICKi in such case

But this patch now used the condition :
this_rq->avg_idle < sd->max_newidle_lb_cost
and sd->max_newidle_lb_cost can be higher than sysctl_sched_migration_cost

which means that we can set NOHZ_NEWILB_KICK:
-although we decided to skip newidle loop
-or when we abort because this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost 

This is even more true when sysctl_sched_migration_cost is lowered which is your case IIRC

The patch below ensures that we don't set NOHZ_NEWILB_KICK in such cases:

---
 kernel/sched/fair.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c19f4bb3df1a..36ddae208959 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10779,7 +10779,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	int pulled_task = 0, early_stop = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10816,8 +10816,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		if (sd) {
 			update_next_balance(sd, &next_balance);
+
+			/*
+			 * We skip new idle LB because there is not enough
+			 * time before next wake up. Make sure that we will
+			 * not kick NOHZ_NEWILB_KICK
+			 */
+			early_stop = 1;
+		}
 		rcu_read_unlock();
 
 		goto out;
@@ -10836,8 +10844,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		update_next_balance(sd, &next_balance);
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+			early_stop = 1;
 			break;
+		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
@@ -10887,7 +10897,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
-	else
+	else if (!early_stop)
 		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
-- 

> We expect a very short idle and a task to wake shortly.  
> So we do not have to pull a task
> to this idle cpu and incur the migration cost.
> 
> Tim
> 
> 
> 
>