Date:   Fri, 11 Jun 2021 13:00:22 -0700
From:   Tim Chen <tim.c.chen@...ux.intel.com>
To:     Qais Yousef <qais.yousef@....com>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Paul McKenney <paulmck@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Dietmar Eggeman <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Uladzislau Rezki (Sony)" <urezki@...il.com>,
        Neeraj upadhyay <neeraj.iitr10@...il.com>,
        Aubrey Li <aubrey.li@...ux.intel.com>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages()
 for NOHZ



On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to  have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
> 
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
> 
> Is this max() safe against wrapping?
> 

Vincent, 

Sorry I haven't gotten back to you sooner. I was finally able to get some
test time on the test system. The fix corrects the next balance time going
backward, but the frequency of balancing remains the same, so we don't see
a performance improvement.

I incorporated Qais' suggestion to take care of the wrap-around time
(see patch #1) in the patches below.  This patch by itself prevented
next_balance from going backward.  However, most of the time the next
balance occurs in the very next jiffie after newidle_balance() ran, so
we still have the same amount of load balancing as the vanilla kernel
on the OLTP workload I was looking at.  I didn't see a performance
improvement with just patch #1 and patch #2.
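
To answer the wrap question: a minimal sketch (illustrative only, mirroring
what patch #1 below does; sd->last_balance and interval are as in
update_next_balance()) of why the time_after() form is preferred over a
plain max() on raw jiffies:

	/*
	 * time_after(a, b) compares via signed subtraction, roughly
	 * (long)((b) - (a)) < 0, so the ordering stays correct when
	 * jiffies wraps around.  A plain max() compares the raw
	 * unsigned values and can pick a stale pre-wrap timestamp.
	 */
	unsigned long next;

	if (time_after(jiffies + 1, sd->last_balance + interval))
		next = jiffies + 1;
	else
		next = sd->last_balance + interval;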

The current logic is that when a CPU becomes idle, the next balance
occurs very shortly (usually in the next jiffie), because
get_sd_balance_interval() returns a next balance in the next jiffie for
an idle CPU.  In reality, however, I saw most CPUs at 95% busy on
average for my workload, so a task wakes up on an idle CPU shortly
after it goes idle.  Frequent idle balancing toward such briefly idle
CPUs is counterproductive: it simply adds overhead without improving
performance.
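
For context, here is roughly what get_sd_balance_interval() does
(paraphrased from kernel/sched/fair.c; the exact code in this kernel may
differ slightly).  The busy_factor multiplier is only applied when
cpu_busy is set, so an idle CPU gets the short base balance_interval and
ends up rebalancing almost immediately:

	/* Paraphrased sketch, not an exact copy of this kernel version. */
	static inline unsigned long
	get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
	{
		unsigned long interval = sd->balance_interval;

		/* a busy CPU stretches its interval by sd->busy_factor */
		if (cpu_busy)
			interval *= sd->busy_factor;

		/* scale ms to jiffies and clamp to a sane range */
		interval = msecs_to_jiffies(interval);
		interval = clamp(interval, 1UL, max_load_balance_interval);

		return interval;
	}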

I tried an additional patch (patch #3) on top of the others. It improved
performance by 5%, which is quite significant for my OLTP workload.
When determining the next_balance interval, the patch considers a CPU
busy if its average utilization is more than 80%.  This tweak may not
be ideal for the case where a CPU becomes idle after a CPU-intensive
task that dominated it for a long time blocks for a while.
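
As a concrete example of the integer check used in patch #3 below
(cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8): with a full
capacity_orig of 1024 (SCHED_CAPACITY_SCALE), the CPU is treated as busy
once cpu_util exceeds 819, i.e. roughly 80% utilization, without needing
any division.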

Hopefully we can find a way to make a good judgement about whether a
mostly busy CPU has just become idle and a task is likely to wake up
on it soon.  In that case we should push out the next balance time.
Such logic is absent today from the idle load balance path, and the
resulting frequent load balancing hurts performance when cgroups are
enabled, because computing update_blocked_averages() before each load
balance becomes expensive.  For my OLTP workload, we lose 9% of
performance when cgroups are enabled.

Tim
 

----

From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
Message-Id: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@...ux.intel.com>
From: Tim Chen <tim.c.chen@...ux.intel.com>
Date: Tue, 11 May 2021 09:55:41 -0700
Subject: [PATCH 1/3] sched: Fix rq->next_balance time updated to earlier
 than current time

In traces of newidle_balance(), this_rq->next_balance sometimes goes
backward to a time earlier than the current jiffies, e.g.

11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

It doesn't make sense to have a next_balance in the past.
Fix newidle_balance() and update_next_balance() so the next
balance time is at least jiffies+1.

Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..740a0572cbf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
 
 	/* used by idle balance, so cpu_busy = 0 */
 	interval = get_sd_balance_interval(sd, 0);
-	next = sd->last_balance + interval;
+	if (time_after(jiffies+1, sd->last_balance + interval))
+		next = jiffies+1;
+	else
+		next = sd->last_balance + interval;
 
 	if (time_after(*next_balance, next))
 		*next_balance = next;
@@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 out:
 	/* Move the next balance forward */
+	if (time_after(jiffies+1, this_rq->next_balance))
+		this_rq->next_balance = jiffies+1;
 	if (time_after(this_rq->next_balance, next_balance))
 		this_rq->next_balance = next_balance;
 
-- 
2.20.1


From 59de98515bda38b8d6faec5f8c68e1c9ec18962e Mon Sep 17 00:00:00 2001
Message-Id: <59de98515bda38b8d6faec5f8c68e1c9ec18962e.1623433293.git.tim.c.chen@...ux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@...ux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@...ux.intel.com>
From: Vincent Guittot <vincent.guittot@...aro.org>
Date: Fri, 7 May 2021 14:38:10 -0700
Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
 load balance

In newidle_balance(), the scheduler skips the load balance to the newly
idle cpu when, for this_rq's lowest sched domain sd,

   this_rq->avg_idle < sd->max_newidle_lb_cost

When this condition is true, making a costly call to
update_blocked_averages() is not useful and simply adds overhead.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
---
 kernel/sched/fair.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740a0572cbf1..a69bfc651e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10615,17 +10615,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
+	rcu_read_lock();
+	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload)) {
+	    !READ_ONCE(this_rq->rd->overload) ||
+	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		rcu_read_lock();
-		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
 	}
+	rcu_read_unlock();
 
 	raw_spin_unlock(&this_rq->lock);
 
-- 
2.20.1


From 4622055d989a5feb446a7893a48fcd31305ec7a7 Mon Sep 17 00:00:00 2001
Message-Id: <4622055d989a5feb446a7893a48fcd31305ec7a7.1623433293.git.tim.c.chen@...ux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@...ux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@...ux.intel.com>
From: Tim Chen <tim.c.chen@...ux.intel.com>
Date: Mon, 24 May 2021 13:21:03 -0700
Subject: [PATCH 3/3] sched: Don't shorten the load balance interval of an
 80% or more busy CPU

For a CPU that is busy 80% or more on average, it is quite likely that a
task will wake up on it very soon. It is better not to shorten its load
balance interval as if it were completely idle, to save on load
balancing overhead.

Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
---
 kernel/sched/fair.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a69bfc651e55..7353395d8a3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9895,12 +9895,11 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 }
 
 static inline void
-update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
+update_next_balance(struct sched_domain *sd, unsigned long *next_balance, int cpu_busy)
 {
 	unsigned long interval, next;
 
-	/* used by idle balance, so cpu_busy = 0 */
-	interval = get_sd_balance_interval(sd, 0);
+	interval = get_sd_balance_interval(sd, cpu_busy);
 	if (time_after(jiffies+1, sd->last_balance + interval))
 		next = jiffies+1;
 	else
@@ -10593,6 +10592,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	u64 curr_cost = 0;
+	int cpu_busy = 0;
 
 	update_misfit_status(NULL, this_rq);
 	/*
@@ -10618,12 +10618,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
+	/*
+	 * Consider the cpu busy if it has more than 80% average utilization.
+	 * Don't idle balance such a cpu as frequently, since a task may wake up on it soon.
+	 */
+	if ((cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8))
+		cpu_busy = 1;
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
 	    !READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-			update_next_balance(sd, &next_balance);
+			update_next_balance(sd, &next_balance, cpu_busy);
+
 		rcu_read_unlock();
 
 		goto out;
@@ -10639,7 +10647,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		u64 t0, domain_cost;
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
-			update_next_balance(sd, &next_balance);
+			update_next_balance(sd, &next_balance, cpu_busy);
 			break;
 		}
 
@@ -10657,7 +10665,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			curr_cost += domain_cost;
 		}
 
-		update_next_balance(sd, &next_balance);
+		update_next_balance(sd, &next_balance, cpu_busy);
 
 		/*
 		 * Stop searching for tasks to pull if there are
-- 
2.20.1
