linux-kernel - [PATCH 3/3] sched: newidle balance set idle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <1297188819-19999-4-git-send-email-venki@google.com>
Date:	Tue,  8 Feb 2011 10:13:39 -0800
From:	Venkatesh Pallipadi <venki@...gle.com>
To:	Suresh Siddha <suresh.b.siddha@...el.com>
Cc:	Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...e.hu>,
	linux-kernel@...r.kernel.org, Paul Turner <pjt@...gle.com>,
	Mike Galbraith <efault@....de>,
	Nick Piggin <npiggin@...il.com>,
	Venkatesh Pallipadi <venki@...gle.com>
Subject: [PATCH 3/3] sched: newidle balance set idle_timestamp only on successful pull

load_balance() could return a negative value in the case of
SMT sibling CPU being busy. Code in idle_balance() though, uses this
return value as an indicator of successful task pull, ignoring the
-1 return value.

This has two problems:
1) Resets idle_stamp even when this return value is -1.
Specific case is on SMT capable system, CPU A is idle and its sibling
CPU B is busy. In this case, CPU A avg_idle will not depend on
a task sleeping/waking up on it. Instead it will continue to hold stale
avg_idle value for extended period of time.

Simple test case of driving avg_idle on a CPU to desired value by using
a usleep loop and then starting a 100% busy loop on its sibling and
changing the usleep rate on original CPU (or removing it completely),
I see the avg_idle on this CPU not updating at all in this case.

2) Breaks out of idle_balance, skipping all higher level domains.
Case can be made that breaking out here is a 'feature' and not a 'bug'.
Periodic balance uses this signal to drop down to busy balance for
higher level domains. But, is simple break out of balancing good for
newidle case?

We do see results in our workload, that this break increases idle time
and reduces both throughput and latency measurably. Also, as newidle
balance itself is ratelimited with avg_idle, would it be OK to continue
balancing upper domains in this case of SMT sibling busy? Or may be
reduce it to CPU_NOT_IDLE from CPU_NEWLY_IDLE? Change here goes with
the former option.

Signed-off-by: Venkatesh Pallipadi <venki@...gle.com>
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 91227d9..56ab3a6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3483,7 +3483,7 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
+		if (pulled_task > 0) {
 			this_rq->idle_stamp = 0;
 			break;
 		}
-- 
1.7.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/