Message-ID: <41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com>
Date: Fri, 8 Mar 2024 20:18:17 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the
'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t
sched_balance_running' flag
On 3/8/24 4:53 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
>> system is at 75% load <-- 25.6% contention
>> 113K probe:rebalance_domains_L37
>> 84K probe:rebalance_domains_L55
>>
>> system is at 100% load <-- 87.5% contention.
>> 64K probe:rebalance_domains_L37
>> 8K probe:rebalance_domains_L55
>>
>>
>> A few reasons for contentions could be:
>>
>> 1. idle load balance is running and some other cpu is becoming idle, and
>> tries newidle_balance.
>>
>> 2. when system is busy, every CPU would do busy balancing, it would
>> contend for the lock. It will not do balance as should_we_balance says
>> this CPU need not balance. It bails out and release the lock.
>
> Thanks, these measurements are really useful!
>
> Would it be possible to disambiguate these two cases?
I think it's not case 1, since newidle_balance() doesn't even take that lock. So
it's likely case 2.
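
For context, the contention figures quoted above appear to be
(trylock hits - unlock hits) / trylock hits, e.g. (113K - 84K) / 113K ~= 25.6%
and (64K - 8K) / 64K = 87.5%. Below is a minimal sketch (not verbatim kernel
code) of where that lock sits today in rebalance_domains(); newidle_balance()
calls load_balance() directly and never takes it, which is why case 1 should
not contribute. The probes quoted above were presumably at the spin_trylock()
and spin_unlock() lines:

	need_serialize = sd->flags & SD_SERIALIZE;	/* SD_SERIALIZE: NUMA domains */
	if (need_serialize) {
		if (!spin_trylock(&balancing))
			goto out;			/* lost the race: skip this domain */
	}

	if (time_after_eq(jiffies, sd->last_balance + interval))
		load_balance(cpu, rq, sd, idle, &continue_balancing);

	if (need_serialize)
		spin_unlock(&balancing);
out:
	/* fall through to next_balance bookkeeping */
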
>
> I think we should probably do something about this contention on this large
> system: especially if #2 'no work to be done' bailout is the common case.
>
I have been wondering whether it would be right to move this balancing
trylock/atomic to after should_we_balance() (swb). That reduces the number of
times the lock is checked/updated significantly. Some contention is still
present: at higher utilization with multiple NUMA domains, one CPU in each
NUMA domain can contend if their invocations line up.

That should be equivalent behaviour. Right now a CPU takes the lock, checks
whether it should balance, does the balance if yes, and then releases the lock.
With the lock taken after swb, the CPU first checks whether it should balance,
then tries to take the lock, and releases it afterwards if it got it. If the
lock is contended, it bails out of load_balance(). That matches the current
behaviour, unless I am completely wrong.

I am not sure in which scenarios this would hurt; we could do it after this
series. It would need wider functional testing to make sure we don't regress
badly in some case. This is only an *idea* as of now.
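
To make the ordering concrete before the numbers and the diff, a minimal
sketch of the idea inside load_balance() (not the final code, see the diff
further below):

	if (!should_we_balance(&env)) {
		*continue_balancing = 0;
		goto out_balanced;		/* most CPUs leave here, lock never touched */
	}

	if ((sd->flags & SD_SERIALIZE) && !spin_trylock(&balancing))
		goto lockout;			/* contended: give up this round */

	/* find_busiest_group(), detach/attach tasks, ... */

	if (sd->flags & SD_SERIALIZE)
		spin_unlock(&balancing);
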
Perf probes at the spin_trylock and spin_unlock code points, on the same
224-CPU, 6-NUMA-node system:
6.8-rc6
-----------------------------------------
idle system:
449 probe:rebalance_domains_L37
377 probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 51 << 51% load
88K probe:rebalance_domains_L37
77K probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 100 << 100% load
41K probe:rebalance_domains_L37
10K probe:rebalance_domains_L55
+below patch
----------------------------------------
idle system:
462 probe:load_balance_L35
394 probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 51 << 51% load
5K probe:load_balance_L35 <<-- almost 15x less
4K probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 100 << 100% load
8K probe:load_balance_L35
3K probe:load_balance_L274 <<-- almost 4x less
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62f247bdec86..3a8de7454377 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11272,6 +11272,7 @@ static int should_we_balance(struct lb_env *env)
return group_balance_cpu(sg) == env->dst_cpu;
}
+static DEFINE_SPINLOCK(balancing);
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct rq *busiest;
struct rq_flags rf;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+ int need_serialize = 0;
struct lb_env env = {
.sd = sd,
.dst_cpu = this_cpu,
@@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
+ need_serialize = sd->flags & SD_SERIALIZE;
+ if (need_serialize) {
+ if (!spin_trylock(&balancing))
+ goto lockout;
+ }
+
group = find_busiest_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11434,6 +11442,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_serialize)
+ spin_unlock(&balancing);
goto redo;
}
goto out_all_pinned;
@@ -11540,7 +11550,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
sd->balance_interval < MAX_PINNED_INTERVAL) ||
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
+
out:
+ if (need_serialize)
+ spin_unlock(&balancing);
+
+lockout:
return ld_moved;
}
@@ -11665,7 +11680,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}
-static DEFINE_SPINLOCK(balancing);
/*
* Scale the max load_balance interval with the number of CPUs in the system.
@@ -11716,7 +11730,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -11741,12 +11755,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (!spin_trylock(&balancing))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -11760,9 +11768,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- spin_unlock(&balancing);
-out:
+
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;