Date: Tue, 5 Mar 2024 16:41:58 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the
 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t
 sched_balance_running' flag



On 3/4/24 3:18 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
> 
>   08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> 

[...]

> 
>  
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + *   is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
> + *   execution, as non-SD_SERIALIZE domains will still be
> + *   load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>  
>  /*

Continuing the discussion about whether this balancing lock is
contended or not.


On a large system (1920 CPUs, 16 NUMA nodes), the cacheline containing the
balancing trylock was observed to be contended, and rebalance_domains showed up in the traces.

So I ran some experiments on a smaller system. This system has 224 CPUs and 6 NUMA nodes.
I added probe points in rebalance_domains. If the lock is not contended, the trylock should
succeed and the hit counts of both probe points should match. If they do not match, there is contention.
Below are the system details and the output of perf probe -L rebalance_domains.

NUMA:                    
  NUMA node(s):          6
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-71
  NUMA node4 CPU(s):     72-111
  NUMA node5 CPU(s):     112-151
  NUMA node6 CPU(s):     152-183
  NUMA node7 CPU(s):     184-223


------------------------------------------------------------------------------------------------------------------
#perf probe -L rebalance_domains
<rebalance_domains@...rikanth/sched_tip/kernel/sched/fair.c:0>
      0  static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
         {
      2         int continue_balancing = 1;
      3         int cpu = rq->cpu;
[...]


     33                 interval = get_sd_balance_interval(sd, busy);

                        need_serialize = sd->flags & SD_SERIALIZE;
     36                 if (need_serialize) {
     37                         if (!spin_trylock(&balancing))
                                        goto out;
                        }

     41                 if (time_after_eq(jiffies, sd->last_balance + interval)) {
     42                         if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                                        /*
                                         * The LBF_DST_PINNED logic could have changed
                                         * env->dst_cpu, so we can't know our idle
                                         * state even if we migrated tasks. Update it.
                                         */
     48                                 idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
     49                                 busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
                                }
     51                         sd->last_balance = jiffies;
     52                         interval = get_sd_balance_interval(sd, busy);
                        }
     54                 if (need_serialize)
     55                         spin_unlock(&balancing);
         out:
     57                 if (time_after(next_balance, sd->last_balance + interval)) {
                                next_balance = sd->last_balance + interval;
                                update_next_balance = 1;
                        }
                }

perf probe --list
  probe:rebalance_domains_L37 (on rebalance_domains+856)
  probe:rebalance_domains_L55 (on rebalance_domains+904)
------------------------------------------------------------------------------------------------------------------

Perf records were collected for 10 seconds under different system loads. Load was created using stress-ng.
Contention is calculated as (1 - L55/L37) * 100, where L37 counts trylock attempts and L55 counts unlocks.

system is idle:  		<--	No contention
1K probe:rebalance_domains_L37
1K probe:rebalance_domains_L55


system is at 25% load: 		<-- 	4.4% contention
223K probe:rebalance_domains_L37: 1 chunks LOST!
213K probe:rebalance_domains_L55: 1 chunks LOST!



system is at 50% load		<--	12.5% contention
168K probe:rebalance_domains_L37
147K probe:rebalance_domains_L55


system is at 75% load		<-- 	25.6% contention
113K probe:rebalance_domains_L37
84K probe:rebalance_domains_L55

system is at 100% load		<--	87.5% contention.
64K probe:rebalance_domains_L37
8K probe:rebalance_domains_L55
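
As a cross-check of the arithmetic, the contention formula can be written as a small C helper. This is only a sketch: the function name contention_pct and its parameters are made up for illustration, and it assumes that every hit on the unlock probe corresponds to one earlier successful trylock.

```c
#include <assert.h>

/*
 * Contention percentage from two probe hit counts.
 * attempts: hits on the trylock probe (L37 in the listing).
 * acquired: hits on the unlock probe (L55), i.e. successful trylocks.
 * Returns 0.0 when there were no attempts at all.
 * (Hypothetical helper, not part of the kernel or of perf.)
 */
static double contention_pct(unsigned long attempts, unsigned long acquired)
{
	if (attempts == 0)
		return 0.0;
	/* Assumption: an unlock only happens after a successful trylock. */
	assert(acquired <= attempts);
	return (1.0 - (double)acquired / (double)attempts) * 100.0;
}
```

With the rounded counts from the 100% load run (64K and 8K), contention_pct(64, 8) evaluates to 87.5, and the 50% load run (168K and 147K) gives 12.5.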


A few possible reasons for the contention:
1. Idle load balance is running while some other CPU is becoming idle and tries newidle_balance.
2. When the system is busy, every CPU does busy balancing and contends for the lock. A CPU that gets the lock may not
   balance at all, because should_we_balance says this CPU need not balance. It bails out and releases the lock.
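
To make the trylock semantics under discussion concrete, here is a userspace sketch using C11 atomics. It is not the kernel code: the names balance_running, balance_try_begin and balance_end are hypothetical, the two counters only mimic the L37 and L55 probe points, and the compare-and-exchange from 0 to 1 with acquire ordering is an assumption about how such a flag would typically be taken (the elided part of the patch is not reproduced here).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the serialization flag: 0 = free, 1 = held. */
static atomic_int balance_running = 0;

/* Counters that play the role of the two probe points. */
static atomic_ulong trylock_attempts;	/* like probe L37 */
static atomic_ulong unlocks;		/* like probe L55 */

/* Try to start a serialized pass; never spins, returns false if busy. */
static bool balance_try_begin(void)
{
	int expected = 0;

	atomic_fetch_add_explicit(&trylock_attempts, 1, memory_order_relaxed);
	/* Succeeds only if the flag was 0; a failed attempt is simply skipped. */
	return atomic_compare_exchange_strong_explicit(&balance_running,
			&expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

/* End a pass that was started by a successful balance_try_begin(). */
static void balance_end(void)
{
	assert(atomic_load_explicit(&balance_running, memory_order_relaxed) == 1);
	atomic_fetch_add_explicit(&unlocks, 1, memory_order_relaxed);
	atomic_store_explicit(&balance_running, 0, memory_order_release);
}
```

A caller that gets false from balance_try_begin() moves on without retrying, so the gap between trylock_attempts and unlocks grows with contention, which is what the probe counts above measure. The flag word is still written by every successful caller, so this sketch says nothing about whether cacheline contention goes away.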
