Message-ID: <ZfA1LRq1d2ueoSRm@gmail.com>
Date: Tue, 12 Mar 2024 11:57:49 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the
'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t
sched_balance_running' flag

* Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:

> > I think we should probably do something about this contention on this
> > large system: especially if #2 'no work to be done' bailout is the
> > common case.
>
>
> I have been wondering whether it would be right to move this balancing
> trylock/atomic after should_we_balance() (swb). This significantly reduces
> the number of times the lock is checked/updated. Contention is still
> present; that's possible at higher utilization when there are multiple NUMA
> domains. One CPU in each NUMA domain can contend if their invocations are
> aligned.

Note that it's not true contention: it simply means there are overlapping
requests for the highest domains to be balanced, for which we only have a
single thread of execution at a time, system-wide.

> That makes sense, since right now a CPU takes the lock, checks if it can
> balance, does the balance if yes, and then releases the lock. If the lock is
> taken after swb, the CPU still checks if it can balance, then
> tries to take the lock, and releases the lock if it took it. If the lock is
> contended, it bails out of load_balance(). That is the current behaviour as
> well, or I am completely wrong.
>
> Not sure in which scenarios that would hurt. We could do this after this
> series. This may need wider functional testing to make sure we don't
> regress badly in some cases. This is only an *idea* as of now.
>
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224-CPU, 6-NUMA-node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 5K probe:load_balance_L35 <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274 <<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct rq *busiest;
> struct rq_flags rf;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> + int need_serialize;
> struct lb_env env = {
> .sd = sd,
> .dst_cpu = this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (!spin_trylock(&balancing))
> + goto lockout;
> + }
> +
> group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that
it allows the should_we_balance() check to be executed in parallel, and
will only try to take the NUMA-balancing flag if that function indicates an
imbalance.

Since should_we_balance() isn't taking any locks AFAICS, this might be a
valid approach. What might make sense is to instrument the percentage of
NUMA-balancing flag-taking 'failures' vs. successful attempts - not
necessarily the 'contention percentage'.
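
As a rough userspace illustration of that ordering (run the lock-free check first, then try to take the flag) and of counting flag-taking 'failures' vs. successes, here is a minimal sketch. It uses C11 atomics rather than the kernel's atomic_t API, and every name in it (balance_try_begin(), balance_end(), flag_taken, flag_busy, serialized_pass()) is hypothetical and not taken from the patch; the should_balance argument merely stands in for the result of should_we_balance().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a system-wide 'sched_balance_running' flag. */
static atomic_int balance_running;

/* Hypothetical counters: attempts that took the flag vs. found it busy. */
static atomic_ulong flag_taken;
static atomic_ulong flag_busy;

/* Try to take the flag without spinning; returns true on success. */
static bool balance_try_begin(void)
{
    int expected = 0;

    if (atomic_compare_exchange_strong_explicit(&balance_running,
                                                &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
        atomic_fetch_add_explicit(&flag_taken, 1, memory_order_relaxed);
        return true;
    }
    atomic_fetch_add_explicit(&flag_busy, 1, memory_order_relaxed);
    return false;
}

/* Release the flag; only valid after a successful balance_try_begin(). */
static void balance_end(void)
{
    assert(atomic_load_explicit(&balance_running,
                                memory_order_relaxed) == 1);
    atomic_store_explicit(&balance_running, 0, memory_order_release);
}

/* Percentage of attempts that found the flag busy (0 if no attempts). */
static unsigned long flag_busy_percent(void)
{
    unsigned long ok = atomic_load(&flag_taken);
    unsigned long busy = atomic_load(&flag_busy);

    return (ok + busy) ? busy * 100 / (ok + busy) : 0;
}

/* One serialized pass: bail out before touching the flag if possible. */
static bool serialized_pass(bool should_balance)
{
    if (!should_balance)
        return false;       /* 'no work' bailout, flag untouched */
    if (!balance_try_begin())
        return false;       /* another pass is already running */
    /* the actual balancing work would run here */
    balance_end();
    return true;
}
```

With this shape, a pass that bails out early never shows up in either counter, so flag_busy_percent() reflects only the attempts that really wanted to balance.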

But another question is, why do we get here so frequently, so that the
cumulative execution time of these SD_SERIALIZE rebalance passes exceeds that
of 100% of single CPU time? I.e. a single CPU is basically continuously
scanning the scheduler data structures for imbalances, right? That doesn't
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIALIZE pass so large
that we exceed 100% CPU time?
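
Both questions reduce to the same arithmetic: the serialized CPU time consumed per second is the number of passes per second multiplied by the average pass duration, and either factor can push the product past one CPU. A minimal sketch of that calculation, assuming a hypothetical helper serialized_cpu_percent() whose inputs would have to come from real measurements (none are given in this thread):

```c
#include <stdint.h>

/*
 * Percentage of one CPU's time spent in serialized passes.
 * passes_per_sec and avg_pass_ns are hypothetical measured inputs.
 * One second is 1,000,000,000 ns, so percent = total_ns / 10,000,000.
 */
static uint64_t serialized_cpu_percent(uint64_t passes_per_sec,
                                       uint64_t avg_pass_ns)
{
    return passes_per_sec * avg_pass_ns / 10000000ULL;
}
```

For example, 50,000 passes per second at 20,000 ns each would already amount to 100% of a single CPU, as would 1,000 passes per second at 1,000,000 ns each; these figures are purely illustrative.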

Thanks,

Ingo