linux-kernel - Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags


Message-ID: <Zlygeqy+SVs1KZYW@chenyu5-mobl2>
Date: Mon, 3 Jun 2024 00:40:26 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>,
	Vinicius Gomes <vinicius.gomes@...el.com>
Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load
 balancer's flags

On 2024-05-31 at 13:54:52 -0700, Tim Chen wrote:
> We observed that the overhead of trigger_load_balance(), now renamed
> sched_balance_trigger(), rises with a system's core count.
> 
> For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
> with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
> trigger_load_balance(). On older systems with fewer cores/socket, this
> function's overhead was less than 0.1%.
> 
> The cause of this overhead is that multiple cpus call kick_ilb(flags),
> each posting the balancing work needed to a common idle load balancer
> cpu. The ilb_cpu's flags field got updated unconditionally with
> atomic_fetch_or().  The atomic reads and writes to ilb_cpu's flags
> cause heavy cache bouncing and cpu cycle overhead. This is seen in the
> annotated profile below.
> 
>              kick_ilb():
>              if (ilb_cpu < 0)
>                test   %r14d,%r14d
>              ↑ js     6c
>              flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
>                mov    $0x2d600,%rdi
>                movslq %r14d,%r8
>                mov    %rdi,%rdx
>                add    -0x7dd0c3e0(,%r8,8),%rdx
>              arch_atomic_read():
>   0.01         mov    0x64(%rdx),%esi
>  35.58         add    $0x64,%rdx
>              arch_atomic_fetch_or():
> 
>              static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
>              {
>              int val = arch_atomic_read(v);
> 
>              do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
>   0.03  157:   mov    %r12d,%ecx
>              arch_atomic_try_cmpxchg():
>              return arch_try_cmpxchg(&v->counter, old, new);
>   0.00         mov    %esi,%eax
>              arch_atomic_fetch_or():
>              do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
>                or     %esi,%ecx
>              arch_atomic_try_cmpxchg():
>              return arch_try_cmpxchg(&v->counter, old, new);
>   0.01         lock   cmpxchg %ecx,(%rdx)
>  42.96       ↓ jne    2d2
>              kick_ilb():
> 
> With instrumentation, we found that 81% of the updates do not result in
> any change in the ilb_cpu's flags.  That is, multiple cpus are asking
> the ilb_cpu to do the same things over and over again, before the ilb_cpu
> has a chance to run NOHZ load balance.
> 
> Skip updates to ilb_cpu's flags if no new work needs to be done.
> Such updates do not change ilb_cpu's NOHZ flags.  This requires an extra
> atomic read but it is less expensive than frequent unnecessary atomic
> updates that generate cache bounces.

One problem is that many CPUs choose the same ilb_cpu and ask it to trigger
the nohz idle balance, because find_new_ilb() always picks the first nohz
idle CPU. I wonder if we could change the
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
into
for_each_cpu_wrap(ilb_cpu,  cpumask_and(nohz.idle_cpus_mask, hk_mask), this_cpu+1)
so that different CPUs might find different ilb_cpus.
Then the extra atomic read could cause fewer cache bounces.
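
To illustrate the wrap-around idea outside the kernel, here is a rough
userspace sketch. It is only an assumption-laden model: a 64-bit word stands
in for a cpumask, and find_ilb_wrap() and NR_CPUS_SKETCH are hypothetical
names, not kernel symbols. It scans the AND of the idle and housekeeping
masks starting at a caller-supplied CPU and wraps around, so callers with
different starting points tend to pick different idle CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
#define NR_CPUS_SKETCH 64

/*
 * Return the first CPU set in (idle_mask & hk_mask), scanning from 'start'
 * and wrapping around to CPU 0, or -1 if no candidate exists.
 * 'start' plays the role of this_cpu + 1 in the suggestion above.
 */
static int find_ilb_wrap(uint64_t idle_mask, uint64_t hk_mask, int start)
{
	uint64_t candidates = idle_mask & hk_mask;

	assert(start >= 0);

	for (int i = 0; i < NR_CPUS_SKETCH; i++) {
		int cpu = (start + i) % NR_CPUS_SKETCH;

		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

With idle CPUs 2, 10 and 40, a caller on CPU 0 would pick CPU 2, a caller on
CPU 2 would pick CPU 10, and a caller on CPU 40 would wrap around to CPU 2.
Whether this spreads the kicks enough in practice would need measurement.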

> 
> We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
> 
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
>  	if (ilb_cpu < 0)
>  		return;
>  
> +	/*
> +	 * Don't bother if no new NOHZ balance work items for ilb_cpu,
> +	 * i.e. all bits in flags are already set in ilb_cpu.
> +	 */
> +	if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)

Maybe also mention in the comment that when the above condition is true, the
current ilb_cpu's flags are non-zero and within NOHZ_KICK_MASK, so returning
directly here is safe (anyway, just my 2 cents).
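
For reference, the check-before-update pattern can be modelled in userspace
with C11 atomics. This is a minimal sketch under stated assumptions, not the
kernel code: the SK_* flag values and request_kick() are hypothetical, and
the "first setter owns the kick" rule is taken from the comment in the
quoted hunk. It shows why the early return is safe as long as the requested
flags are non-zero: if every requested bit is already set, the word is
non-zero, so some earlier caller already owns the kick.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define SK_BALANCE_KICK 0x1u
#define SK_STATS_KICK   0x2u
#define SK_KICK_MASK    (SK_BALANCE_KICK | SK_STATS_KICK)

/*
 * Returns true if this caller set the first kick flag and so must deliver
 * the kick; false if the request was skipped or another caller owns it.
 */
static bool request_kick(atomic_uint *ilb_flags, unsigned int flags)
{
	unsigned int old;

	/* The early return below relies on a non-zero request within the mask. */
	assert(flags != 0 && (flags & ~SK_KICK_MASK) == 0);

	/* Plain load: no write to the shared word when nothing new is asked. */
	if ((atomic_load(ilb_flags) & flags) == flags)
		return false;

	/* Something new to add: fall back to the read-modify-write. */
	old = atomic_fetch_or(ilb_flags, flags);
	return (old & SK_KICK_MASK) == 0;
}
```

Only the first caller that finds the word empty gets true back; repeated
requests for bits that are already set return early without writing.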

Reviewed-by: Chen Yu <yu.c.chen@...el.com>

thanks,
Chenyu

> +		return;
> +
>  	/*
>  	 * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
>  	 * the first flag owns it; cleared by nohz_csd_func().
> -- 
> 2.32.0
>