linux-kernel - Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU based on sum of util

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220314125657.GA30418@chenyu5-mobl1>
Date:   Mon, 14 Mar 2022 20:56:57 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Abel Wu <wuyun.abel@...edance.com>
Cc:     linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mel Gorman <mgorman@...e.de>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Barry Song <21cnbao@...il.com>,
        Barry Song <song.bao.hua@...ilicon.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Len Brown <len.brown@...el.com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Aubrey Li <aubrey.li@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v2][RFC] sched/fair: Change SIS_PROP to search idle CPU
 based on sum of util_avg

Hi Abel,
On Mon, Mar 14, 2022 at 12:53:54PM +0800, Abel Wu wrote:
> Hi Chen,
> 
> 在 3/10/22 8:52 AM, Chen Yu 写道:
> > [Problem Statement]
> > Currently select_idle_cpu() uses the percpu average idle time to
> > estimate the total LLC domain idle time, and calculate the number
> > of CPUs to be scanned. This might be inconsistent because idle time
> > of a CPU does not necessarily correlate with idle time of a domain.
> > As a result, the load could be underestimated and causes over searching
> > when the system is very busy.
> > 
> > The following histogram is the time spent in select_idle_cpu(),
> > when running 224 instance of netperf on a system with 112 CPUs
> > per LLC domain:
> > 
> > @usecs:
> > [0]                  533 |                                                    |
> > [1]                 5495 |                                                    |
> > [2, 4)             12008 |                                                    |
> > [4, 8)            239252 |                                                    |
> > [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
> > [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> > [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> > [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> > [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
> > [512, 1K)        2600472 |@@@@@@@@@                                           |
> > [1K, 2K)          927912 |@@@                                                 |
> > [2K, 4K)          218720 |                                                    |
> > [4K, 8K)           98161 |                                                    |
> > [8K, 16K)          37722 |                                                    |
> > [16K, 32K)          6715 |                                                    |
> > [32K, 64K)           477 |                                                    |
> > [64K, 128K)            7 |                                                    |
> > 
> > netperf latency:
> > =======
> > case            	load    	    Lat_99th	    std%
> > TCP_RR          	thread-224	      257.39	(  0.21)
> > UDP_RR          	thread-224	      242.83	(  6.29)
> > 
> > The netperf 99th latency(usec) above is comparable with the time spent in
> > select_idle_cpu(). That is to say, when the system is overloaded, searching
> > for idle CPU could be a bottleneck.
> > 
> > [Proposal]
> > The main idea is to replace percpu average idle time with the domain
> > based metric. Choose average CPU utilization(util_avg) as the candidate.
> > In general, the number of CPUs to be scanned should be inversely
> > proportional to the sum of util_avg in this domain. That is, the lower
> > the util_avg is, the more select_idle_cpu() should scan for idle CPU,
> > and vice versa. The benefit of choosing util_avg is that, it is a metric
> > of accumulated historic activity, which seems to be more accurate than
> > instantaneous metrics(such as rq->nr_running).
> > 
> > Furthermore, borrow the util_avg from periodic load balance,
> > which could offload the overhead of select_idle_cpu().
> > 
> > According to last discussion[1], introduced the linear function
> > for experimental purpose:
> 
> It would be better if you can prove it's a linear model by the
> SIS efficiency statistics :)
>
Thanks for your comments!
 
Good point, the SIS efficiency could be used to verify if the real
search number fitting this linear mode well, I'll collect this statistics.
But TBH I'm not sure whether there is a convergent/accurate mapping
between the sum of util_avg and the number of CPUs to be scanned(unless we use
sum_util/pelt_scan_cost to approach it). Choosing a model seems to always be a
heuristic search policy.
> > 
> > f(x) = a - bx
> > 
> >       llc_size
> > x = \Sum      util_avg[cpu] / llc_cpu_capacity
> >       1
> > f(x) is the number of CPUs to be scanned, x is the sum util_avg.
> > To decide a and b, the following condition should be met:
> > 
> > [1] f(0) = llc_size
> > [2] f(x) = 4,  x >= 50%
> > 
> > +	 * newidle balance.
> > +	 */
> > +	if (env->idle == CPU_NEWLY_IDLE || env->sd->span_weight != llc_size)
> 
> So nr_scan will probably be updated at llc-domain-lb-interval, which
> is llc_size milliseconds. Since load can be varied a lot during such
> a period, would this brought accuracy issues?
>
I agree there might be delay in reflecting the latest utilization.
The sum_util calculated by periodic load balance after 112ms would be
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%.
But consider that this is a server platform, I have an impression that
the CPU utilization jitter during a small period of time is not a regular
scenario? It seems to be a trade-off. Checking the util_avg in newidle
load balance path would be more frequent, but it also brings overhead -
multiple CPUs write/read the per-LLC shared variable and introduces cache
false sharing. But to make this more robust, maybe we can add time interval
control in newidle load balance too.

thanks,
Chenyu


> Best regards
> Abel