[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2d18453d-9c9b-b57b-1616-d4a9229abd5a@bytedance.com>
Date: Tue, 28 Jun 2022 15:58:55 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search
On 6/27/22 6:13 PM, Abel Wu Wrote:
>
> On 6/24/22 11:30 AM, Chen Yu Wrote:
>>> ...
>>>>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd,
>>>>> struct task_struct *p, int this_cpu)
>>>>> static void sd_update_state(struct lb_env *env, struct
>>>>> sd_lb_stats *sds)
>>>>> {
>>>>> - if (sds->sd_state == sd_has_icpus &&
>>>>> !test_idle_cpus(env->dst_cpu))
>>>>> - set_idle_cpus(env->dst_cpu, true);
>>>>> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>>>>> + enum sd_state new = sds->sd_state;
>>>>> + int this = env->dst_cpu;
>>>>> +
>>>>> + /*
>>>>> + * Parallel updating can hardly contribute accuracy to
>>>>> + * the filter, besides it can be one of the burdens on
>>>>> + * cache traffic.
>>>>> + */
>>>>> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * There is at least one unoccupied cpu available, so
>>>>> + * propagate it to the filter to avoid false negative
>>>>> + * issue which could result in lost tracking of some
>>>>> + * idle cpus thus throughupt downgraded.
>>>>> + */
>>>>> + if (new != sd_is_busy) {
>>>>> + if (!test_idle_cpus(this))
>>>>> + set_idle_cpus(this, true);
>>>>> + } else {
>>>>> + /*
>>>>> + * Nothing changes so nothing to update or
>>>>> + * propagate.
>>>>> + */
>>>>> + if (sd_smt_shared->state == sd_is_busy)
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + sd_update_icpus(this, sds->idle_cpu);
>>>> I wonder if we could further enhance it to facilitate idle CPU scan.
>>>> For example, can we propagate the idle CPUs in smt domain, to its
>>>> parent
>>>> domain in a hierarchic sequence, and finally to the LLC domain. If
>>>> there is
>>>
>>> In fact, it was my first try to cache the unoccupied cpus in SMT
>>> shared domain, but the overhead of cpumask ops seems like a major
>>> stumbling block.
>>>
>>>> a cluster domain between SMT and LLC domain, the cluster domain idle
>>>> CPU filter
>>>> could benefit from this mechanism.
>>>> https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@hisilicon.com/
>>>>
>>>
>>> Putting SIS into a hierarchical pattern is good for cache locality.
>>> But I don't think multi-level filter is appropriate since it could
>>> bring too much cache traffic in SIS,
>> Could you please elaborate a little more about the cache traffic? I
>> thought we
>> don't save the unoccupied cpus in SMT shared domain, but to store it
>> in middle
>> layer shared domain, say, cluster->idle_cpus, this would reduce cache
>> write
>> contention compared to writing to llc->idle_cpus directly, because a
>> smaller
>> set of CPUs share the idle_cpus filter. Similarly, SIS can only scan
>> the cluster->idle_cpus
>> first, without having to query the llc->idle_cpus. This looks like
>> splitting
>> a big lock into fine grain small lock.
>
> I'm afraid I didn't quite follow.. Did you mean replace the LLC filter
> with multiple cluster filters? Then I agree with what you suggested
> that the contention would be reduced. But there are other concerns:
>
> a. Is it appropriate to fake an intermediate sched_domain if
> cluster level doesn't available? How to identify the proper
> size of the faked sched_domain?
>
> b. The SIS path might touch more cachelines (multiple cluster
> filters). I'm not sure how much is the impact.
>
> Whatever, this seems worth a try. :)
>
After a second thought, maybe it's a similar case of enabling SNC?
I benchmarked with SNC disabled, so the LLC is relatively big. This
time I enabled SNC on the same machine mentioned in cover letter, to
make the filter more fine grained. Please see the following result.
a) hackbench-process-pipes
Amean 1 0.4380 ( 0.00%) 0.4250 * 2.97%*
Amean 4 0.6123 ( 0.00%) 0.6153 ( -0.49%)
Amean 7 0.7693 ( 0.00%) 0.7217 * 6.20%*
Amean 12 1.0730 ( 0.00%) 1.0723 ( 0.06%)
Amean 21 1.8540 ( 0.00%) 1.8817 ( -1.49%)
Amean 30 2.8147 ( 0.00%) 2.7297 ( 3.02%)
Amean 48 4.6280 ( 0.00%) 4.4923 * 2.93%*
Amean 79 8.0897 ( 0.00%) 7.8773 ( 2.62%)
Amean 110 10.5320 ( 0.00%) 10.1737 ( 3.40%)
Amean 141 13.0260 ( 0.00%) 12.4953 ( 4.07%)
Amean 172 15.5093 ( 0.00%) 14.3697 * 7.35%*
Amean 203 17.9633 ( 0.00%) 16.7853 * 6.56%*
Amean 234 20.2327 ( 0.00%) 19.2020 * 5.09%*
Amean 265 22.1203 ( 0.00%) 21.3353 ( 3.55%)
Amean 296 24.9337 ( 0.00%) 23.8967 ( 4.16%)
b) hackbench-process-sockets
Amean 1 0.6990 ( 0.00%) 0.6520 * 6.72%*
Amean 4 1.6513 ( 0.00%) 1.6080 * 2.62%*
Amean 7 2.5103 ( 0.00%) 2.5020 ( 0.33%)
Amean 12 4.1470 ( 0.00%) 4.0957 * 1.24%*
Amean 21 7.0823 ( 0.00%) 6.9237 * 2.24%*
Amean 30 9.9510 ( 0.00%) 9.7937 * 1.58%*
Amean 48 15.8853 ( 0.00%) 15.5410 * 2.17%*
Amean 79 26.3313 ( 0.00%) 26.0363 * 1.12%*
Amean 110 36.6647 ( 0.00%) 36.2657 * 1.09%*
Amean 141 47.0590 ( 0.00%) 46.4010 * 1.40%*
Amean 172 57.5020 ( 0.00%) 56.9897 ( 0.89%)
Amean 203 67.9277 ( 0.00%) 66.8273 * 1.62%*
Amean 234 78.3967 ( 0.00%) 77.2137 * 1.51%*
Amean 265 88.5817 ( 0.00%) 87.6143 * 1.09%*
Amean 296 99.4397 ( 0.00%) 98.0233 * 1.42%*
c) hackbench-thread-pipes
Amean 1 0.4437 ( 0.00%) 0.4373 ( 1.43%)
Amean 4 0.6667 ( 0.00%) 0.6340 ( 4.90%)
Amean 7 0.7813 ( 0.00%) 0.8177 * -4.65%*
Amean 12 1.2747 ( 0.00%) 1.3113 ( -2.88%)
Amean 21 2.4703 ( 0.00%) 2.3637 * 4.32%*
Amean 30 3.6547 ( 0.00%) 3.2377 * 11.41%*
Amean 48 5.7580 ( 0.00%) 5.3140 * 7.71%*
Amean 79 9.1770 ( 0.00%) 8.3717 * 8.78%*
Amean 110 11.7167 ( 0.00%) 11.3867 * 2.82%*
Amean 141 14.1490 ( 0.00%) 13.9017 ( 1.75%)
Amean 172 17.3880 ( 0.00%) 16.4897 ( 5.17%)
Amean 203 19.3760 ( 0.00%) 18.8807 ( 2.56%)
Amean 234 22.7477 ( 0.00%) 21.7420 * 4.42%*
Amean 265 25.8940 ( 0.00%) 23.6173 * 8.79%*
Amean 296 27.8677 ( 0.00%) 26.5053 * 4.89%*
d) hackbench-thread-sockets
Amean 1 0.7303 ( 0.00%) 0.6817 * 6.66%*
Amean 4 1.6820 ( 0.00%) 1.6343 * 2.83%*
Amean 7 2.6060 ( 0.00%) 2.5393 * 2.56%*
Amean 12 4.2663 ( 0.00%) 4.1810 * 2.00%*
Amean 21 7.2110 ( 0.00%) 7.0873 * 1.71%*
Amean 30 10.1453 ( 0.00%) 10.0320 * 1.12%*
Amean 48 16.2787 ( 0.00%) 15.9040 * 2.30%*
Amean 79 27.0090 ( 0.00%) 26.5803 * 1.59%*
Amean 110 37.5397 ( 0.00%) 37.1200 * 1.12%*
Amean 141 48.0853 ( 0.00%) 47.7613 * 0.67%*
Amean 172 58.7967 ( 0.00%) 58.2570 * 0.92%*
Amean 203 69.5303 ( 0.00%) 68.8930 * 0.92%*
Amean 234 79.9943 ( 0.00%) 79.5347 * 0.57%*
Amean 265 90.5877 ( 0.00%) 90.1223 ( 0.51%)
Amean 296 101.2390 ( 0.00%) 101.1687 ( 0.07%)
e) netperf-udp
Hmean send-64 202.37 ( 0.00%) 202.46 ( 0.05%)
Hmean send-128 407.01 ( 0.00%) 402.86 * -1.02%*
Hmean send-256 788.50 ( 0.00%) 789.87 ( 0.17%)
Hmean send-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
Hmean send-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
Hmean send-3312 8941.40 ( 0.00%) 8809.25 * -1.48%*
Hmean send-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
Hmean send-8192 17105.63 ( 0.00%) 17323.44 * 1.27%*
Hmean send-16384 28166.17 ( 0.00%) 28191.05 ( 0.09%)
Hmean recv-64 202.37 ( 0.00%) 202.46 ( 0.05%)
Hmean recv-128 407.01 ( 0.00%) 402.86 * -1.02%*
Hmean recv-256 788.50 ( 0.00%) 789.87 ( 0.17%)
Hmean recv-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
Hmean recv-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
Hmean recv-3312 8941.40 ( 0.00%) 8809.23 * -1.48%*
Hmean recv-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
Hmean recv-8192 17105.55 ( 0.00%) 17323.44 * 1.27%*
Hmean recv-16384 28166.03 ( 0.00%) 28191.04 ( 0.09%)
f) netperf-tcp
Hmean 64 838.30 ( 0.00%) 837.61 ( -0.08%)
Hmean 128 1633.25 ( 0.00%) 1653.50 * 1.24%*
Hmean 256 3107.89 ( 0.00%) 3148.10 ( 1.29%)
Hmean 1024 10435.39 ( 0.00%) 10503.81 ( 0.66%)
Hmean 2048 17152.34 ( 0.00%) 17314.40 ( 0.94%)
Hmean 3312 21928.05 ( 0.00%) 21995.97 ( 0.31%)
Hmean 4096 23990.44 ( 0.00%) 24008.97 ( 0.08%)
Hmean 8192 29445.84 ( 0.00%) 29245.31 * -0.68%*
Hmean 16384 33592.90 ( 0.00%) 34096.68 * 1.50%*
g) tbench4 Throughput
Hmean 1 311.15 ( 0.00%) 306.76 * -1.41%*
Hmean 2 619.24 ( 0.00%) 615.00 * -0.68%*
Hmean 4 1220.45 ( 0.00%) 1222.08 * 0.13%*
Hmean 8 2410.93 ( 0.00%) 2413.59 * 0.11%*
Hmean 16 4652.09 ( 0.00%) 4766.12 * 2.45%*
Hmean 32 7809.03 ( 0.00%) 7831.88 * 0.29%*
Hmean 64 9116.92 ( 0.00%) 9171.25 * 0.60%*
Hmean 128 17732.63 ( 0.00%) 20209.26 * 13.97%*
Hmean 256 19603.22 ( 0.00%) 19007.72 * -3.04%*
Hmean 384 19796.37 ( 0.00%) 17396.64 * -12.12%*
There seems like not much difference except hackbench pipe test at
certain groups (30~110). I am intended to provide better scalability
by applying the filter which will be enabled when:
- The LLC is large enough that simply traversing becomes
in-sufficient, and/or
- The LLC is loaded that unoccupied cpus are minority.
But it would be very nice if a more fine grained pattern works well
so we can drop the above constrains.
>
> Thanks & BR,
> Abel
Powered by blists - more mailing lists