linux-kernel - Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle

Message-ID: <86f761a4-9805-c704-9c23-ec96065fa389@amd.com>
Date:   Thu, 14 Sep 2023 09:43:52 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in
 select_idle_cpu()

Hello Chenyu,

On 9/13/2023 8:27 AM, Chen Yu wrote:
> On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/12/2023 6:02 PM, Chen Yu wrote:
>>> [..snip..]
>>>
>>>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
>>>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>>>> the wakee's previous CPU. We can only ensure that other wakees do not
>>>>> occupy the previous CPU, but we do not increase the chance that
>>>>> wake_affine_idle() chooses the previous CPU.
>>>>
>>>> Correct me if I'm wrong here,
>>>>
>>>> Say a short sleeper is always woken up with the WF_SYNC flag. When the
>>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
>>>> and restrict any wakeup from targeting it until the "cache_hot_timeout"
>>>> is crossed. Let us assume a perfect world where the task wakes up before
>>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>>>> this while for the short sleeper, but since the wakeup bears the WF_SYNC
>>>> flag, the whole reservation is ignored and the waker's LLC is explored.
>>>>
>>>
>>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
>>> forces the short sleeping wakee to be woken up on the waker's CPU rather than
>>> the wakee's previous CPU, but the wakee's previous CPU has been marked as
>>> cache hot for nothing?
>>
>> Precisely :)
>>
>>>
>>>> Should the timeout be cleared if the wakeup decides to not target the
>>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>>>> small enough to curb any side effect that could possibly show here, but
>>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>>>> a larger value, the wakeup path might be affected, since a lot of idle
>>>> targets are overlooked while the CPUs stay marked cache-hot for a longer
>>>> duration.)
>>>>
>>>> Let me know what you think.
>>>>
>>>
>>> This makes sense. In theory the above logic can be added in
>>> select_idle_sibling(), if target CPU is chosen rather than
>>> the previous CPU, the previous CPU's cache hot flag should be
>>> cleared.
>>>
>>> But this might bring overhead. Because we need to grab the rq
>>> lock and write to other CPU's rq, which could be costly. It
>>> seems to be a trade-off of current implementation.
>>
>> I agree, it will not be pretty. Maybe the other way is to have a
>> history of the type of wakeup the task experiences (similar to
>> wakee_flips but for sync and non-sync wakeups) and only reserve
>> the CPU if the task wakes up more via non-sync wakeups? Thinking
>> out loud here.
>>
> 
> Considering the task's attributes looks good, or maybe something
> like this:
> 
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> 	p->burst_sleep_avg >>= 1;
> 
> So the duration of the reservation could be shortened.

That seems like a good approach.
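
To make the idea concrete, here is a minimal userspace sketch of how the
halving would interact with the reservation. It is only a model under
stated assumptions: "burst_sleep_avg" and "cache_hot_timeout" are the names
used in this thread, but the struct layouts, the helper names and the way
the timeout is derived from the average are hypothetical and not taken
from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types; not the kernel's task_struct or rq. */
struct task_model {
	uint64_t burst_sleep_avg;	/* average short-sleep duration, in ns */
};

struct cpu_model {
	uint64_t cache_hot_timeout;	/* absolute time (ns) the reservation ends */
};

/*
 * On dequeue: reserve the previous CPU for roughly as long as the task
 * usually sleeps (assumed relation between the two fields).
 */
static void reserve_prev_cpu(struct cpu_model *cpu,
			     const struct task_model *p, uint64_t now)
{
	cpu->cache_hot_timeout = now + p->burst_sleep_avg;
}

/* A CPU is skipped by other wakees while its reservation is pending. */
static bool cpu_is_cache_hot(const struct cpu_model *cpu, uint64_t now)
{
	return now < cpu->cache_hot_timeout;
}

/*
 * On wakeup: if the wakee did not return to prev_cpu, halve the average
 * so that the next reservation made on its behalf is shorter.
 */
static void account_wakeup(struct task_model *p, int prev_cpu, int new_cpu)
{
	if (new_cpu != prev_cpu)
		p->burst_sleep_avg >>= 1;
}
```

With this, a task that keeps being pulled to the waker's CPU (for example
by WF_SYNC wakeups) sees its reservation window shrink geometrically,
without having to write to the previous CPU's rq from the wakeup path.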

Meanwhile, here are the results for the current series without any
modifications:

tl;dr

- There seems to be a noticeable increase in hackbench runtime with a
  single group but big gains beyond that. The regression could possibly
  be because of the added searching, but let me do some digging to
  confirm that.

- Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
  and DeathStarBench (high utilization).

- Other benchmarks are more or less perf neutral with the changes.

More information below:

o System information

  - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
  - NPS1 mode (each socket is a NUMA node)
  - Boost Enabled
  - C2 disabled (MWAIT based C1 is still enabled)


o Kernel information

base		:   tip:sched/core at commit b41bbb33cf75 ("Merge branch
		    'sched/eevdf' into sched/core")
		  + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
		    wakeup-preemption")

SIS_CACHE	:   base
		  + this series as is


o Benchmark results
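
A note on reading the "[pct imp]" columns: a positive value means
SIS_CACHE did better than base, whichever direction is "better" for that
test. The sketch below shows the sign convention I assume when reading
them; the helper name is hypothetical, and the normalized tables were
generated from raw data, so recomputing from the rounded values shown
will not match to the last digit.

```c
#include <stdbool.h>

/*
 * Percentage improvement of "val" over "base".
 * For lower-is-better metrics (runtime, latency) a smaller value is an
 * improvement; for higher-is-better metrics (throughput) a larger one is.
 */
static double pct_imp(double base, double val, bool lower_is_better)
{
	double delta = lower_is_better ? base - val : val - base;

	return 100.0 * delta / base;
}
```

For example, the raw ycsb-mongodb throughput numbers (309589.33 for base,
304931.33 for SIS_CACHE, higher is better) give roughly -1.50 with this
helper.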

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
 2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
 4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
 8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
    1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
    2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
    4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
    8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
   16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
   32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
   64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
  128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
  256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
  512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
 1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
  Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
  Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
 2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
 4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
 8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
128-clients    1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
256-clients    1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
512-clients    1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)


==================================================================
Test          : schbench (old)
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
  1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
  2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
  4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
  8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
 16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
 32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
 64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)


==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
Metric		variant                      base		     SIS_CACHE
Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41485503.82 (   0.57%)
Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2841236.43 *   4.29%*
Amean     unixbench-syscall-512            7790656.20 (   0.00%)     7631558.00 *   2.04%*
Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2598208.16 *   2.47%*
Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   368566373.76 *   1.99%*
Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4551.67 (   1.01%)
Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69264.30 (  -0.17%)
Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3822.67 (  -0.05%)
Hmean     unixbench-execl-512                12288.64 (   0.00%)       11728.12 (  -4.56%)


==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
base            : 309589.33 (var: 1.41%) 
SIS_CACHE       : 304931.33 (var: 1.29%) [diff: -1.50%]


==================================================================
Test          : DeathStarBench
Units         : Normalized Throughput, relative to base
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Pinning         base     SIS_CACHE
1 CCD           100%      99.18% [%diff: -0.82%]
2 CCD           100%      97.46% [%diff: -2.54%]
4 CCD           100%      97.22% [%diff: -2.78%]
8 CCD           100%      99.01% [%diff: -0.99%]

--

The regression observed could be due either to the longer search time to
find a non cache-hot idle CPU, or perhaps just to the longer search time
in general adding to utilization and curbing the SIS_UTIL limits further.
I'll go gather some stats to back my suspicion (particularly for
hackbench).

> 
> [..snip..]
 
--
Thanks and Regards,
Prateek