linux-kernel - Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle

Message-ID: <ZQLoAQcQJDCrdOGd@chenyu5-mobl2.ccr.corp.intel.com>
Date:   Thu, 14 Sep 2023 19:01:21 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
CC:     Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
        "Dietmar Eggemann" <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        "Daniel Bristot de Oliveira" <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in
 select_idle_cpu()

Hi Prateek,

Thanks for the test.

On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/13/2023 8:27 AM, Chen Yu wrote:
> > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/12/2023 6:02 PM, Chen Yu wrote:
> >>> [..snip..]
> >>>
> >>>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
> >>>>> on the waker's CPU, rather than the wakee's previous CPU, because
> >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> >>>>> wakee's previous CPU. We can only restrict that other wakee does not
> >>>>> occupy the previous CPU, but do not enhance the possibility that
> >>>>> wake_affine_idle() chooses the previous CPU.
> >>>>
> >>>> Correct me if I'm wrong here,
> >>>>
> >>>> Say a short sleeper is always woken up using the WF_SYNC flag. When the
> >>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
> >>>> and restrict any wakeup happening until the "cache_hot_timeout" is
> >>>> crossed. Let us assume a perfect world where the task wakes up before
> >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> >>>> this while for the short sleeper but since the wakeup bears WF_SYNC
> >>>> flag, the whole reservation is ignored and waker's LLC is explored.
> >>>>
> >>>
> >>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> >>> forces the short sleeping wakee to be woken up on the waker's CPU rather than the
> >>> wakee's previous CPU, but the wakee's previous CPU has been marked as cache hot
> >>> for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides to not target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could possibly show here but
> >>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >>>> a larger value, the wakeup path might be affected where a lot of idle
> >>>> targets are overlooked since the CPUs are marked cache-hot for a longer
> >>>> duration)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(), if target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead. Because we need to grab the rq
> >>> lock and write to other CPU's rq, which could be costly. It
> >>> seems to be a trade-off of current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to have a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips but for sync and non-sync wakeups) and only reserve
> >> the CPU if the task wakes up more via non-sync wakeups? Thinking
> >> out loud here.
> >>
> > 
> > This looks good to consider the task's attribute, or maybe something
> > like this:
> > 
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> > So the duration of the reservation could be shrunk.
> 
> That seems like a good approach.
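
A minimal user-space sketch of the decay idea quoted above, for illustration only. It assumes the reservation window is simply "now + burst_sleep_avg"; that formula, `struct task`, `decay_on_migration()` and `reservation_end()` are hypothetical and not taken from the series. Only the `burst_sleep_avg` field name and the halving on `new_cpu != prev_cpu` come from the snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task field named in the snippet. */
struct task {
	uint64_t burst_sleep_avg;	/* average short-sleep duration, in ns */
};

/* Halve the average whenever the wakeup did not land on the previous CPU. */
static void decay_on_migration(struct task *p, int prev_cpu, int new_cpu)
{
	assert(p);
	if (new_cpu != prev_cpu)
		p->burst_sleep_avg >>= 1;
}

/* Assumption: the previous CPU stays reserved until now + burst_sleep_avg. */
static uint64_t reservation_end(const struct task *p, uint64_t now)
{
	assert(p);
	return now + p->burst_sleep_avg;
}
```

With this shape, a task that keeps getting pulled to the waker's CPU (for example by WF_SYNC wakeups) would see its reservation window halve on every such wakeup, so the previous CPU is held for less and less time.
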
> 
> Meanwhile, here is result for the current series without any
> modifications:
> 
> tl;dr
> 
> - There seems to be a noticeable increase in hackbench runtime with a
>   single group but big gains beyond that. The regression could possibly
>   be because of added searching but let me do some digging to confirm
>   that. 

Ah OK. May I have the command to run 1 group hackbench?

> 
> - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
>   and DeathStarBench (High Utilization)
> 
> - Other benchmarks are more or less perf neutral with the changes.
> 
> More information below:
> 
> o System information
> 
>   - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
>   - NPS1 mode (each socket is a NUMA node)
>   - Boost Enabled
>   - C2 disabled (MWAIT based C1 is still enabled)
> 
> 
> o Kernel information
> 
> base		:   tip:sched/core at commit b41bbb33cf75 ("Merge branch
> 		    'sched/eevdf' into sched/core")
> 		  + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
> 		    wakeup-preemption")
> 
> SIS_CACHE	:   base
> 		  + this series as is
> 
> 
> o Benchmark results
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
>  2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
>  4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
>  8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
> 16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>     1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
>     2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
>     4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
>     8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
>    16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
>    32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
>    64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
>   128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
>   256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
>   512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
>  1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
> Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
>   Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
> Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
> Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
>   Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
> Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
>  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
>  4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
>  8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
> 16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
> 32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
> 64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
> 128-clients    1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
> 256-clients    1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
> 512-clients    1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)
> 
> 
> ==================================================================
> Test          : schbench (old)
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>   1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
>   2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
>   4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
>   8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
>  16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
>  32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
>  64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
> 128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
> 256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
> 512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)
> 
> 
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
> Metric		variant                      base		     SIS_CACHE
> Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41485503.82 (   0.57%)
> Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
> Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2841236.43 *   4.29%*
> Amean     unixbench-syscall-512            7790656.20 (   0.00%)     7631558.00 *   2.04%*
> Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2598208.16 *   2.47%*
> Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   368566373.76 *   1.99%*
> Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4551.67 (   1.01%)
> Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69264.30 (  -0.17%)
> Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3822.67 (  -0.05%)
> Hmean     unixbench-execl-512                12288.64 (   0.00%)       11728.12 (  -4.56%)
> 
> 
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> base            : 309589.33 (var: 1.41%) 
> SIS_CACHE       : 304931.33 (var: 1.29%) [diff: -1.50%]
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Pinning         base     SIS_CACHE
> 1 CCD           100%      99.18% [%diff: -0.82%]
> 2 CCD           100%      97.46% [%diff: -2.54%]
> 4 CCD           100%      97.22% [%diff: -2.78%]
> 8 CCD           100%      99.01% [%diff: -0.99%]
> 
> --
> 
> Regression observed could either be because of the larger search time to
> find a non cache-hot idle CPU, or perhaps just the larger search time in
> general adding to utilization and curbing the SIS_UTIL limits further.

Yeah, that is possible. And you also mentioned that we should consider a
cache-hot idle CPU if we cannot find any cache-cold idle CPU; that
might be a better choice than forcibly putting the wakee on the current
CPU, which causes task stacking.
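
A rough user-space sketch of that fallback, not the actual select_idle_cpu() change: scan once, return the first idle cache-cold CPU, and remember the first idle cache-hot CPU in case nothing better shows up. `struct cpu_state`, `cpu_cache_hot()` and `pick_idle_cpu()` are hypothetical names for illustration, and the "hot until a timestamp" check is an assumption modelled on the cache_hot_timeout discussed earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; not the kernel's struct rq. */
struct cpu_state {
	bool idle;
	uint64_t cache_hot_timeout;	/* CPU counts as cache hot until this time (ns) */
};

static bool cpu_cache_hot(const struct cpu_state *c, uint64_t now)
{
	return now < c->cache_hot_timeout;
}

/*
 * Scan nr CPUs starting at target, wrapping around.
 * Returns the first idle cache-cold CPU, else the first idle
 * cache-hot CPU seen during the scan, else -1.
 */
static int pick_idle_cpu(const struct cpu_state *cpus, int nr, int target,
			 uint64_t now)
{
	int fallback = -1;

	assert(nr > 0 && target >= 0 && target < nr);

	for (int i = 0; i < nr; i++) {
		int cpu = (target + i) % nr;

		if (!cpus[cpu].idle)
			continue;
		if (!cpu_cache_hot(&cpus[cpu], now))
			return cpu;
		if (fallback < 0)
			fallback = cpu;	/* idle but cache hot: keep as last resort */
	}

	return fallback;
}
```

The fallback costs no second pass over the CPUs, so it should not add to the search time that is suspected above; whether it actually helps the 1-group hackbench case would still need to be measured.
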

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu