Message-ID: <258c9412-445a-04f5-dd0d-1f6699a257b6@amd.com>
Date: Wed, 27 Sep 2023 09:53:13 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: David Vernet <void@...ifault.com>
Cc: linux-kernel@...r.kernel.org, peterz@...radead.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
roman.gushchin@...ux.dev, gautham.shenoy@....com,
aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Hello David,
Some more test results (although these might be slightly less relevant
with the next version around the corner):
On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>
> Hi Prateek,
>
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>> - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>> - 10.94% native_queued_spin_lock_slowpath
>> - 10.73% _raw_spin_lock
>> - 9.57% __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>> - 0.82% task_rq_lock
>> newidle_balance
>> pick_next_task_fair
>> __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>>
>> Since David mentioned the rq->avg_idle check is probably not the right
>> step towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>> flag signals that one or more of the rqs covered by the shard's domain
>> may have a queued task. The shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when the shard's list is
>> found to be empty.
>
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.
>
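For reference, the per-shard "overload" flag quoted above boils down to
roughly the following (a minimal sketch against the v3 structures; the
field and helper names here are my shorthand and not the exact code from
the RFC):

    struct shared_runq_shard {
            struct list_head list;
            raw_spinlock_t   lock;
            atomic_t         overload;  /* hint: shard may have queued tasks */
    };

    /* Enqueue path: set alongside rq->rd->overload. */
    static void shared_runq_push(struct shared_runq_shard *shard,
                                 struct task_struct *p)
    {
            raw_spin_lock(&shard->lock);
            list_add_tail(&p->shared_runq_node, &shard->list);
            raw_spin_unlock(&shard->lock);

            /* Avoid dirtying the cacheline if the flag is already set. */
            if (!atomic_read(&shard->overload))
                    atomic_set(&shard->overload, 1);
    }

    /* Pull path: bail out early; clear the flag on an empty list. */
    static struct task_struct *shared_runq_pop(struct shared_runq_shard *shard)
    {
            struct task_struct *p = NULL;

            if (!atomic_read(&shard->overload))
                    return NULL;        /* skip taking shard->lock entirely */

            raw_spin_lock(&shard->lock);
            if (list_empty(&shard->list)) {
                    atomic_set(&shard->overload, 0);
            } else {
                    p = list_first_entry(&shard->list, struct task_struct,
                                         shared_runq_node);
                    list_del_init(&p->shared_runq_node);
            }
            raw_spin_unlock(&shard->lock);

            return p;
    }

The read-before-set mirrors how "rq->rd->overload" is updated, so the
flag's cacheline only bounces when the flag actually changes state.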
I have some more numbers. This time I'm accounting for the cost of
peeking into the shared-runq and have two variants - one that keeps the
current vanilla flow from your v3, and another that moves the
rq->avg_idle check before looking at the shared-runq. Following are the
results:
-> Without EEVDF
o tl;dr
- With the avg_idle check, the improvements observed with shared-runq
  aren't as large, but they are still noticeable.
- Most regressions are gone, and the remaining ones aren't as bad with
  the introduction of shared-runq.
o Kernels
base : tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
shared_runq : base + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
(the rd->overload check still remains below the shared_runq access; a sketch of this ordering follows below)
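To make the shared_runq_idle_check ordering concrete, the change amounts
to roughly the following in newidle_balance() (a sketch only; the
surrounding code is abbreviated, and the shared_runq_pick_next_task()
signature is approximated from the v3 series):

    /* vanilla v3 flow (shared_runq): peek into the shard first. */
    if (sched_feat(SHARED_RUNQ)) {
            pulled_task = shared_runq_pick_next_task(this_rq, rf);
            if (pulled_task)
                    return pulled_task;
    }

    if (this_rq->avg_idle < sysctl_sched_migration_cost ||
        !READ_ONCE(this_rq->rd->overload))
            goto out;

    /*
     * reordered flow (shared_runq_idle_check): bail out on low
     * avg_idle before touching the shard at all; the rd->overload
     * check still sits below the shared_runq access.
     */
    if (this_rq->avg_idle < sysctl_sched_migration_cost)
            goto out;

    if (sched_feat(SHARED_RUNQ)) {
            pulled_task = shared_runq_pick_next_task(this_rq, rf);
            if (pulled_task)
                    return pulled_task;
    }

    if (!READ_ONCE(this_rq->rd->overload))
            goto out;

The intent is that a CPU whose avg_idle is below the migration cost
never touches the shard, avoiding the lock contention seen in the
tbench 128-clients profile quoted earlier.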
o Benchmarks
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.64) 0.90 [ 10.20]( 8.79) 0.93 [ 7.08]( 3.87)
2-groups 1.00 [ -0.00]( 2.97) 0.85 [ 15.06]( 4.86) 0.96 [ 4.47]( 2.22)
4-groups 1.00 [ -0.00]( 1.84) 0.93 [ 7.38]( 2.63) 0.94 [ 6.07]( 1.02)
8-groups 1.00 [ -0.00]( 1.24) 0.97 [ 2.83]( 2.69) 0.98 [ 1.82]( 1.01)
16-groups 1.00 [ -0.00]( 3.31) 1.03 [ -2.93]( 2.46) 1.02 [ -1.61]( 1.34)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1 1.00 [ 0.00]( 1.08) 0.98 [ -1.89]( 0.48) 0.99 [ -0.73]( 0.70)
2 1.00 [ 0.00]( 0.69) 0.99 [ -1.48]( 0.24) 0.98 [ -1.62]( 0.85)
4 1.00 [ 0.00]( 0.70) 0.97 [ -2.87]( 1.34) 0.98 [ -2.15]( 0.48)
8 1.00 [ 0.00]( 0.85) 0.97 [ -3.17]( 1.56) 0.99 [ -1.32]( 1.09)
16 1.00 [ 0.00]( 2.18) 0.91 [ -8.70]( 0.27) 0.98 [ -2.03]( 1.28)
32 1.00 [ 0.00]( 3.84) 0.51 [-48.53]( 2.52) 1.01 [ 1.41]( 3.83)
64 1.00 [ 0.00]( 7.06) 0.38 [-62.49]( 1.89) 1.05 [ 5.33]( 4.09)
128 1.00 [ 0.00]( 0.88) 0.41 [-58.92]( 0.28) 1.01 [ 0.54]( 1.65)
256 1.00 [ 0.00]( 0.88) 0.97 [ -2.56]( 1.78) 1.00 [ -0.48]( 0.33)
512 1.00 [ 0.00]( 0.07) 1.00 [ 0.06]( 0.04) 0.98 [ -1.51]( 0.44)
1024 1.00 [ 0.00]( 0.30) 0.99 [ -1.35]( 0.90) 1.00 [ -0.24]( 0.41)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
Copy 1.00 [ 0.00]( 8.87) 1.00 [ 0.31]( 5.27) 1.09 [ 9.11]( 0.58)
Scale 1.00 [ 0.00]( 6.80) 0.99 [ -0.81]( 7.20) 1.00 [ 0.49]( 5.67)
Add 1.00 [ 0.00]( 7.24) 0.99 [ -1.13]( 7.02) 1.02 [ 2.06]( 6.36)
Triad 1.00 [ 0.00]( 5.00) 0.96 [ -4.11]( 9.37) 1.03 [ 3.46]( 4.41)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
Copy 1.00 [ 0.00]( 0.45) 1.00 [ 0.32]( 1.88) 1.04 [ 4.02]( 1.45)
Scale 1.00 [ 0.00]( 4.40) 0.98 [ -1.76]( 6.46) 1.01 [ 1.28]( 1.00)
Add 1.00 [ 0.00]( 4.97) 0.98 [ -1.85]( 6.01) 1.03 [ 2.75]( 0.24)
Triad 1.00 [ 0.00]( 0.24) 0.96 [ -3.82]( 6.41) 0.99 [ -1.10]( 4.47)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.46) 0.98 [ -2.37]( 0.08) 0.99 [ -1.32]( 0.37)
2-clients 1.00 [ 0.00]( 0.75) 0.98 [ -2.04]( 0.33) 0.98 [ -1.57]( 0.50)
4-clients 1.00 [ 0.00]( 0.84) 0.97 [ -3.25]( 1.01) 0.99 [ -0.77]( 0.54)
8-clients 1.00 [ 0.00]( 0.78) 0.96 [ -4.18]( 0.68) 0.99 [ -0.77]( 0.63)
16-clients 1.00 [ 0.00]( 2.56) 0.84 [-15.71]( 6.33) 1.00 [ -0.35]( 0.58)
32-clients 1.00 [ 0.00]( 1.03) 0.35 [-64.92]( 8.90) 0.98 [ -1.92]( 1.67)
64-clients 1.00 [ 0.00]( 2.69) 0.26 [-74.05]( 6.56) 0.98 [ -2.46]( 2.42)
128-clients 1.00 [ 0.00]( 1.91) 0.25 [-74.81]( 3.67) 0.99 [ -1.50]( 2.15)
256-clients 1.00 [ 0.00]( 2.21) 0.92 [ -7.73]( 2.29) 0.98 [ -1.51]( 1.85)
512-clients 1.00 [ 0.00](45.18) 0.96 [ -4.06](52.89) 0.98 [ -2.49](49.22)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1 1.00 [ -0.00](12.03) 1.04 [ -4.35](34.64) 1.13 [-13.04]( 2.25)
2 1.00 [ -0.00]( 9.36) 1.00 [ -0.00]( 4.56) 1.12 [-11.54](12.83)
4 1.00 [ -0.00]( 1.95) 1.00 [ -0.00](13.36) 0.93 [ 6.67]( 9.10)
8 1.00 [ -0.00]( 9.01) 0.97 [ 2.70]( 4.68) 1.03 [ -2.70](12.11)
16 1.00 [ -0.00]( 3.08) 1.02 [ -2.00]( 3.01) 1.00 [ -0.00]( 7.33)
32 1.00 [ -0.00]( 0.75) 1.03 [ -2.60]( 8.20) 1.09 [ -9.09]( 4.24)
64 1.00 [ -0.00]( 2.15) 0.91 [ 9.20]( 1.03) 1.01 [ -0.61]( 7.14)
128 1.00 [ -0.00]( 5.18) 1.05 [ -4.57]( 7.74) 1.01 [ -0.57]( 5.62)
256 1.00 [ -0.00]( 4.18) 1.06 [ -5.87](51.02) 1.10 [ -9.51](15.82)
512 1.00 [ -0.00]( 2.10) 1.03 [ -3.36]( 2.88) 1.06 [ -5.87]( 1.10)
==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================
base shared_runq shared_runq_idle_check
Hmean unixbench-dhry2reg-1 41407024.82 ( 0.00%) 41211208.57 ( -0.47%) 41354094.87 ( -0.13%)
Hmean unixbench-dhry2reg-512 6249629291.88 ( 0.00%) 6245782129.00 ( -0.06%) 6236514875.97 ( -0.21%)
Amean unixbench-syscall-1 2922580.63 ( 0.00%) 2928021.57 * -0.19%* 2895742.17 * 0.92%*
Amean unixbench-syscall-512 7606400.73 ( 0.00%) 8390396.33 * -10.31%* 8236409.00 * -8.28%*
Hmean unixbench-pipe-1 2574942.54 ( 0.00%) 2610825.75 * 1.39%* 2531492.38 * -1.69%*
Hmean unixbench-pipe-512 364489246.31 ( 0.00%) 366388360.22 * 0.52%* 360160487.66 * -1.19%*
Hmean unixbench-spawn-1 4428.94 ( 0.00%) 4391.20 ( -0.85%) 4541.06 ( 2.53%)
Hmean unixbench-spawn-512 68883.47 ( 0.00%) 69143.38 ( 0.38%) 69776.01 * 1.30%*
Hmean unixbench-execl-1 3878.47 ( 0.00%) 3861.63 ( -0.43%) 3873.96 ( -0.12%)
Hmean unixbench-execl-512 11638.84 ( 0.00%) 12758.38 * 9.62%* 14001.23 * 20.30%*
==================================================================
Test : ycsb-mongodb
Units : Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
base : 1.00 (var: 1.41%)
shared_runq : 0.99 (var: 0.84%) (diff: -1.40%)
shared_runq_idle_check : 1.00 (var: 0.79%) (diff: 0.00%)
==================================================================
Test : DeathStarBench
Units : %diff, relative to base
Interpretation: Higher is better
Statistic : AMean
==================================================================
pinning scaling eevdf shared_runq shared_runq_idle_check
1CDD 1 0% -0.39% -0.09%
2CDD 2 0% -0.31% -1.73%
4CDD 4 0% 3.28% 0.60%
8CDD 8 0% 4.35% 2.98%
-> With EEVDF
o tl;dr
- Same as what was observed without EEVDF, but shared_runq now shows
  serious regressions in even more variants of tbench and netperf.
o Kernels
eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
(the rd->overload check still remains below the shared_runq access, same ordering as sketched above)
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.89) 0.95 [ 4.72]( 8.98) 0.99 [ 0.83]( 3.77)
2-groups 1.00 [ -0.00]( 2.04) 0.86 [ 13.87]( 2.59) 0.95 [ 4.92]( 1.98)
4-groups 1.00 [ -0.00]( 2.38) 0.96 [ 4.50]( 3.44) 0.98 [ 2.45]( 1.93)
8-groups 1.00 [ -0.00]( 1.52) 1.01 [ -0.95]( 1.36) 0.96 [ 3.97]( 0.89)
16-groups 1.00 [ -0.00]( 3.44) 1.00 [ -0.00]( 1.59) 0.96 [ 3.91]( 3.36)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1 1.00 [ 0.00]( 0.18) 1.00 [ 0.15]( 0.59) 0.98 [ -1.76]( 0.74)
2 1.00 [ 0.00]( 0.63) 0.97 [ -3.44]( 0.91) 0.98 [ -1.93]( 1.27)
4 1.00 [ 0.00]( 0.86) 0.95 [ -4.86]( 0.85) 0.99 [ -1.15]( 0.77)
8 1.00 [ 0.00]( 0.22) 0.94 [ -6.44]( 1.31) 0.99 [ -1.00]( 0.97)
16 1.00 [ 0.00]( 1.99) 0.86 [-13.68]( 0.38) 1.00 [ -0.47]( 0.99)
32 1.00 [ 0.00]( 4.29) 0.48 [-52.21]( 0.53) 1.01 [ 1.24]( 6.66)
64 1.00 [ 0.00]( 1.71) 0.35 [-64.68]( 0.44) 0.99 [ -0.66]( 0.70)
128 1.00 [ 0.00]( 0.65) 0.40 [-60.32]( 0.95) 0.98 [ -2.15]( 1.25)
256 1.00 [ 0.00]( 0.19) 0.72 [-28.28]( 1.88) 0.99 [ -1.39]( 2.50)
512 1.00 [ 0.00]( 0.20) 0.79 [-20.59]( 4.40) 1.00 [ -0.42]( 0.38)
1024 1.00 [ 0.00]( 0.29) 0.80 [-20.24]( 0.64) 0.99 [ -0.51]( 0.20)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
Copy 1.00 [ 0.00]( 4.32) 0.94 [ -6.40]( 8.05) 1.01 [ 1.39]( 4.58)
Scale 1.00 [ 0.00]( 5.21) 0.98 [ -2.15]( 6.79) 0.95 [ -4.54]( 7.35)
Add 1.00 [ 0.00]( 6.25) 0.97 [ -2.64]( 6.47) 0.97 [ -3.08]( 7.49)
Triad 1.00 [ 0.00](10.74) 1.01 [ 0.92]( 7.40) 1.01 [ 1.25]( 8.76)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
Copy 1.00 [ 0.00]( 0.70) 1.00 [ -0.07]( 0.70) 1.00 [ 0.47]( 0.94)
Scale 1.00 [ 0.00]( 6.55) 1.02 [ 1.72]( 4.83) 1.03 [ 2.96]( 1.00)
Add 1.00 [ 0.00]( 6.53) 1.02 [ 1.53]( 4.77) 1.03 [ 3.19]( 0.90)
Triad 1.00 [ 0.00]( 6.66) 1.00 [ 0.06]( 6.29) 0.99 [ -0.70]( 5.79)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.46) 1.02 [ 1.73]( 0.31) 0.99 [ -0.81]( 0.24)
2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -0.68]( 1.17) 0.99 [ -0.87]( 0.45)
4-clients 1.00 [ 0.00]( 0.72) 0.97 [ -3.38]( 1.38) 0.99 [ -1.26]( 0.47)
8-clients 1.00 [ 0.00]( 0.98) 0.94 [ -6.30]( 1.84) 1.00 [ -0.44]( 0.45)
16-clients 1.00 [ 0.00]( 0.70) 0.56 [-44.08]( 5.11) 0.99 [ -0.83]( 0.49)
32-clients 1.00 [ 0.00]( 0.74) 0.35 [-64.92]( 1.98) 0.98 [ -2.14]( 2.14)
64-clients 1.00 [ 0.00]( 2.24) 0.26 [-73.79]( 5.72) 0.97 [ -2.57]( 2.44)
128-clients 1.00 [ 0.00]( 1.72) 0.25 [-74.91]( 6.72) 0.96 [ -3.66]( 1.48)
256-clients 1.00 [ 0.00]( 4.44) 0.68 [-31.60]( 5.42) 0.96 [ -3.61]( 3.64)
512-clients 1.00 [ 0.00](52.42) 0.67 [-32.81](48.45) 0.96 [ -3.80](55.24)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: eevdf[pct imp](CV) shared_runq[pct imp](CV) shared_runq_idle_check[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 1.00 [ -0.00]( 6.19) 0.84 [ 16.00](20.83)
2 1.00 [ -0.00]( 6.42) 0.89 [ 10.71]( 2.34) 0.96 [ 3.57]( 4.17)
4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.35) 1.00 [ -0.00]( 9.12)
8 1.00 [ -0.00](13.83) 1.03 [ -2.63]( 6.96) 0.95 [ 5.26]( 6.93)
16 1.00 [ -0.00]( 4.37) 1.02 [ -2.13]( 4.17) 1.02 [ -2.13]( 3.53)
32 1.00 [ -0.00]( 8.69) 0.96 [ 3.70]( 5.23) 0.98 [ 2.47]( 4.43)
64 1.00 [ -0.00]( 2.30) 0.96 [ 3.85]( 2.34) 0.92 [ 7.69]( 4.14)
128 1.00 [ -0.00](12.12) 0.97 [ 3.12]( 3.31) 0.93 [ 6.53]( 5.31)
256 1.00 [ -0.00](26.04) 1.87 [-86.57](33.02) 1.63 [-62.73](40.63)
512 1.00 [ -0.00]( 5.62) 1.04 [ -3.80]( 0.35) 1.09 [ -8.78]( 2.56)
==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================
eevdf shared_runq shared_runq_idle_check
Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41245183.04 ( -0.01%) 41297801.58 ( 0.12%)
Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6236534715.56 ( -0.06%) 6237356670.12 ( -0.04%)
Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2893792.10 * 2.52%* 2799609.00 * 5.69%*
Amean unixbench-syscall-512 7790656.20 ( 0.00%) 8489302.67 * -8.97%* 7685974.47 * 1.34%*
Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2554662.39 * 0.75%* 2521853.23 * -0.55%*
Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 365752991.35 * 1.21%* 358310503.28 * -0.85%*
Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4566.00 ( 1.33%) 4242.52 * -5.85%*
Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69554.52 ( 0.25%) 69413.14 ( 0.05%)
Hmean unixbench-execl-1 3824.57 ( 0.00%) 3782.82 * -1.09%* 3832.10 ( 0.20%)
Hmean unixbench-execl-512 12288.64 ( 0.00%) 13248.40 ( 7.81%) 12661.78 ( 3.04%)
==================================================================
Test : ycsb-mongodb
Units : Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
eevdf : 1.00 (var: 1.41%)
shared_runq : 0.98 (var: 0.84%) (diff: -2.40%)
shared_runq_idle_check : 0.97 (var: 0.79%) (diff: -3.06%)
==================================================================
Test : DeathStarBench
Units : %diff, relative to eevdf
Interpretation: Higher is better
Statistic : AMean
==================================================================
pinning scaling eevdf shared_runq shared_runq_idle_check
1CDD 1 0% -0.85% -1.56%
2CDD 2 0% -0.60% -1.22%
4CDD 4 0% 2.87% 0.02%
8CDD 8 0% 0.36% 1.57%
--
Thanks and Regards,
Prateek