Message-ID: <6c626e8d-4133-00ba-a765-bafe08034517@amd.com>
Date: Thu, 29 Sep 2022 22:28:38 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: "Gautham R. Shenoy" <gautham.shenoy@....com>,
Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Tim Chen <tim.c.chen@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Aaron Lu <aaron.lu@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>,
Ingo Molnar <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Choose the CPU where short task is
running during wake up
Hello Gautham and Chenyu,
On 9/26/2022 8:09 PM, Gautham R. Shenoy wrote:
> Hello Prateek,
>
> On Mon, Sep 26, 2022 at 11:20:16AM +0530, K Prateek Nayak wrote:
>
> [..snip..]
>
>>> @@ -6050,7 +6063,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>>> if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>>> return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>>>
>>> - if (sync && cpu_rq(this_cpu)->nr_running == 1)
>>> + if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
>>> + is_short_task(cpu_curr(this_cpu)))
>>
>> This change seems to optimize for affine wakeup which benefits
>> tasks with producer-consumer pattern but is not ideal for Stream.
>> Currently the logic will end up doing an affine wakeup even if the
>> sync flag is not set:
>>
>> stream-4135 [029] d..2. 353.580953: sched_waking: comm=stream pid=4129 prio=120 target_cpu=082
>> stream-4135 [029] d..2. 353.580957: select_task_rq_fair: wake_affine_idle: Select this_cpu: sync(0) rq->nr_running(1) is_short_task(1)
>> stream-4135 [029] d..2. 353.580960: sched_migrate_task: comm=stream pid=4129 prio=120 orig_cpu=82 dest_cpu=30
>> <idle>-0 [030] dNh2. 353.580993: sched_wakeup: comm=stream pid=4129 prio=120 target_cpu=030
>>
>> I believe the sync flag should be taken into consideration when
>> going for an affine wakeup. Also, the check for a short running task
>> could be done at the end, after checking if prev_cpu is an
>> available_idle_cpu.
>
> We need to check if moving the is_short_task() check to a later point,
> after checking the availability of the previous CPU, solves the problem
> for the workloads which showed regressions on AMD EPYC systems.
I've done some testing with moving the condition check for a short
running task to the end of wake_affine_idle() and also checking whether
the run queue length of the current CPU is 1, similar to what Tim
suggested in the thread (where it was done upfront in
wake_affine_idle()). There are a few variations I've tested:
v1: move the check for a short running task on the current CPU to the end of wake_affine_idle
v2: move the check for a short running task on the current CPU to the end of wake_affine_idle
    + remove the entire hunk in select_idle_cpu
v3: move the check for a short running task on the current CPU to the end of wake_affine_idle
    + check if the run queue of the current CPU only has 1 task
v4: move the check for a short running task on the current CPU to the end of wake_affine_idle
    + check if the run queue of the current CPU only has 1 task
    + remove the entire hunk in select_idle_cpu
Adding diff for v3 below:
--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ad8e7183bf2..dad9bfb0248d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6074,13 +6074,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
 		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
-	if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
-	    is_short_task(cpu_curr(this_cpu)))
+	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
 	if (available_idle_cpu(prev_cpu))
 		return prev_cpu;
 
+	if (cpu_rq(this_cpu)->nr_running == 1 && is_short_task(cpu_curr(this_cpu)))
+		return this_cpu;
+
 	return nr_cpumask_bits;
 }
--
Deviations from the above diff in the other versions are as follows:
o v1 and v2 don't check cpu_rq(this_cpu)->nr_running == 1 and only
  move the condition check to the end of wake_affine_idle as:
	if (is_short_task(cpu_curr(this_cpu)))
		return this_cpu;
o The second hunk of changes in select_idle_cpu from the RFC remains
  the same in v1 and v3 but is removed in v2 and v4 to check if it was
  the cause of the pileup seen in case of Hackbench.
Following are the results of the standard benchmarks on a dual socket
Zen3 system (2 x 64C/128T) in NPS1 and NPS4 mode:
~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~
o NPS1
Test: tip v1 v2 v3 v4
1-groups: 4.23 (0.00 pct) 4.21 (0.47 pct) 4.29 (-1.41 pct) 4.02 (4.96 pct) 4.34 (-2.60 pct)
2-groups: 4.93 (0.00 pct) 5.23 (-6.08 pct) 5.20 (-5.47 pct) 4.75 (3.65 pct) 4.77 (3.24 pct)
4-groups: 5.32 (0.00 pct) 5.64 (-6.01 pct) 5.66 (-6.39 pct) 5.13 (3.57 pct) 5.22 (1.87 pct)
8-groups: 5.46 (0.00 pct) 5.92 (-8.42 pct) 5.96 (-9.15 pct) 5.24 (4.02 pct) 5.37 (1.64 pct)
16-groups: 7.31 (0.00 pct) 7.16 (2.05 pct) 7.17 (1.91 pct) 6.65 (9.02 pct) 7.05 (3.55 pct)
o NPS4
Test: tip v1 v2 v3 v4
1-groups: 4.23 (0.00 pct) 4.20 (0.70 pct) 4.37 (-3.30 pct) 4.02 (4.96 pct) 4.23 (0.00 pct)
2-groups: 4.78 (0.00 pct) 5.07 (-6.06 pct) 5.07 (-6.06 pct) 4.60 (3.76 pct) 4.67 (2.30 pct)
4-groups: 5.17 (0.00 pct) 5.47 (-5.80 pct) 5.50 (-6.38 pct) 5.01 (3.09 pct) 5.12 (0.96 pct)
8-groups: 5.63 (0.00 pct) 5.77 (-2.48 pct) 5.84 (-3.73 pct) 5.48 (2.66 pct) 5.47 (2.84 pct)
16-groups: 7.88 (0.00 pct) 6.43 (18.40 pct) 6.60 (16.24 pct) 12.14 (-54.06 pct) 6.51 (17.38 pct) *
16-groups: 10.28 (0.00 pct) 6.62 (35.60 pct) 6.68 (35.01 pct) 8.67 (15.66 pct) 6.96 (32.29 pct) [Verification Run]
~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~
o NPS 1
#workers: tip v1 v2 v3 v4
1: 22.00 (0.00 pct) 33.00 (-50.00 pct) 29.00 (-31.81 pct) 33.00 (-50.00 pct) 32.00 (-45.45 pct)
2: 34.00 (0.00 pct) 34.00 (0.00 pct) 36.00 (-5.88 pct) 37.00 (-8.82 pct) 36.00 (-5.88 pct)
4: 37.00 (0.00 pct) 39.00 (-5.40 pct) 36.00 (2.70 pct) 40.00 (-8.10 pct) 34.00 (8.10 pct)
8: 55.00 (0.00 pct) 43.00 (21.81 pct) 52.00 (5.45 pct) 47.00 (14.54 pct) 55.00 (0.00 pct)
16: 69.00 (0.00 pct) 64.00 (7.24 pct) 65.00 (5.79 pct) 65.00 (5.79 pct) 67.00 (2.89 pct)
32: 113.00 (0.00 pct) 110.00 (2.65 pct) 112.00 (0.88 pct) 106.00 (6.19 pct) 108.00 (4.42 pct)
64: 219.00 (0.00 pct) 200.00 (8.67 pct) 221.00 (-0.91 pct) 214.00 (2.28 pct) 217.00 (0.91 pct)
128: 506.00 (0.00 pct) 509.00 (-0.59 pct) 507.00 (-0.19 pct) 495.00 (2.17 pct) 535.00 (-5.73 pct)
256: 45440.00 (0.00 pct) 44096.00 (2.95 pct) 47296.00 (-4.08 pct) 43968.00 (3.23 pct) 42432.00 (6.61 pct)
512: 76672.00 (0.00 pct) 82304.00 (-7.34 pct) 82304.00 (-7.34 pct) 73088.00 (4.67 pct) 78976.00 (-3.00 pct)
o NPS4
#workers: tip v1 v2 v3 v4
1: 30.00 (0.00 pct) 35.00 (-16.66 pct) 20.00 (33.33 pct) 30.00 (0.00 pct) 34.00 (-13.33 pct)
2: 34.00 (0.00 pct) 35.00 (-2.94 pct) 36.00 (-5.88 pct) 38.00 (-11.76 pct) 37.00 (-8.82 pct)
4: 41.00 (0.00 pct) 39.00 (4.87 pct) 43.00 (-4.87 pct) 39.00 (4.87 pct) 41.00 (0.00 pct)
8: 60.00 (0.00 pct) 64.00 (-6.66 pct) 53.00 (11.66 pct) 52.00 (13.33 pct) 56.00 (6.66 pct)
16: 68.00 (0.00 pct) 68.00 (0.00 pct) 69.00 (-1.47 pct) 71.00 (-4.41 pct) 67.00 (1.47 pct)
32: 116.00 (0.00 pct) 115.00 (0.86 pct) 118.00 (-1.72 pct) 111.00 (4.31 pct) 113.00 (2.58 pct)
64: 224.00 (0.00 pct) 208.00 (7.14 pct) 217.00 (3.12 pct) 224.00 (0.00 pct) 231.00 (-3.12 pct)
128: 495.00 (0.00 pct) 523.00 (-5.65 pct) 567.00 (-14.54 pct) 515.00 (-4.04 pct) 675.00 (-36.36 pct) *
256: 45888.00 (0.00 pct) 45888.00 (0.00 pct) 46656.00 (-1.67 pct) 47168.00 (-2.78 pct) 44864.00 (2.23 pct)
512: 78464.00 (0.00 pct) 78976.00 (-0.65 pct) 83584.00 (-6.52 pct) 76672.00 (2.28 pct) 80768.00 (-2.93 pct)
Note: schbench shows a large amount of run-to-run variation at lower
worker counts. The results have been included to check for any large
increase in latency that would suggest schbench tasks queuing behind
one another.
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
o NPS 1
Clients: tip v1 v2 v3 v4
1 550.66 (0.00 pct) 582.73 (5.82 pct) 572.06 (3.88 pct) 576.94 (4.77 pct) 582.44 (5.77 pct)
2 1009.69 (0.00 pct) 1087.30 (7.68 pct) 1056.81 (4.66 pct) 1072.44 (6.21 pct) 1041.94 (3.19 pct)
4 1795.32 (0.00 pct) 1847.22 (2.89 pct) 1869.23 (4.11 pct) 1839.32 (2.45 pct) 1877.57 (4.58 pct)
8 2971.16 (0.00 pct) 3144.05 (5.81 pct) 3137.94 (5.61 pct) 3100.27 (4.34 pct) 3032.99 (2.08 pct)
16 4627.98 (0.00 pct) 4704.22 (1.64 pct) 4752.77 (2.69 pct) 4833.24 (4.43 pct) 4726.70 (2.13 pct)
32 8065.15 (0.00 pct) 8172.79 (1.33 pct) 9266.77 (14.89 pct) 9508.24 (17.89 pct) 9199.91 (14.06 pct)
64 14994.32 (0.00 pct) 15357.75 (2.42 pct) 15246.82 (1.68 pct) 15670.37 (4.50 pct) 15433.18 (2.92 pct)
128 5175.73 (0.00 pct) 3062.00 (-40.83 pct) 18429.11 (256.06 pct) 3365.81 (-34.96 pct) 2633.09 (-49.12 pct) *
128 20490.63 (0.00 pct) 20504.17 (0.06 pct) 21183.21 (3.37 pct) 20469.20 (-0.10 pct) 20879.77 (1.89 pct) [Verification Run]
256 48763.57 (0.00 pct) 50703.97 (3.97 pct) 50723.68 (4.01 pct) 49387.93 (1.28 pct) 49552.81 (1.61 pct)
512 43780.78 (0.00 pct) 45328.44 (3.53 pct) 45328.59 (3.53 pct) 45384.80 (3.66 pct) 43897.43 (0.26 pct)
1024 40341.84 (0.00 pct) 42823.05 (6.15 pct) 42262.72 (4.76 pct) 41856.06 (3.75 pct) 40785.67 (1.10 pct)
o NPS 4
Clients: tip v1 v2 v3 v4
1 549.22 (0.00 pct) 582.89 (6.13 pct) 576.74 (5.01 pct) 582.34 (6.03 pct) 585.19 (6.54 pct)
2 1000.08 (0.00 pct) 1111.54 (11.14 pct) 1043.47 (4.33 pct) 1060.99 (6.09 pct) 1071.39 (7.13 pct)
4 1794.78 (0.00 pct) 1895.64 (5.61 pct) 1858.40 (3.54 pct) 1828.08 (1.85 pct) 1862.47 (3.77 pct)
8 3008.50 (0.00 pct) 3117.10 (3.60 pct) 3060.15 (1.71 pct) 3143.65 (4.49 pct) 3065.17 (1.88 pct)
16 4804.71 (0.00 pct) 4677.82 (-2.64 pct) 4587.01 (-4.53 pct) 4694.21 (-2.29 pct) 4627.39 (-3.69 pct)
32 9156.57 (0.00 pct) 8462.23 (-7.58 pct) 8290.70 (-9.45 pct) 7906.44 (-13.65 pct) 8679.98 (-5.20 pct) *
32 9157.62 (0.00 pct) 8712.33 (-4.86 pct) 8640.77 (-5.64 pct) 9415.99 (2.82 pct) 9403.35 (2.68 pct) [Verification Run]
64 14901.45 (0.00 pct) 15263.87 (2.43 pct) 15031.33 (0.87 pct) 15149.54 (1.66 pct) 14714.04 (-1.25 pct)
128 20771.20 (0.00 pct) 21114.00 (1.65 pct) 17818.77 (-14.21 pct) 17686.98 (-14.84 pct) 15917.79 (-23.36 pct) *
128 20490.63 (0.00 pct) 20504.17 (0.06 pct) 21183.21 (3.37 pct) 20469.20 (-0.10 pct) 20879.77 (1.89 pct) [Verification Run]
256 47033.88 (0.00 pct) 48021.71 (2.10 pct) 48439.88 (2.98 pct) 48042.49 (2.14 pct) 49294.05 (4.80 pct)
512 43429.01 (0.00 pct) 44488.54 (2.43 pct) 43672.99 (0.56 pct) 42462.44 (-2.22 pct) 44072.90 (1.48 pct)
1024 39271.27 (0.00 pct) 42304.03 (7.72 pct) 41850.17 (6.56 pct) 39791.47 (1.32 pct) 41528.81 (5.74 pct)
Note: tbench with 128 clients runs into an ACPI idle driver issue that
is fixed by commit e400ad8b7e6a ("ACPI: processor idle: Practically
limit "Dummy wait" workaround to old Intel systems"), which will be
part of the v6.0 kernel release.
~~~~~~~~~~
~ stream ~
~~~~~~~~~~
o NPS 1
- 10 runs
Test: tip v1 v2 v3 v4
Copy: 335832.93 (0.00 pct) 338535.58 (0.80 pct) 334772.76 (-0.31 pct) 337487.50 (0.49 pct) 336720.22 (0.26 pct)
Scale: 212781.21 (0.00 pct) 217118.20 (2.03 pct) 213011.28 (0.10 pct) 216905.50 (1.93 pct) 213371.06 (0.27 pct)
Add: 251667.59 (0.00 pct) 240811.38 (-4.31 pct) 250478.75 (-0.47 pct) 250584.95 (-0.43 pct) 250987.62 (-0.27 pct)
Triad: 251537.87 (0.00 pct) 261919.66 (4.12 pct) 260702.92 (3.64 pct) 251181.87 (-0.14 pct) 262152.01 (4.21 pct)
- 100 runs
Test: tip v1 v2 v3 v4
Copy: 335721.37 (0.00 pct) 337441.09 (0.51 pct) 338472.90 (0.81 pct) 335777.78 (0.01 pct) 338434.23 (0.80 pct)
Scale: 219593.12 (0.00 pct) 224083.11 (2.04 pct) 218742.58 (-0.38 pct) 221381.50 (0.81 pct) 219603.23 (0.00 pct)
Add: 251612.53 (0.00 pct) 251633.66 (0.00 pct) 251593.37 (0.00 pct) 251261.72 (-0.13 pct) 251838.27 (0.08 pct)
Triad: 261985.15 (0.00 pct) 261639.38 (-0.13 pct) 263003.34 (0.38 pct) 261084.30 (-0.34 pct) 260353.64 (-0.62 pct)
o NPS 4
- 10 runs
Test: tip v1 v2 v3 v4
Copy: 354774.17 (0.00 pct) 359486.69 (1.32 pct) 368017.56 (3.73 pct) 374514.29 (5.56 pct) 344022.60 (-3.03 pct)
Scale: 231870.22 (0.00 pct) 221056.77 (-4.66 pct) 246191.29 (6.17 pct) 244736.54 (5.54 pct) 232084.49 (0.09 pct)
Add: 258728.29 (0.00 pct) 243136.12 (-6.02 pct) 259962.30 (0.47 pct) 273104.99 (5.55 pct) 256671.88 (-0.79 pct)
Triad: 269237.56 (0.00 pct) 282994.33 (5.10 pct) 286902.41 (6.56 pct) 290661.36 (7.95 pct) 269610.52 (0.13 pct)
- 100 runs
Test: tip v1 v2 v3 v4
Copy: 369249.91 (0.00 pct) 360411.30 (-2.39 pct) 364531.71 (-1.27 pct) 374280.94 (1.36 pct) 372066.41 (0.76 pct)
Scale: 254849.59 (0.00 pct) 253724.21 (-0.44 pct) 254868.47 (0.00 pct) 254916.90 (0.02 pct) 256054.43 (0.47 pct)
Add: 273124.66 (0.00 pct) 272945.31 (-0.06 pct) 272989.26 (-0.04 pct) 260213.79 (-4.72 pct) 273955.09 (0.30 pct)
Triad: 287935.27 (0.00 pct) 284522.85 (-1.18 pct) 284797.06 (-1.08 pct) 290192.01 (0.78 pct) 288755.39 (0.28 pct)
~~~~~~~~~~~~~~~~~~~~~~~~~
~ Notes and Observation ~
~~~~~~~~~~~~~~~~~~~~~~~~~
We still see a pileup with v1 and v2 but not with v3 and v4, suggesting
that the second hunk is not the reason for the pileup; rather, it is
choosing the current CPU in wake_affine_idle purely on the basis that
the currently running task is a short running task. To prevent a
pileup, we must only choose the current rq if the short running task is
the only task running there.

I've not checked for the sync flag, to allow a larger opportunity for
an affine wakeup. This assumes that wake_affine() is called only for
tasks that can benefit from an affine wakeup.
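
To make the resulting order of checks easier to see, below is a sketch
of how wake_affine_idle() looks with the v3 change applied. It is
reconstructed from the diff above on top of the current upstream
function; is_short_task() is the helper from the RFC and its definition
is not reproduced here, and the comments are mine:

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/* Wakeup from interrupt context: stay cache affine if possible. */
	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	/* Sync wakeup with only the waker running: pull to this_cpu. */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	/* An idle prev_cpu still wins over a busy this_cpu. */
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	/*
	 * Pick this_cpu for the short task case only when the short task
	 * is the sole task running there, which is what avoids the
	 * pileup seen with v1 and v2.
	 */
	if (cpu_rq(this_cpu)->nr_running == 1 && is_short_task(cpu_curr(this_cpu)))
		return this_cpu;

	return nr_cpumask_bits;
}
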
Sharing more data from the test runs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Hackbench 2 groups schedstat data ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
o NPS1
---------------------------------------------------------------------------------------------------------------------------
cpu: all_cpus (avg) vs cpu: all_cpus (avg)
---------------------------------------------------------------------------------------------------------------------------
kernel : v1 v3 v4
sched_yield count : 0, 0 0
Legacy counter can be ignored : 0, 0 0
schedule called : 49196, 51320 67541 | 37.29|
schedule left the processor idle : 21399, 21609 32655 | 52.60|
try_to_wake_up was called : 27726, 29630 | 6.87| 34868 | 25.76|
try_to_wake_up was called to wake up the local cpu : 2049, 1195 | -41.68| 409 | -80.04|
total runtime by tasks on this processor (in jiffies) : 548520817, 582155720 | 6.13| 1068137641 | 94.73|
total waittime by tasks on this processor (in jiffies) : 668076627, 480034644 | -28.15| 77773209 | -88.36| * v3 and v4 have lower wait time
total timeslices run on this cpu : 27791, 29703 | 6.88| 34883 | 25.52| and a larger runtime / waittime ratio
< ----------------------------------------------------------------- Wakeup info: -------------------------------------- >
kernel : v1 v3 v4
Wakeups on same SMT cpus = all_cpus (avg) : 1368, 1403 309 | -77.41|
Wakeups on same MC cpus = all_cpus (avg) : 20980, 21018 11493 | -45.22|
Wakeups on same DIE cpus = all_cpus (avg) : 2074, 3499 | 68.71| 11166 | 438.38|
Wakeups on same NUMA cpus = all_cpus (avg) : 1252, 2514 | 100.80| 11489 | 817.65|
Affine wakeups on same SMT cpus = all_cpus (avg) : 1400, 1046 | -25.29| 142 | -89.86|
Affine wakeups on same MC cpus = all_cpus (avg) : 18940, 13474 | -28.86| 2916 | -84.60|
Affine wakeups on same DIE cpus = all_cpus (avg) : 2163, 2827 | 30.70| 3771 | 74.34|
Affine wakeups on same NUMA cpus = all_cpus (avg) : 1145, 1945 | 69.87| 3466 | 202.71|
---------------------------------------------------------------------------------------------------------------------------
o NPS4
----------------------------------------------------------------------------------------------------------------------------
cpu: all_cpus (avg) vs cpu: all_cpus (avg)
----------------------------------------------------------------------------------------------------------------------------
kernel : v1 v3 v4
sched_yield count : 0, 0 0
Legacy counter can be ignored : 0, 0 0
schedule called : 49685, 50335 55266 | 11.23|
schedule left the processor idle : 21755, 21269 25277 | 16.19|
try_to_wake_up was called : 27870, 28990 29955 | 7.48|
try_to_wake_up was called to wake up the local cpu : 2054, 1246 | -39.34| 666 | -67.58|
total runtime by tasks on this processor (in jiffies) : 582044948, 657092589 | 12.89| 860907207 | 47.91|
total waittime by tasks on this processor (in jiffies) : 610820439, 435359035 | -28.73| 171279622 | -71.96| * Same is observed in NPS4 runs
total timeslices run on this cpu : 27923, 29059 29985 | 7.38|
< ----------------------------------------------------------------- Wakeup info: --------------------------------------- >
kernel : v1 v3 v4
Wakeups on same SMT cpus = all_cpus (avg) : 1307, 1229 | -5.97| 699 | -46.52|
Wakeups on same MC cpus = all_cpus (avg) : 19854, 19726 16895 | -14.90|
Wakeups on same NODE cpus = all_cpus (avg) : 818, 1442 | 76.28| 1959 | 139.49|
Wakeups on same NUMA cpus = all_cpus (avg) : 2068, 3257 | 57.50| 6861 | 231.77|
Wakeups on same NUMA cpus = all_cpus (avg) : 1767, 2088 | 18.17| 2871 | 62.48|
Affine wakeups on same SMT cpus = all_cpus (avg) : 1314, 887 | -32.50| 439 | -66.59|
Affine wakeups on same MC cpus = all_cpus (avg) : 17572, 11754 | -33.11| 6971 | -60.33|
Affine wakeups on same NODE cpus = all_cpus (avg) : 885, 1195 | 35.03| 1379 | 55.82|
Affine wakeups on same NUMA cpus = all_cpus (avg) : 1516, 2792 | 84.17| 4070 | 168.47|
Affine wakeups on same NUMA cpus = all_cpus (avg) : 845, 2042 | 141.66| 1823 | 115.74|
----------------------------------------------------------------------------------------------------------------------------
~~~~~~~~~~~~~~~~~
~ Stream traces ~
~~~~~~~~~~~~~~~~~
Trace is obtained by enabling the following tracepoints:
- sched_wakeup_new
- sched_migrate_task
trace_stream.sh-4581 [130] d..2. 1795.126862: sched_wakeup_new: comm=trace_stream.sh pid=4589 prio=120 target_cpu=008 (LLC: 1)
stream-4589 [008] d..2. 1795.128145: sched_wakeup_new: comm=stream pid=4591 prio=120 target_cpu=159 (LLC: 3)
stream-4589 [008] d..2. 1795.128189: sched_wakeup_new: comm=stream pid=4592 prio=120 target_cpu=162 (LLC: 4)
stream-4589 [008] d..2. 1795.128259: sched_wakeup_new: comm=stream pid=4593 prio=120 target_cpu=202 (LLC: 9)
stream-4589 [008] d..2. 1795.128281: sched_wakeup_new: comm=stream pid=4594 prio=120 target_cpu=173 (LLC: 5)
stream-4589 [008] d..2. 1795.128311: sched_wakeup_new: comm=stream pid=4595 prio=120 target_cpu=214 (LLC: 10)
stream-4589 [008] d..2. 1795.128366: sched_wakeup_new: comm=stream pid=4596 prio=120 target_cpu=053 (LLC: 6)
stream-4589 [008] d..2. 1795.128454: sched_wakeup_new: comm=stream pid=4597 prio=120 target_cpu=088 (LLC: 11)
stream-4589 [008] d..2. 1795.128475: sched_wakeup_new: comm=stream pid=4598 prio=120 target_cpu=191 (LLC: 7)
stream-4589 [008] d..2. 1795.128508: sched_wakeup_new: comm=stream pid=4599 prio=120 target_cpu=096 (LLC: 12)
stream-4589 [008] d..2. 1795.128568: sched_wakeup_new: comm=stream pid=4600 prio=120 target_cpu=130 (LLC: 0)
stream-4589 [008] d..2. 1795.128620: sched_wakeup_new: comm=stream pid=4601 prio=120 target_cpu=239 (LLC: 13)
stream-4589 [008] d..2. 1795.128641: sched_wakeup_new: comm=stream pid=4602 prio=120 target_cpu=146 (LLC: 2)
stream-4589 [008] d..2. 1795.128672: sched_wakeup_new: comm=stream pid=4603 prio=120 target_cpu=247 (LLC: 14)
stream-4589 [008] d..2. 1795.128747: sched_wakeup_new: comm=stream pid=4604 prio=120 target_cpu=255 (LLC: 15)
stream-4589 [008] d..2. 1795.128784: sched_wakeup_new: comm=stream pid=4605 prio=120 target_cpu=066 (LLC: 8)
No migrations were observed till the end of the run
- Initial task placement distribution
+--------+-------------------------------------+
| LLC ID | Tasks initially placed on this LLC |
+--------+-------------------------------------+
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
+--------+-------------------------------------+
A point to note is that Stream is more sensitive early in the run, when
its tasks have not yet run for long enough: if a kworker or another
short running task is running on the previous CPU during wakeup, the
logic will favor an affine wakeup, since the scheduler might not yet
have realized that Stream is a long running task.
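
As an aside on why this initial phase is fragile: when "short task" is
judged from a task's past run behaviour, a task with little history can
look short long before its real behaviour is reflected in the average.
Below is a purely illustrative stand-alone model of such a
classification; the threshold, the averaging factor, and all the names
are my assumptions and not the RFC's is_short_task() implementation:

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative model only, not the RFC's code: classify a task as
 * "short" from an exponentially weighted average of its past run
 * durations. Threshold and weights are made-up numbers.
 */
struct task_model {
	double dur_avg_us;	/* running average of run durations, in us */
	int nr_samples;
};

#define SHORT_TASK_THRESHOLD_US	100.0

static void account_run(struct task_model *t, double ran_us)
{
	if (!t->nr_samples)
		t->dur_avg_us = ran_us;
	else
		t->dur_avg_us = (t->dur_avg_us * 7 + ran_us) / 8;	/* EWMA */
	t->nr_samples++;
}

static bool looks_short(const struct task_model *t)
{
	return t->nr_samples && t->dur_avg_us <= SHORT_TASK_THRESHOLD_US;
}

int main(void)
{
	struct task_model stream = { 0 };
	int i;

	/* Early short bursts (startup, page faults) dominate the average... */
	account_run(&stream, 20.0);
	account_run(&stream, 30.0);
	printf("early: looks_short=%d avg=%.1f us\n",
	       looks_short(&stream), stream.dur_avg_us);

	/* ...and only after sustained long runs does it stop looking short. */
	for (i = 0; i < 20; i++)
		account_run(&stream, 10000.0);
	printf("later: looks_short=%d avg=%.1f us\n",
	       looks_short(&stream), stream.dur_avg_us);

	return 0;
}
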
Let me know if you would like me to gather more data on the test system
for the modified kernels discussed above.
--
Thanks and Regards,
Prateek