Message-ID: <a6e38a33-0003-d3ea-de9b-cf805aef647f@didichuxing.com>
Date: Thu, 29 Sep 2022 14:59:46 +0800
From: Honglei Wang <wanghonglei@...ichuxing.com>
To: Chen Yu <yu.c.chen@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>
CC: Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Tim Chen <tim.c.chen@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Aaron Lu <aaron.lu@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Ingo Molnar <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Choose the CPU where short task is
running during wake up
On 2022/9/29 13:25, Chen Yu wrote:
> Hi Prateek,
> On 2022-09-26 at 11:20:16 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> When testing the patch on a dual socket Zen3 system (2 x 64C/128T) we
>> noticed some regressions in some standard benchmarks.
>>
>> tl;dr
>>
>> o Hackbench shows a noticeable regression in most cases. Looking at schedstat
>> data, we see an increased number of affine wakeups and an increase
>> in the average wait time. As the LLC size on the Zen3 machine is only
>> 16 CPUs, there is a good chance the LLC was overloaded and it required
>> intervention from the load balancer to distribute tasks optimally.
>>
>> o There is a regression in Stream which is caused by more than one Stream
>> thread piling up on the same LLC. This happens as a result of migration
>> in the wakeup path, where the logic goes for an affine wakeup if the
>> waker is a short-lived task, even if the sync flag is not set and the
>> previous CPU might be idle.
>>
> Nice analysis and thanks for your testing.
>> I'll inline the results and detailed observations below:
>>
>> On 9/15/2022 10:24 PM, Chen Yu wrote:
>>> [Background]
>>> At the LPC 2022 Real-time and Scheduling Microconference we presented
>>> the cross-CPU wakeup issue. This patch is a text version of the
>>> talk; hopefully it clarifies the problem, and we would appreciate any
>>> feedback.
>>>
>>> [Re-sent because the previous one did not reach LKML, sorry
>>> for any inconvenience.]
>>>
>>> [Problem Statement]
>>> For a workload that is doing frequent context switches, the throughput
>>> scales well until the number of instances reaches a peak point. After
>>> that peak point, the throughput drops significantly if the number of
>>> instances continues to increase.
>>>
>>> The will-it-scale context_switch1 test case exposes the issue. The
>>> test platform has 112 CPUs per LLC domain. The will-it-scale test launches
>>> 1, 8, 16 ... 112 instances respectively. Each instance is composed
>>> of 2 tasks, and each pair of tasks does ping-pong scheduling via
>>> pipe_read() and pipe_write(). No task is bound to any CPU.
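[For illustration, each instance behaves roughly like the pair of tasks
below, ping-ponging a byte over two pipes; this is a simplified standalone
sketch, not the actual will-it-scale source.]

/*
 * Simplified sketch of one context_switch1 instance: two tasks ping-pong
 * a byte over a pair of pipes, so every iteration forces a sleep/wakeup
 * on each side.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int ab[2], ba[2];	/* parent -> child, child -> parent */
	char c = 0;

	if (pipe(ab) || pipe(ba)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: wait for a byte, echo it back, repeat */
		for (;;) {
			if (read(ab[0], &c, 1) != 1 ||
			    write(ba[1], &c, 1) != 1)
				break;
		}
		return 0;
	}

	/* parent: send a byte, then sleep until the echo arrives */
	for (;;) {
		if (write(ab[1], &c, 1) != 1 ||
		    read(ba[0], &c, 1) != 1)
			break;
	}
	return 0;
}
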
>>> We found that, once the number of instances is higher than
>>> 56 (112 tasks in total, 1 task per CPU), the throughput
>>> drops accordingly as the instance number continues to increase:
>>>
>>> ^
>>> throughput|
>>> | X
>>> | X X X
>>> | X X X
>>> | X X
>>> | X X
>>> | X
>>> | X
>>> | X
>>> | X
>>> |
>>> +-----------------.------------------->
>>> 56
>>> number of instances
>>>
>>> [Symptom analysis]
>>> Both the perf profile and lockstat show that the bottleneck
>>> is the runqueue spinlock. Take the perf profile for example:
>>>
>>> nr_instance rq lock percentage
>>> 1 1.22%
>>> 8 1.17%
>>> 16 1.20%
>>> 24 1.22%
>>> 32 1.46%
>>> 40 1.61%
>>> 48 1.63%
>>> 56 1.65%
>>> --------------------------
>>> 64 3.77% |
>>> 72 5.90% | increase
>>> 80 7.95% |
>>> 88 9.98% v
>>> 96 11.81%
>>> 104 13.54%
>>> 112 15.13%
>>>
>>> And the rq lock bottleneck is composed of two paths (perf profile):
>>>
>>> (path1):
>>> raw_spin_rq_lock_nested.constprop.0;
>>> try_to_wake_up;
>>> default_wake_function;
>>> autoremove_wake_function;
>>> __wake_up_common;
>>> __wake_up_common_lock;
>>> __wake_up_sync_key;
>>> pipe_write;
>>> new_sync_write;
>>> vfs_write;
>>> ksys_write;
>>> __x64_sys_write;
>>> do_syscall_64;
>>> entry_SYSCALL_64_after_hwframe;write
>>>
>>> (path2):
>>> raw_spin_rq_lock_nested.constprop.0;
>>> __sched_text_start;
>>> schedule_idle;
>>> do_idle;
>>> cpu_startup_entry;
>>> start_secondary;
>>> secondary_startup_64_no_verify
>>>
>>> The idle percentage is around 30% when there are 112 instances:
>>> %Cpu0 : 2.7 us, 66.7 sy, 0.0 ni, 30.7 id
>>>
>>> As a comparison, if we set CPU affinity to these workloads,
>>> which stops them from migrating among CPUs, the idle percentage
>>> drops to nearly 0%, and the throughput increases by about 300%.
>>> This indicates that there is room for optimization.
>>>
>>> A possible scenario to describe the lock contention:
>>> task A tries to wake up task B on CPU1, so task A grabs the
>>> runqueue lock of CPU1. If CPU1 is about to exit idle, it needs
>>> to grab its own runqueue lock, which has already been taken by
>>> someone else. CPU1 then takes more time to exit idle, which hurts
>>> performance.
>>>
>>> TTWU_QUEUE could mitigate the cross-CPU runqueue lock contention.
>>> Since commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU
>>> on wakelist if wakee cpu is idle"), TTWU_QUEUE offloads the work from
>>> the waker and leverages the idle CPU to queue the wakee. However, a long
>>> idle duration is still observed. The idle task spends quite some time
>>> in sched_ttwu_pending() before it switches out. This long idle
>>> duration misleads SIS_UTIL, which then suggests the waker scan
>>> more CPUs. The time spent searching for an idle CPU makes the
>>> wakee wait longer, which in turn leads to more idle time.
>>> The NEWLY_IDLE balance fails to pull tasks to the idle CPU, which
>>> might be because no runnable wakee is found.
>>>
>>> [Proposal]
>>> If a system is busy, and the workloads are doing frequent context
>>> switches, it might not be a good idea to spread the wakees across
>>> different CPUs. Instead, considering the task running time and
>>> enhancing wake affine might be applicable.
>>>
>>> This idea was suggested by Rik at LPC 2019 when discussing
>>> latency nice. He asked the following question: if P1 is a small-time-slice
>>> task on a CPU, can we put the waking task P2 on that CPU and wait for
>>> P1 to release the CPU, without wasting time searching for an idle CPU?
>>> At LPC 2021 Vincent Guittot proposed:
>>> 1. If the wakee is a long-running task, should we skip the short idle CPU?
>>> 2. If the wakee is a short-running task, can we put it onto a lightly loaded
>>> local CPU?
>>>
>>> The current proposal is a variant of 2:
>>> If the target CPU is running a short-time-slice task, and the wakee
>>> is also a short-time-slice task, the target CPU could be chosen as the
>>> candidate when the system is busy.
>>>
>>> The definition of a short-time-slice task is: the average running time
>>> of the task during each run is no more than sysctl_sched_min_granularity.
>>> If a task switches in and then voluntarily relinquishes the CPU
>>> quickly, it is regarded as a short-running task.
>>> sysctl_sched_min_granularity is chosen because it is the minimal slice
>>> when there are too many runnable tasks.
>>>
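[For illustration only: a standalone userspace sketch of this criterion; the
helper added by the patch below performs the equivalent check in kernel
context. The names and numbers here are made up.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * A task counts as "short" if its total runtime does not exceed the number
 * of voluntary context switches times the minimal slice, i.e. its average
 * runtime per run is <= sysctl_sched_min_granularity. Values are in ns.
 */
static bool is_short_task_demo(uint64_t sum_exec_runtime,
			       uint64_t nvcsw,
			       uint64_t min_granularity)
{
	/*
	 * Multiply rather than divide, as the kernel helper does, which
	 * also avoids a division by zero when nvcsw == 0.
	 */
	return sum_exec_runtime <= nvcsw * min_granularity;
}

int main(void)
{
	uint64_t min_gran = 3000000ULL;	/* e.g. 3 ms, in nanoseconds */

	/* 1000 runs averaging 2 ms each -> short (prints 1) */
	printf("%d\n", is_short_task_demo(1000 * 2000000ULL, 1000, min_gran));
	/* 10 runs averaging 50 ms each -> not short (prints 0) */
	printf("%d\n", is_short_task_demo(10 * 50000000ULL, 10, min_gran));
	return 0;
}
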
>>> Reuse the nr_idle_scan of SIS_UTIL to decide if the system is busy.
>>> If yes, then a compromised "idle" CPU might be acceptable.
>>>
>>> The reason is that, if the waker is a short-running task, it might
>>> relinquish the CPU soon, so the wakee has a chance to be scheduled.
>>> On the other hand, if the wakee is also a short-running task, the
>>> impact it brings to the target CPU is small. If the system is
>>> already busy, maybe we can lower the bar for finding an "idle" CPU.
>>> The effect is that wake affine is enhanced.
>>>
>>> [Benchmark results]
>>> The baseline is 6.0-rc4.
>>>
>>> The throughput of will-it-scale.context_switch1 has been increased by
>>> 331.13% with this patch applied.
>>>
>>> netperf
>>> =======
>>> case load baseline(std%) compare%( std%)
>>> TCP_RR 28 threads 1.00 ( 0.57) +0.29 ( 0.59)
>>> TCP_RR 56 threads 1.00 ( 0.49) +0.43 ( 0.43)
>>> TCP_RR 84 threads 1.00 ( 0.34) +0.24 ( 0.34)
>>> TCP_RR 112 threads 1.00 ( 0.26) +1.57 ( 0.20)
>>> TCP_RR 140 threads 1.00 ( 0.20) +178.05 ( 8.83)
>>> TCP_RR 168 threads 1.00 ( 10.14) +0.87 ( 10.03)
>>> TCP_RR 196 threads 1.00 ( 13.51) +0.90 ( 11.84)
>>> TCP_RR 224 threads 1.00 ( 7.12) +0.66 ( 8.28)
>>> UDP_RR 28 threads 1.00 ( 0.96) -0.10 ( 0.97)
>>> UDP_RR 56 threads 1.00 ( 10.93) +0.24 ( 0.82)
>>> UDP_RR 84 threads 1.00 ( 8.99) +0.40 ( 0.71)
>>> UDP_RR 112 threads 1.00 ( 0.15) +0.72 ( 7.77)
>>> UDP_RR 140 threads 1.00 ( 11.11) +135.81 ( 13.86)
>>> UDP_RR 168 threads 1.00 ( 12.58) +147.63 ( 12.72)
>>> UDP_RR 196 threads 1.00 ( 19.47) -0.34 ( 16.14)
>>> UDP_RR 224 threads 1.00 ( 12.88) -0.35 ( 12.73)
>>>
>>> hackbench
>>> =========
>>> case load baseline(std%) compare%( std%)
>>> process-pipe 1 group 1.00 ( 1.02) +0.14 ( 0.62)
>>> process-pipe 2 groups 1.00 ( 0.73) +0.29 ( 0.51)
>>> process-pipe 4 groups 1.00 ( 0.16) +0.24 ( 0.31)
>>> process-pipe 8 groups 1.00 ( 0.06) +11.56 ( 0.11)
>>> process-sockets 1 group 1.00 ( 1.59) +0.06 ( 0.77)
>>> process-sockets 2 groups 1.00 ( 1.13) -1.86 ( 1.31)
>>> process-sockets 4 groups 1.00 ( 0.14) +1.76 ( 0.29)
>>> process-sockets 8 groups 1.00 ( 0.27) +2.73 ( 0.10)
>>> threads-pipe 1 group 1.00 ( 0.43) +0.83 ( 2.20)
>>> threads-pipe 2 groups 1.00 ( 0.52) +1.03 ( 0.55)
>>> threads-pipe 4 groups 1.00 ( 0.44) -0.08 ( 0.31)
>>> threads-pipe 8 groups 1.00 ( 0.04) +11.86 ( 0.05)
>>> threads-sockets 1 groups 1.00 ( 1.89) +3.51 ( 0.57)
>>> threads-sockets 2 groups 1.00 ( 0.04) -1.12 ( 0.69)
>>> threads-sockets 4 groups 1.00 ( 0.14) +1.77 ( 0.18)
>>> threads-sockets 8 groups 1.00 ( 0.03) +2.75 ( 0.03)
>>>
>>> tbench
>>> ======
>>> case load baseline(std%) compare%( std%)
>>> loopback 28 threads 1.00 ( 0.08) +0.51 ( 0.25)
>>> loopback 56 threads 1.00 ( 0.15) -0.89 ( 0.16)
>>> loopback 84 threads 1.00 ( 0.03) +0.35 ( 0.07)
>>> loopback 112 threads 1.00 ( 0.06) +2.84 ( 0.01)
>>> loopback 140 threads 1.00 ( 0.07) +0.69 ( 0.11)
>>> loopback 168 threads 1.00 ( 0.09) +0.14 ( 0.18)
>>> loopback 196 threads 1.00 ( 0.04) -0.18 ( 0.20)
>>> loopback 224 threads 1.00 ( 0.25) -0.37 ( 0.03)
>>>
>>> Other benchmarks are under testing.
>>
>> Discussed below are the results from running standard benchmarks on
>> a dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 sockets.
>>
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 sockets.
>>
>> Node 0: 0-15, 128-143
>> Node 1: 16-31, 144-159
>> Node 2: 32-47, 160-175
>> Node 3: 48-63, 176-191
>> Node 4: 64-79, 192-207
>> Node 5: 80-95, 208-223
>> Node 6: 96-111, 224-239
>> Node 7: 112-127, 240-255
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip: 5.19.0 tip sched/core
>> - shortrun: 5.19.0 tip sched/core + this patch
>>
>> When we started testing, the tip was at:
>> commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> NPS1
>>
>> Test: tip shortrun
>> 1-groups: 4.23 (0.00 pct) 4.24 (-0.23 pct)
>> 2-groups: 4.93 (0.00 pct) 5.68 (-15.21 pct)
>> 4-groups: 5.32 (0.00 pct) 6.21 (-16.72 pct)
>> 8-groups: 5.46 (0.00 pct) 6.49 (-18.86 pct)
>> 16-groups: 7.31 (0.00 pct) 7.78 (-6.42 pct)
>>
>> NPS2
>>
>> Test: tip shortrun
>> 1-groups: 4.19 (0.00 pct) 4.19 (0.00 pct)
>> 2-groups: 4.77 (0.00 pct) 5.43 (-13.83 pct)
>> 4-groups: 5.15 (0.00 pct) 6.20 (-20.38 pct)
>> 8-groups: 5.47 (0.00 pct) 6.54 (-19.56 pct)
>> 16-groups: 6.63 (0.00 pct) 7.28 (-9.80 pct)
>>
>> NPS4
>>
>> Test: tip shortrun
>> 1-groups: 4.23 (0.00 pct) 4.39 (-3.78 pct)
>> 2-groups: 4.78 (0.00 pct) 5.48 (-14.64 pct)
>> 4-groups: 5.17 (0.00 pct) 6.14 (-18.76 pct)
>> 8-groups: 5.63 (0.00 pct) 6.51 (-15.63 pct)
>> 16-groups: 7.88 (0.00 pct) 7.03 (10.78 pct)
>>
>> ~~~~~~~~~~~~
>> ~ schbench ~
>> ~~~~~~~~~~~~
>>
>> NPS1
>>
>> #workers: tip shortrun
>> 1: 22.00 (0.00 pct) 36.00 (-63.63 pct)
>> 2: 34.00 (0.00 pct) 38.00 (-11.76 pct)
>> 4: 37.00 (0.00 pct) 36.00 (2.70 pct)
>> 8: 55.00 (0.00 pct) 51.00 (7.27 pct)
>> 16: 69.00 (0.00 pct) 68.00 (1.44 pct)
>> 32: 113.00 (0.00 pct) 116.00 (-2.65 pct)
>> 64: 219.00 (0.00 pct) 232.00 (-5.93 pct)
>> 128: 506.00 (0.00 pct) 1019.00 (-101.38 pct)
>> 256: 45440.00 (0.00 pct) 44864.00 (1.26 pct)
>> 512: 76672.00 (0.00 pct) 73600.00 (4.00 pct)
>>
>> NPS2
>>
>> #workers: tip shortrun
>> 1: 31.00 (0.00 pct) 36.00 (-16.12 pct)
>> 2: 36.00 (0.00 pct) 36.00 (0.00 pct)
>> 4: 45.00 (0.00 pct) 39.00 (13.33 pct)
>> 8: 47.00 (0.00 pct) 48.00 (-2.12 pct)
>> 16: 66.00 (0.00 pct) 71.00 (-7.57 pct)
>> 32: 114.00 (0.00 pct) 123.00 (-7.89 pct)
>> 64: 215.00 (0.00 pct) 248.00 (-15.34 pct)
>> 128: 495.00 (0.00 pct) 531.00 (-7.27 pct)
>> 256: 48576.00 (0.00 pct) 47552.00 (2.10 pct)
>> 512: 79232.00 (0.00 pct) 74624.00 (5.81 pct)
>>
>> NPS4
>>
>> #workers: tip shortrun
>> 1: 30.00 (0.00 pct) 36.00 (-20.00 pct)
>> 2: 34.00 (0.00 pct) 38.00 (-11.76 pct)
>> 4: 41.00 (0.00 pct) 44.00 (-7.31 pct)
>> 8: 60.00 (0.00 pct) 53.00 (11.66 pct)
>> 16: 68.00 (0.00 pct) 73.00 (-7.35 pct)
>> 32: 116.00 (0.00 pct) 125.00 (-7.75 pct)
>> 64: 224.00 (0.00 pct) 248.00 (-10.71 pct)
>> 128: 495.00 (0.00 pct) 569.00 (-14.94 pct)
>> 256: 45888.00 (0.00 pct) 38720.00 (15.62 pct)
>> 512: 78464.00 (0.00 pct) 73600.00 (6.19 pct)
>>
>>
>> ~~~~~~~~~~
>> ~ tbench ~
>> ~~~~~~~~~~
>>
>> NPS1
>>
>> Clients: tip shortrun
>> 1 550.66 (0.00 pct) 546.56 (-0.74 pct)
>> 2 1009.69 (0.00 pct) 1010.01 (0.03 pct)
>> 4 1795.32 (0.00 pct) 1782.71 (-0.70 pct)
>> 8 2971.16 (0.00 pct) 3035.58 (2.16 pct)
>> 16 4627.98 (0.00 pct) 4816.82 (4.08 pct)
>> 32 8065.15 (0.00 pct) 9269.52 (14.93 pct)
>> 64 14994.32 (0.00 pct) 14704.38 (-1.93 pct)
>> 128 5175.73 (0.00 pct) 5174.77 (-0.01 pct)
>> 256 48763.57 (0.00 pct) 49649.67 (1.81 pct)
>> 512 43780.78 (0.00 pct) 44717.04 (2.13 pct)
>> 1024 40341.84 (0.00 pct) 42078.99 (4.30 pct)
>>
>> NPS2
>>
>> Clients: tip shortrun
>> 1 551.06 (0.00 pct) 549.17 (-0.34 pct)
>> 2 1000.76 (0.00 pct) 993.75 (-0.70 pct)
>> 4 1737.02 (0.00 pct) 1773.33 (2.09 pct)
>> 8 2992.31 (0.00 pct) 2971.05 (-0.71 pct)
>> 16 4579.29 (0.00 pct) 4470.71 (-2.37 pct)
>> 32 9120.73 (0.00 pct) 8080.89 (-11.40 pct)
>> 64 14918.58 (0.00 pct) 14395.57 (-3.50 pct)
>> 128 20830.61 (0.00 pct) 20579.09 (-1.20 pct)
>> 256 47708.18 (0.00 pct) 47416.37 (-0.61 pct)
>> 512 43721.79 (0.00 pct) 43754.83 (0.07 pct)
>> 1024 40920.49 (0.00 pct) 40701.90 (-0.53 pct)
>>
>> NPS4
>>
>> Clients: tip shortrun
>> 1 549.22 (0.00 pct) 548.36 (-0.15 pct)
>> 2 1000.08 (0.00 pct) 1037.74 (3.76 pct)
>> 4 1794.78 (0.00 pct) 1802.11 (0.40 pct)
>> 8 3008.50 (0.00 pct) 2989.22 (-0.64 pct)
>> 16 4804.71 (0.00 pct) 4706.51 (-2.04 pct)
>> 32 9156.57 (0.00 pct) 8253.84 (-9.85 pct)
>> 64 14901.45 (0.00 pct) 15049.51 (0.99 pct)
>> 128 20771.20 (0.00 pct) 13229.50 (-36.30 pct)
>> 256 47033.88 (0.00 pct) 46737.17 (-0.63 pct)
>> 512 43429.01 (0.00 pct) 43246.64 (-0.41 pct)
>> 1024 39271.27 (0.00 pct) 42194.75 (7.44 pct)
>>
>>
>> ~~~~~~~~~~
>> ~ stream ~
>> ~~~~~~~~~~
>>
>> NPS1
>>
>> 10 Runs:
>>
>> Test: tip shortrun
>> Copy: 336311.52 (0.00 pct) 330116.75 (-1.84 pct)
>> Scale: 212955.82 (0.00 pct) 215330.30 (1.11 pct)
>> Add: 251518.23 (0.00 pct) 250926.53 (-0.23 pct)
>> Triad: 262077.88 (0.00 pct) 259618.70 (-0.93 pct)
>>
>> 100 Runs:
>>
>> Test: tip shortrun
>> Copy: 339533.83 (0.00 pct) 323452.74 (-4.73 pct)
>> Scale: 194736.72 (0.00 pct) 215789.55 (10.81 pct)
>> Add: 218294.54 (0.00 pct) 244916.33 (12.19 pct)
>> Triad: 262371.40 (0.00 pct) 252997.84 (-3.57 pct)
>>
>> NPS2
>>
>> 10 Runs:
>>
>> Test: tip shortrun
>> Copy: 335277.15 (0.00 pct) 305516.57 (-8.87 pct)
>> Scale: 220990.24 (0.00 pct) 207061.22 (-6.30 pct)
>> Add: 264156.13 (0.00 pct) 243368.49 (-7.86 pct)
>> Triad: 268707.53 (0.00 pct) 223486.30 (-16.82 pct)
>>
>> 100 Runs:
>>
>> Test: tip shortrun
>> Copy: 334913.73 (0.00 pct) 319677.81 (-4.54 pct)
>> Scale: 230522.47 (0.00 pct) 222757.62 (-3.36 pct)
>> Add: 264567.28 (0.00 pct) 254883.62 (-3.66 pct)
>> Triad: 272974.23 (0.00 pct) 260561.08 (-4.54 pct)
>>
>> NPS4
>>
>> 10 Runs:
>>
>> Test: tip shortrun
>> Copy: 356452.47 (0.00 pct) 255911.77 (-28.20 pct)
>> Scale: 242986.42 (0.00 pct) 171587.28 (-29.38 pct)
>> Add: 268512.09 (0.00 pct) 188244.75 (-29.89 pct)
>> Triad: 281622.43 (0.00 pct) 193271.97 (-31.37 pct)
>>
>> 100 Runs:
>>
>> Test: tip shortrun
>> Copy: 367384.81 (0.00 pct) 273101.20 (-25.66 pct)
>> Scale: 254289.04 (0.00 pct) 189986.88 (-25.28 pct)
>> Add: 273683.33 (0.00 pct) 206384.96 (-24.58 pct)
>> Triad: 285696.90 (0.00 pct) 217214.10 (-23.97 pct)
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Notes and Observations ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o Schedstat data for Hackbench with 2 groups in NPS1 mode:
>>
>> ---------------------------------------------------------------------------------------------------
>> cpu: all_cpus (avg) vs cpu: all_cpus (avg)
>> ---------------------------------------------------------------------------------------------------
>> kernel: : tip shortrun
>> sched_yield count : 0, 0
>> Legacy counter can be ignored : 0, 0
>> schedule called : 53305, 40615 | -23.81|
>> schedule left the processor idle : 22406, 16919 | -24.49|
>> try_to_wake_up was called : 30822, 23625 | -23.35|
>> try_to_wake_up was called to wake up the local cpu : 984, 2583 | 162.50|
>> total runtime by tasks on this processor (in jiffies) : 596998654, 481267347 | -19.39| *
>> total waittime by tasks on this processor (in jiffies) : 514142630, 766745576 | 49.13| * Longer wait time
> Agree, the wait-to-run ratio is 766745576 / 481267347 = 1.59 after the patch, which is
> much bigger than 514142630 / 596998654 = 0.86 before the patch.
>
>> total timeslices run on this cpu : 30893, 23691 | -23.31| *
>> ---------------------------------------------------------------------------------------------------
>>
>>
>> < -------------------------------------- Wakeup info: -------------------------------------- >
>> kernel: : tip shortrun
>> Wakeups on same SMT cpus = all_cpus (avg) : 1470, 1301 | -11.50|
>> Wakeups on same MC cpus = all_cpus (avg) : 22913, 18606 | -18.80|
>> Wakeups on same DIE cpus = all_cpus (avg) : 3634, 693 | -80.93|
>> Wakeups on same NUMA cpus = all_cpus (avg) : 1819, 440 | -75.81|
>> Affine wakeups on same SMT cpus = all_cpus (avg) : 1025, 1421 | 38.63| * More affine wakeups on possibly
>> Affine wakeups on same MC cpus = all_cpus (avg) : 14455, 17514 | 21.16| * busy runqueue leading to longer
>> Affine wakeups on same DIE cpus = all_cpus (avg) : 2828, 701 | -75.21| wait time
>> Affine wakeups on same NUMA cpus = all_cpus (avg) : 1194, 456 | -61.81|
>> ------------------------------------------------------------------------------------------------
> Agree, for the SMT and MC domains, wake affine has been enhanced to suggest picking
> a CPU running a short task rather than an idle one. Then later SIS_UTIL would prefer
> to pick this candidate CPU.
>>
>> We observe a larger wait time with the patch, which points
>> to the fact that the tasks are piling up on the run queue. I believe
>> Tim's suggestion will help here, where we can avoid a pileup as a
>> result of the waker task being a short running task.
> Yes, we'll raise the bar to pick a short running CPU.
>>
>> o Tracepoint data for Stream for 100 runs in NPS4
>>
>> Following tracepoints were enabled for Stream threads:
>> - sched_wakeup_new: To observe initial placement
>> - sched_waking: To check if migration is in wakeup context or lb context
>> - sched_wakeup: To check if migration is in wakeup context or lb context
>> - sched_migrate_task: To observe task movements
>>
>> --> tip:
>>
>> run_stream.sh-3724 [057] d..2. 450.593407: sched_wakeup_new: comm=run_stream.sh pid=3733 prio=120 target_cpu=050 *LLC: 6
>> <idle>-0 [182] d.s4. 450.594375: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.594381: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [182] d.s4. 450.594657: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.594661: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3733 [050] d..2. 450.594893: sched_wakeup_new: comm=stream pid=3735 prio=120 target_cpu=057 *LLC: 7
>> stream-3733 [050] d..2. 450.594955: sched_wakeup_new: comm=stream pid=3736 prio=120 target_cpu=078 *LLC: 9
>> stream-3733 [050] d..2. 450.594988: sched_wakeup_new: comm=stream pid=3737 prio=120 target_cpu=045 *LLC: 5
>> stream-3733 [050] d..2. 450.595016: sched_wakeup_new: comm=stream pid=3738 prio=120 target_cpu=008 *LLC: 1
>> stream-3733 [050] d..2. 450.595029: sched_waking: comm=stream pid=3737 prio=120 target_cpu=045
>> <idle>-0 [045] dNh2. 450.595037: sched_wakeup: comm=stream pid=3737 prio=120 target_cpu=045
>> stream-3737 [045] d..2. 450.595072: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.595078: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3738 [008] d..2. 450.595102: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.595111: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3733 [050] d..2. 450.595151: sched_wakeup_new: comm=stream pid=3739 prio=120 target_cpu=097 *LLC: 12
>> stream-3733 [050] d..2. 450.595181: sched_wakeup_new: comm=stream pid=3740 prio=120 target_cpu=194 *LLC: 8
>> stream-3733 [050] d..2. 450.595221: sched_wakeup_new: comm=stream pid=3741 prio=120 target_cpu=080 *LLC: 10
>> stream-3733 [050] d..2. 450.595249: sched_wakeup_new: comm=stream pid=3742 prio=120 target_cpu=144 *LLC: 2
>> stream-3733 [050] d..2. 450.595285: sched_wakeup_new: comm=stream pid=3743 prio=120 target_cpu=239 *LLC: 13
>> stream-3733 [050] d..2. 450.595320: sched_wakeup_new: comm=stream pid=3744 prio=120 target_cpu=130 *LLC: 0
>> stream-3733 [050] d..2. 450.595364: sched_wakeup_new: comm=stream pid=3745 prio=120 target_cpu=113 *LLC: 14
>> stream-3744 [130] d..2. 450.595407: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.595416: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3733 [050] d..2. 450.595423: sched_waking: comm=stream pid=3745 prio=120 target_cpu=113
>> <idle>-0 [113] dNh2. 450.595433: sched_wakeup: comm=stream pid=3745 prio=120 target_cpu=113
>> stream-3733 [050] d..2. 450.595452: sched_wakeup_new: comm=stream pid=3746 prio=120 target_cpu=160 *LLC: 4
>> stream-3733 [050] d..2. 450.595486: sched_wakeup_new: comm=stream pid=3747 prio=120 target_cpu=255 *LLC: 15
>> stream-3733 [050] d..2. 450.595513: sched_wakeup_new: comm=stream pid=3748 prio=120 target_cpu=159 *LLC: 3
>> stream-3746 [160] d..2. 450.595533: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.595542: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3747 [255] d..2. 450.595562: sched_waking: comm=stream pid=3733 prio=120 target_cpu=050
>> <idle>-0 [050] dNh2. 450.595573: sched_wakeup: comm=stream pid=3733 prio=120 target_cpu=050
>> stream-3733 [050] d..2. 450.595614: sched_wakeup_new: comm=stream pid=3749 prio=120 target_cpu=222 *LLC: 11
>> stream-3740 [194] d..2. 451.140510: sched_waking: comm=stream pid=3747 prio=120 target_cpu=255
>> <idle>-0 [255] dNh2. 451.140523: sched_wakeup: comm=stream pid=3747 prio=120 target_cpu=255
>> stream-3733 [050] d..2. 451.617257: sched_waking: comm=stream pid=3740 prio=120 target_cpu=194
>> stream-3733 [050] d..2. 451.617267: sched_waking: comm=stream pid=3746 prio=120 target_cpu=160
>> stream-3733 [050] d..2. 451.617269: sched_waking: comm=stream pid=3739 prio=120 target_cpu=097
>> stream-3733 [050] d..2. 451.617272: sched_waking: comm=stream pid=3742 prio=120 target_cpu=144
>> stream-3733 [050] d..2. 451.617275: sched_waking: comm=stream pid=3749 prio=120 target_cpu=222
>> ... (No migrations observed)
>>
>> In most cases, each LLC is running only 1 stream thread, leading to optimal performance.
>>
>> --> with patch:
>>
>> run_stream.sh-4383 [070] d..2. 1237.764236: sched_wakeup_new: comm=run_stream.sh pid=4392 prio=120 target_cpu=206 *LLC: 9
>> stream-4392 [206] d..2. 1237.765121: sched_wakeup_new: comm=stream pid=4394 prio=120 target_cpu=070 *LLC: 8
>> stream-4392 [206] d..2. 1237.765171: sched_wakeup_new: comm=stream pid=4395 prio=120 target_cpu=169 *LLC: 5
>> stream-4392 [206] d..2. 1237.765204: sched_wakeup_new: comm=stream pid=4396 prio=120 target_cpu=111 *LLC: 13
>> stream-4392 [206] d..2. 1237.765243: sched_wakeup_new: comm=stream pid=4397 prio=120 target_cpu=130 *LLC: 0
>> stream-4392 [206] d..2. 1237.765249: sched_waking: comm=stream pid=4396 prio=120 target_cpu=111
>> <idle>-0 [111] dNh2. 1237.765260: sched_wakeup: comm=stream pid=4396 prio=120 target_cpu=111
>> stream-4392 [206] d..2. 1237.765281: sched_wakeup_new: comm=stream pid=4398 prio=120 target_cpu=182 *LLC: 6
>> stream-4392 [206] d..2. 1237.765318: sched_wakeup_new: comm=stream pid=4399 prio=120 target_cpu=060 *LLC: 7
>> stream-4392 [206] d..2. 1237.765368: sched_wakeup_new: comm=stream pid=4400 prio=120 target_cpu=124 *LLC: 15
>> stream-4392 [206] d..2. 1237.765408: sched_wakeup_new: comm=stream pid=4401 prio=120 target_cpu=031 *LLC: 3
>> stream-4392 [206] d..2. 1237.765439: sched_wakeup_new: comm=stream pid=4402 prio=120 target_cpu=095 *LLC: 11
>> stream-4392 [206] d..2. 1237.765475: sched_wakeup_new: comm=stream pid=4403 prio=120 target_cpu=015 *LLC: 1
>> stream-4401 [031] d..2. 1237.765497: sched_waking: comm=stream pid=4392 prio=120 target_cpu=206
>> stream-4401 [031] d..2. 1237.765506: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=206 dest_cpu=152 *LLC: 9 -> 3
>> <idle>-0 [152] dNh2. 1237.765540: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=152
>> stream-4403 [015] d..2. 1237.765562: sched_waking: comm=stream pid=4392 prio=120 target_cpu=152
>> stream-4403 [015] d..2. 1237.765570: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=152 dest_cpu=136 *LLC: 3 -> 1
>> <idle>-0 [136] dNh2. 1237.765602: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=136
>> stream-4392 [136] d..2. 1237.765799: sched_wakeup_new: comm=stream pid=4404 prio=120 target_cpu=097 *LLC: 12
>> stream-4392 [136] d..2. 1237.765893: sched_wakeup_new: comm=stream pid=4405 prio=120 target_cpu=084 *LLC: 10
>> stream-4392 [136] d..2. 1237.765957: sched_wakeup_new: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14
>> stream-4392 [136] d..2. 1237.766018: sched_wakeup_new: comm=stream pid=4407 prio=120 target_cpu=038 *LLC: 4
>> stream-4406 [119] d..2. 1237.766044: sched_waking: comm=stream pid=4392 prio=120 target_cpu=136
>> stream-4406 [119] d..2. 1237.766050: sched_migrate_task: comm=stream pid=4392 prio=120 orig_cpu=136 dest_cpu=240 *LLC: 1 -> 14
>> <idle>-0 [240] dNh2. 1237.766154: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=240
>> stream-4392 [240] d..2. 1237.766361: sched_wakeup_new: comm=stream pid=4408 prio=120 target_cpu=023 *LLC: 2
>> stream-4399 [060] d..2. 1238.300605: sched_waking: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14 <--- Two stream threads are
>> stream-4399 [060] d..2. 1238.300611: sched_waking: comm=stream pid=4392 prio=120 target_cpu=240 *LLC: 14 <--- on the same LLC leading to
>> <idle>-0 [119] dNh2. 1238.300620: sched_wakeup: comm=stream pid=4406 prio=120 target_cpu=119 *LLC: 14 cache contention, degrading
>> <idle>-0 [240] dNh2. 1238.300621: sched_wakeup: comm=stream pid=4392 prio=120 target_cpu=240 *LLC: 14 the Stream throughput.
>> ... (No more migrations observed)
>>
>> After all the wakeups and migrations, LLC 14 contains two stream threads (pid: 4392 and 4406).
>> All the migrations happen between the events sched_waking and sched_wakeup, showing that the
>> migrations happen during a wakeup and not as a result of load balancing.
>>
>>>
>>> This patch is more about enhancing wake affine than about improving
>>> SIS efficiency, so Mel's SIS statistics patch was not deployed for now.
>>>
>>> [Limitations]
>>> When the number of CPUs suggested by SIS_UTIL is lower than 60% of the LLC
>>> CPUs, the LLC domain is regarded as relatively busy. However, the 60% is
>>> somewhat hacky, because it indicates that the util_avg% is around 50%,
>>> a half-busy LLC. I don't have another lightweight/accurate method in mind
>>> to check whether the LLC domain is busy or not.
>>>
>>> [Misc]
>>> At LPC we received useful suggestions. The first one is that we should look at
>>> the time from when the task is woken up to when the task goes back to sleep.
>>> I assume this is aligned with what is proposed here - we consider the average
>>> running time, rather than the total running time. The second one is that we
>>> should also consider long-running tasks, and this is under investigation.
>>>
>>> Besides, Prateek has mentioned that SIS_UTIL is unable to deal with
>>> bursty workloads, because there is a delay in reflecting the instantaneous
>>> utilization and SIS_UTIL expects the workload to be stable. If the system
>>> is idle most of the time, but the workload suddenly bursts, SIS_UTIL
>>> over-scans. The current patch might mitigate this symptom somewhat, as a
>>> bursty workload is usually regarded as a short-running task.
>>>
>>> Suggested-by: Tim Chen <tim.c.chen@...el.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>>> ---
>>> kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++-
>>> 1 file changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 914096c5b1ae..7519ab5b911c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6020,6 +6020,19 @@ static int wake_wide(struct task_struct *p)
>>> return 1;
>>> }
>>>
>>> +/*
>>> + * If a task switches in and then voluntarily relinquishes the
>>> + * CPU quickly, it is regarded as a short running task.
>>> + * sysctl_sched_min_granularity is chosen as the threshold,
>>> + * as this value is the minimal slice if there are too many
>>> + * runnable tasks, see __sched_period().
>>> + */
>>> +static int is_short_task(struct task_struct *p)
>>> +{
>>> + return (p->se.sum_exec_runtime <=
>>> + (p->nvcsw * sysctl_sched_min_granularity));
>>> +}
>>> +
>>> /*
>>> * The purpose of wake_affine() is to quickly determine on which CPU we can run
>>> * soonest. For the purpose of speed we only consider the waking and previous
>>> @@ -6050,7 +6063,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>>> if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>>> return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>>>
>>> - if (sync && cpu_rq(this_cpu)->nr_running == 1)
>>> + if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
>>> + is_short_task(cpu_curr(this_cpu)))
This seems to slightly break the idle (or will-be-idle) purpose of
wake_affine_idle() here. Maybe we could do something like this instead?

	if ((sync || is_short_task(cpu_curr(this_cpu))) &&
	    cpu_rq(this_cpu)->nr_running == 1)
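(In context, the suggested condition would slot into wake_affine_idle()
roughly as below - just a sketch of the variant, not a tested change:)

	/*
	 * Prefer this_cpu only when its runqueue holds a single task,
	 * either because this is a sync wakeup or because the current
	 * task is a short running one expected to yield the CPU soon.
	 */
	if ((sync || is_short_task(cpu_curr(this_cpu))) &&
	    cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	/* otherwise keep preferring an idle prev_cpu, as today */
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;
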
Thanks,
Honglei
>>
>> This change seems to optimize for affine wakeups, which benefits
>> tasks with a producer-consumer pattern but is not ideal for Stream.
>> Currently the logic will do an affine wakeup even if the sync
>> flag is not set:
>>
>> stream-4135 [029] d..2. 353.580953: sched_waking: comm=stream pid=4129 prio=120 target_cpu=082
>> stream-4135 [029] d..2. 353.580957: select_task_rq_fair: wake_affine_idle: Select this_cpu: sync(0) rq->nr_running(1) is_short_task(1)
>> stream-4135 [029] d..2. 353.580960: sched_migrate_task: comm=stream pid=4129 prio=120 orig_cpu=82 dest_cpu=30
>> <idle>-0 [030] dNh2. 353.580993: sched_wakeup: comm=stream pid=4129 prio=120 target_cpu=030
>>
>> I believe the sync flag should be taken into consideration when
>> going for an affine wakeup. Also, the short running check could
>> be done at the end, after checking if prev_cpu is an available_idle_cpu.
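(A rough sketch of that ordering, for illustration only - not a tested
change:)

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	/*
	 * Fall back to the short-task heuristic only after the sync fast
	 * path and an idle prev_cpu have both been ruled out.
	 */
	if (is_short_task(cpu_curr(this_cpu)))
		return this_cpu;
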
>>
> We can move the short running check after the prev_cpu check. If we
> add the sync flag check, would it shrink the coverage of this change?
> I found that only limited scenarios enable the sync flag, and we want
> to make the short running check a generic optimization.
> But yes, we can test with/without the sync flag constraint to see which
> one gives better data.
>>> return this_cpu;
>>>
>>> if (available_idle_cpu(prev_cpu))
>>> @@ -6434,6 +6448,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> /* overloaded LLC is unlikely to have idle cpu/core */
>>> if (nr == 1)
>>> return -1;
>>> +
>>> + /*
>>> + * If nr is smaller than 60% of llc_weight, it
>>> + * indicates that the util_avg% is higher than 50%.
>>> + * This is calculated by SIS_UTIL in
>>> + * update_idle_cpu_scan(). The 50% util_avg indicates
>>> + * a half-busy LLC domain. System busier than this
>>> + * level could lower its bar to choose a compromised
>>> + * "idle" CPU. If the waker on target CPU is a short
>>> + * task and the wakee is also a short task, pick
>>> + * target directly.
>>> + */
>>> + if (!has_idle_core && (5 * nr < 3 * sd->span_weight) &&
>>> + is_short_task(p) && is_short_task(cpu_curr(target)))
>>> + return target;
>>
>> The pileup seen in hackbench could also be a result of an early
>> bailout here for smaller LLCs, but I don't have any data to
>> substantiate that claim currently.
>>
>>> }
>>> }
>>>
>> Please let me know if you need any more data from the test
>> system for any of the benchmarks covered or if you would like
>> me to run any other benchmark on the test system.
> Thank you for your testing. I'll enable SNC to divide the LLC domain
> into smaller ones and see if the issue can be reproduced
> on my platform too, then I'll update my findings on this.
>
> thanks,
> Chenyu
>> --
>> Thanks and Regards,
>> Prateek