Message-ID: <18912151-82c0-2936-10b1-f5fe7132287b@amd.com>
Date: Thu, 3 Aug 2023 11:23:36 +0530
From: Swapnil Sapkal <Swapnil.Sapkal@....com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Aaron Lu <aaron.lu@...el.com>, x86@...nel.org,
Peter Zijlstra <peterz@...radead.org>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms
Hello Mathieu,
On 7/27/2023 12:26 AM, Mathieu Desnoyers wrote:
> On 7/26/23 13:40, Shrikanth Hegde wrote:
>>
>>
>> On 7/26/23 7:37 PM, Mathieu Desnoyers wrote:
>>> On 7/26/23 04:04, Shrikanth Hegde wrote:
>>>>
>>>>
>>>> On 7/26/23 1:00 AM, Mathieu Desnoyers wrote:
>>>>> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
>>>>> has exited the idle loop.
>>>>>
>>>>> This speeds up the following hackbench workload on a 192-core AMD EPYC
>>>>> 9654 96-Core Processor (over 2 sockets):
>>>>>
>>>>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>>>>>
>>>>> from 49s to 34s. (30% speedup)
>>>>>
>>>>> My working hypothesis for why this helps is: queuing more than a single
>>>>> task on the runqueue of a cpu which just exited idle rather than
>>>>> spreading work over other idle cpus helps power efficiency on systems
>>>>> with a large number of cores.
>>>>>
>>>>> This was developed as part of the investigation into a weird regression
>>>>> reported by AMD where adding a raw spinlock in the scheduler context
>>>>> switch accelerated hackbench.
>>
>> Do you have SMT here? What is the system utilization when you are running
>> this workload?
>
> Yes, SMT is enabled, which brings the number of logical cpus to 384.
>
> CPU utilization (through htop):
>
> * 6.4.4: 27500%
> * 6.4.4 with the extend-idle+nr_running<=4 patch: 30500%
>
>>
>>>>>
>>>>> It turned out that changing this raw spinlock for a loop of 10000x
>>>>> cpu_relax within do_idle() had similar benefits.
>>>>>
>>>>> This patch achieves a similar effect without the busy-waiting by
>>>>> introducing a runqueue state sampling the sched_clock() when exiting
>>>>> idle, which allows select_task_rq to consider "as idle" a cpu which has
>>>>> recently exited idle.
>>>>>
>>>>> This patch should be considered "food for thought", and I would be glad
>>>>> to hear feedback on whether it causes regressions on _other_ workloads,
>>>>> and whether it helps with the hackbench workload on large Intel systems
>>>>> as well.
>>>>>
>>>>> Link:
>>>>> https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
>>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>>>> Cc: Ingo Molnar <mingo@...hat.com>
>>>>> Cc: Peter Zijlstra <peterz@...radead.org>
>>>>> Cc: Valentin Schneider <vschneid@...hat.com>
>>>>> Cc: Steven Rostedt <rostedt@...dmis.org>
>>>>> Cc: Ben Segall <bsegall@...gle.com>
>>>>> Cc: Mel Gorman <mgorman@...e.de>
>>>>> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
>>>>> Cc: Vincent Guittot <vincent.guittot@...aro.org>
>>>>> Cc: Juri Lelli <juri.lelli@...hat.com>
>>>>> Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
>>>>> Cc: Aaron Lu <aaron.lu@...el.com>
>>>>> Cc: x86@...nel.org
>>>>> ---
>>>>> kernel/sched/core.c | 4 ++++
>>>>> kernel/sched/sched.h | 3 +++
>>>>> 2 files changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>>> index a68d1276bab0..d40e3a0a5ced 100644
>>>>> --- a/kernel/sched/core.c
>>>>> +++ b/kernel/sched/core.c
>>>>> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
>>>>> * TASK_RUNNING state.
>>>>> */
>>>>> WARN_ON_ONCE(current->__state);
>>>>> + WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
>>>>> do {
>>>>> __schedule(SM_NONE);
>>>>> } while (need_resched());
>>>>> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
>>>>> {
>>>>> struct rq *rq = cpu_rq(cpu);
>>>>> +	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
>>>>
>>>>
>>>> Wouldn't this hurt the latency badly? Especially on a loaded system with
>>>> a workload that does a lot of wakeups.
>>>
>>> Good point !
>>>
>>> Can you try your benchmark replacing the if () statement above by:
>>>
>>> +	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
>>> + READ_ONCE(rq->nr_running) <= 4)
>>> + return 1;
>>
>>
>> Tried with this change. I think it does help in reducing latency compared to
>> before, especially up to the 95th percentile.
>
> For the record, I also tried with nr_running <= 2 and still had decent performance
> (32s with nr_running <= 2 instead of 30s for nr_running <= 4). It did drop with
> nr_running <= 1 (40s). nr_running <= 5 was similar to 4, and performance starts
> degrading with nr_running <= 8 (31s).
>
> So it might be interesting to measure the latency with nr_running <= 2 as well.
> Perhaps nr_running <= 2 would be a good compromise between throughput and tail
> latency.
>
>>                  6.5-rc3          6.5-rc3+RFC_Patch    6.5-rc3+RFC_Patch+nr<4
>> 4 Groups
>> 50.0th: 18.00 18.50 18.50
>> 75.0th: 21.50 26.00 23.50
>> 90.0th: 56.00 940.50 501.00
>> 95.0th: 678.00 1896.00 1392.00
>> 99.0th: 2484.00 3756.00 3708.00
>> 99.5th: 3224.00 4616.00 5088.00
>> 99.9th: 4960.00 6824.00 8068.00
>> 8 Groups
>> 50.0th: 23.50 25.50 23.00
>> 75.0th: 30.50 421.50 30.50
>> 90.0th: 443.50 1722.00 741.00
>> 95.0th: 1410.00 2736.00 1670.00
>> 99.0th: 3942.00 5496.00 4032.00
>> 99.5th: 5232.00 7016.00 5064.00
>> 99.9th: 7996.00 8896.00 8012.00
>> 16 Groups
>> 50.0th: 33.50 41.50 32.50
>> 75.0th: 49.00 752.00 47.00
>> 90.0th: 1067.50 2332.00 994.50
>> 95.0th: 2093.00 3468.00 2117.00
>> 99.0th: 5048.00 6728.00 5568.00
>> 99.5th: 6760.00 7624.00 6960.00
>> 99.9th: 8592.00 9504.00 11104.00
>> 32 Groups
>> 50.0th: 60.00 79.00 53.00
>> 75.0th: 456.50 1712.00 209.50
>> 90.0th: 2788.00 3996.00 2752.00
>> 95.0th: 4544.00 5768.00 5024.00
>> 99.0th: 8444.00 9104.00 10352.00
>> 99.5th: 9168.00 9808.00 12720.00
>> 99.9th: 11984.00 12448.00 17624.00
>
> [...]
>
>>>>> @@ -1010,6 +1012,7 @@ struct rq {
>>>>> struct task_struct __rcu *curr;
>>>>> struct task_struct *idle;
>>>>> + u64 idle_end_time;
>>
>> There is clock_idle already in the rq. Can that be used for the same?
>
> Good point! And I'll change my use of "sched_clock()" in idle_cpu() to a
> proper "sched_clock_cpu(cpu_of(rq))", which will work better on systems
> without a constant TSC.
>
> The updated patch:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..1c7d5bd2968b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
>
> + if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
> + sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
> + return 1;
> +
> if (rq->curr != rq->idle)
> return 0;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 81ac605b9cd5..57a49a5524f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -97,6 +97,9 @@
> # define SCHED_WARN_ON(x) ({ (void)(x), 0; })
> #endif
>
> +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */
> +#define IDLE_CPU_DELAY_MAX_RUNNING 4
> +
> struct rq;
> struct cpuidle_state;
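To make sure I am reading the check being tested correctly: below is a small
userspace model of the updated idle_cpu() fast path (illustrative only, not
kernel code; the constants and field names are taken from the patch above,
everything else is made up for the example):

/* Userspace model of the proposed idle_cpu() fast path -- illustration only. */
#include <stdint.h>
#include <stdio.h>

#define IDLE_CPU_DELAY_NS		1000000ULL	/* 1ms */
#define IDLE_CPU_DELAY_MAX_RUNNING	4

struct model_rq {
	uint64_t clock_idle;		/* ns timestamp when this cpu was last seen idle */
	unsigned int nr_running;	/* tasks currently queued on it */
};

/* Report the cpu as idle if it was idle less than 1ms ago and at most
 * IDLE_CPU_DELAY_MAX_RUNNING tasks are queued on it. */
static int model_idle_cpu(const struct model_rq *rq, uint64_t now_ns)
{
	if (rq->nr_running <= IDLE_CPU_DELAY_MAX_RUNNING &&
	    now_ns < rq->clock_idle + IDLE_CPU_DELAY_NS)
		return 1;
	return 0;	/* otherwise fall through to the usual idle checks */
}

int main(void)
{
	struct model_rq rq = { .clock_idle = 1000000000ULL, .nr_running = 2 };

	/* 0.5ms after last being idle, 2 tasks queued: still reported "idle" */
	printf("%d\n", model_idle_cpu(&rq, rq.clock_idle + 500000));
	/* 2ms after last being idle: the window has expired */
	printf("%d\n", model_idle_cpu(&rq, rq.clock_idle + 2000000));
	return 0;
}

So, if I understand the patch correctly, a cpu keeps being reported as idle to
task placement for up to 1ms after it was last idle, as long as no more than 4
tasks are already queued on it.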
I have run some standard micro-benchmarks to test this patch on a 2-socket Bergamo system
which has 256C/512T (2 x 128C per socket). The following are the results:
base : 6.5.0-rc4
base + extend-idle : base + the original patch, which extends the idle state by 1ms.
base + extend-idle + nr_running : base + the updated patch, which contains both the extended idle window and the nr_running <= 4 limit.
========================= hackbench =========================
Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1-groups: 19.95 (0.00 pct) 15.92 (20.20 pct) 15.30 (23.30 pct)
2-groups: 21.33 (0.00 pct) 23.00 (-7.82 pct) 19.70 (7.64 pct)
4-groups: 22.57 (0.00 pct) 30.02 (-33.00 pct) 26.74 (-18.47 pct)
8-groups: 24.68 (0.00 pct) 32.54 (-31.84 pct) 28.27 (-14.54 pct)
16-groups: 31.20 (0.00 pct) 32.47 (-4.07 pct) 27.70 (11.21 pct)
Observation: Hackbench shows improvement for lower and higher numbers of groups, but it
shows a dip in performance for the middle of the range.
========================= new_schbench =========================
Metric: wakeup_lat_summary
#workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1: 30.00 (0.00 pct) 33.00 (-10.00 pct) 33.00 (-10.00 pct)
2: 26.00 (0.00 pct) 32.00 (-23.07 pct) 33.00 (-26.92 pct)
4: 26.00 (0.00 pct) 32.00 (-23.07 pct) 32.00 (-23.07 pct)
8: 9.00 (0.00 pct) 10.00 (-11.11 pct) 9.00 (0.00 pct)
16: 8.00 (0.00 pct) 10.00 (-25.00 pct) 9.00 (-12.50 pct)
32: 8.00 (0.00 pct) 9.00 (-12.50 pct) 10.00 (-25.00 pct)
64: 8.00 (0.00 pct) 10.00 (-25.00 pct) 10.00 (-25.00 pct)
128: 8.00 (0.00 pct) 11.00 (-37.50 pct) 12.00 (-50.00 pct)
256: 102.00 (0.00 pct) 48.00 (52.94 pct) 50.00 (50.98 pct)
512: 20704.00 (0.00 pct) 23200.00 (-12.05 pct) 23200.00 (-12.05 pct)
Metric: request_lat_summary
#workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1: 6712.00 (0.00 pct) 6744.00 (-0.47 pct) 6760.00 (-0.71 pct)
2: 6792.00 (0.00 pct) 6840.00 (-0.70 pct) 6872.00 (-1.17 pct)
4: 6792.00 (0.00 pct) 6840.00 (-0.70 pct) 6856.00 (-0.94 pct)
8: 6776.00 (0.00 pct) 6824.00 (-0.70 pct) 6872.00 (-1.41 pct)
16: 6760.00 (0.00 pct) 6792.00 (-0.47 pct) 6872.00 (-1.65 pct)
32: 6808.00 (0.00 pct) 6776.00 (0.47 pct) 6872.00 (-0.94 pct)
64: 6808.00 (0.00 pct) 6776.00 (0.47 pct) 6872.00 (-0.94 pct)
128: 12208.00 (0.00 pct) 11856.00 (2.88 pct) 12784.00 (-4.71 pct)
256: 13264.00 (0.00 pct) 13296.00 (-0.24 pct) 13680.00 (-3.13 pct)
512: 84096.00 (0.00 pct) 100992.00 (-20.09 pct) 110208.00 (-31.05 pct)
Metric: rps_summary
#workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1: 305.00 (0.00 pct) 302.00 (0.98 pct) 296.00 (2.95 pct)
2: 607.00 (0.00 pct) 601.00 (0.98 pct) 593.00 (2.30 pct)
4: 1214.00 (0.00 pct) 1202.00 (0.98 pct) 1190.00 (1.97 pct)
8: 2436.00 (0.00 pct) 2420.00 (0.65 pct) 2372.00 (2.62 pct)
16: 4888.00 (0.00 pct) 4872.00 (0.32 pct) 4808.00 (1.63 pct)
32: 9776.00 (0.00 pct) 9744.00 (0.32 pct) 9680.00 (0.98 pct)
64: 19616.00 (0.00 pct) 19488.00 (0.65 pct) 19360.00 (1.30 pct)
128: 38592.00 (0.00 pct) 38720.00(-0.33 pct) 38080.00 (1.32 pct)
256: 41024.00 (0.00 pct) 40896.00 (0.31 pct) 38976.00 (4.99 pct)
512: 36800.00 (0.00 pct) 35776.00 (2.78 pct) 33728.00 (8.34 pct)
Observation: new_schbench shows a regression in wakeup latencies, while request
latencies and rps show no change except for the highly loaded case in request
latency.
========================= schbench =========================
#workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1: 185.00 (0.00 pct) 181.00 (2.16 pct) 181.00 (2.16 pct)
2: 191.00 (0.00 pct) 192.00 (-0.52 pct) 189.00 (1.04 pct)
4: 191.00 (0.00 pct) 192.00 (-0.52 pct) 194.00 (-1.57 pct)
8: 218.00 (0.00 pct) 198.00 (9.17 pct) 200.00 (8.25 pct)
16: 226.00 (0.00 pct) 225.00 (0.44 pct) 224.00 (0.88 pct)
32: 537.00 (0.00 pct) 537.00 (0.00 pct) 539.00 (-0.37 pct)
64: 605.00 (0.00 pct) 621.00 (-2.64 pct) 619.00 (-2.31 pct)
128: 765.00 (0.00 pct) 795.00 (-3.92 pct) 781.00 (-2.09 pct)
256: 1122.00 (0.00 pct) 1150.00 (-2.49 pct) 1150.00 (-2.49 pct)
512: 2276.00 (0.00 pct) 2476.00 (-8.78 pct) 2404.00 (-5.62 pct)
========================= tbench =========================
Clients: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1 386.07 (0.00 pct) 390.13 (1.05 pct) 463.49 (20.05 pct)
2 718.40 (0.00 pct) 856.37 (19.20 pct) 799.31 (11.26 pct)
4 1399.34 (0.00 pct) 1514.03 (8.19 pct) 1482.01 (5.90 pct)
8 2716.56 (0.00 pct) 3000.02 (10.43 pct) 2823.23 (3.92 pct)
16 5275.97 (0.00 pct) 5468.17 (3.64 pct) 5430.77 (2.93 pct)
32 10534.37 (0.00 pct) 10442.13 (-0.87 pct) 10386.92 (-1.39 pct)
64 22079.03 (0.00 pct) 19161.30 (-13.21 pct) 16773.73 (-24.02 pct)
128 41051.13 (0.00 pct) 29923.57 (-27.10 pct) 22510.13 (-45.16 pct)
256 55603.43 (0.00 pct) 43203.67 (-22.30 pct) 40343.17 (-27.44 pct)
512 130673.33 (0.00 pct) 76581.40 (-41.39 pct) 69152.47 (-47.07 pct)
1024 133323.67 (0.00 pct) 114910.67 (-13.81 pct) 107728.67 (-19.19 pct)
2048 143674.33 (0.00 pct) 123842.67 (-13.80 pct) 107202.00 (-25.38 pct)
Observation: tbench shows a dip in throughput from 64 clients onwards.
====================== stream 10 RUNS ======================
Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
Copy: 354190.51 (0.00 pct) 356650.82 (0.69 pct) 353287.34 (-0.25 pct)
Scale: 355427.44 (0.00 pct) 356686.34 (0.35 pct) 354406.79 (-0.28 pct)
Add: 373800.46 (0.00 pct) 376610.56 (0.75 pct) 374609.00 (0.21 pct)
Triad: 374697.25 (0.00 pct) 377635.98 (0.78 pct) 375343.44 (0.17 pct)
====================== stream 100 RUNS ======================
Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
Copy: 357922.89 (0.00 pct) 356560.50 (-0.38 pct) 356507.22 (-0.39 pct)
Scale: 358118.38 (0.00 pct) 357435.86 (-0.19 pct) 358033.29 (-0.02 pct)
Add: 375307.34 (0.00 pct) 376046.70 (0.19 pct) 375586.33 (0.07 pct)
Triad: 375656.40 (0.00 pct) 376674.43 (0.27 pct) 376581.95 (0.24 pct)
========================== netperf ==========================
Clients: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4
1-clients: 114299.07 (0.00 pct) 110695.40 (-3.15 pct) 111533.30 (-2.41 pct)
2-clients: 114130.01 (0.00 pct) 110192.51 (-3.45 pct) 111682.01 (-2.14 pct)
4-clients: 109126.45 (0.00 pct) 107275.60 (-1.69 pct) 109574.77 (0.41 pct)
8-clients: 111209.21 (0.00 pct) 104360.40 (-6.15 pct) 106518.41 (-4.21 pct)
16-clients: 102955.20 (0.00 pct) 100968.38 (-1.92 pct) 101897.25 (-1.02 pct)
32-clients: 98537.18 (0.00 pct) 103018.09 ( 4.54 pct) 103917.70 (5.46 pct)
64-clients: 103619.68 (0.00 pct) 100376.37 (-3.13 pct) 102651.24 (-0.93 pct)
128-clients: 98536.55 (0.00 pct) 77845.69 (-20.99 pct) 98566.87 (0.03 pct)
256-clients: 51934.45 (0.00 pct) 52844.10 (1.75 pct) 53562.99 (3.13 pct)
I have also tried the same experiment on a one-socket Genoa system with 96C/192T. On that
system as well, I am seeing similar behavior.
Could you share your build config, just in case I am missing something?
>
> And using it now brings the hackbench wall time to 28s :)
>
> Thanks,
>
> Mathieu
>
>>
>>>>> struct task_struct *stop;
>>>>> unsigned long next_balance;
>>>>> struct mm_struct *prev_mm;
>>>
>
--
Thanks and regards,
Swapnil