Message-ID: <906747ff-148c-f058-dc94-7a9225125f52@amd.com>
Date: Tue, 15 Nov 2022 16:58:37 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Abel Wu <wuyun.abel@...edance.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <valentin.schneider@....com>
Cc: Josh Don <joshdon@...gle.com>, Chen Yu <yu.c.chen@...el.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Aubrey Li <aubrey.li@...el.com>,
Qais Yousef <qais.yousef@....com>,
Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Yicong Yang <yangyicong@...wei.com>,
Barry Song <21cnbao@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
Hello Abel,
Thank you for taking a look at the report.
On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
>
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement in tail latency when the number
>> of workers is equal to the number of cores in the system in NPS1 and
>> NPS2 mode. (Marked with "^")
>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>> (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
>
> Nothing in particular comes to mind, and I think testing larger
> workloads is great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 sockets.
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 sockets.
>> Node 0: 0-15, 128-143
>> Node 1: 16-31, 144-159
>> Node 2: 32-47, 160-175
>> Node 3: 48-63, 176-191
>> Node 4: 64-79, 192-207
>> Node 5: 80-95, 208-223
>> Node 6: 96-111, 224-239
>> Node 7: 112-127, 240-255
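
The node layout above is meant to follow a regular pattern: each node's SMT sibling range is its first-thread range shifted by 128. A minimal Python sketch of that pattern, assuming 2 sockets with 64 cores each, first threads numbered 0-127 and siblings numbered 128-255 as in the listing (the function name `node_cpu_ranges` is only illustrative):

```python
SOCKETS = 2
CORES_PER_SOCKET = 64
TOTAL_CORES = SOCKETS * CORES_PER_SOCKET  # 128; SMT siblings start at CPU 128


def node_cpu_ranges(nps):
    """Return one (first_threads, smt_siblings) pair of inclusive ranges per node."""
    nodes = SOCKETS * nps
    cores_per_node = TOTAL_CORES // nodes
    layout = []
    for node in range(nodes):
        start = node * cores_per_node
        end = start + cores_per_node - 1
        # Sibling threads use the same core index shifted by TOTAL_CORES.
        layout.append(((start, end), (start + TOTAL_CORES, end + TOTAL_CORES)))
    return layout


for nps in (1, 2, 4):
    print(f"NPS{nps}:")
    for node, (primary, sibling) in enumerate(node_cpu_ranges(nps)):
        print(f"  Node {node}: {primary[0]}-{primary[1]}, {sibling[0]}-{sibling[1]}")
```

On a live system the actual mapping can be checked with `numactl --hardware` or `lscpu`; the sketch only reproduces the expected numbering.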
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip: 5.19.0 tip sched/core
>> - sis_core: 5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test:           tip                  sis_core
>> 1-groups:       4.06 (0.00 pct)      4.26 (-4.92 pct)   *
>> 1-groups:       4.14 (0.00 pct)      4.09 (1.20 pct)    [Verification Run]
>> 2-groups:       4.76 (0.00 pct)      4.71 (1.05 pct)
>> 4-groups:       5.22 (0.00 pct)      5.11 (2.10 pct)
>> 8-groups:       5.35 (0.00 pct)      5.31 (0.74 pct)
>> 16-groups:      7.21 (0.00 pct)      6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test:           tip                  sis_core
>> 1-groups:       4.09 (0.00 pct)      4.08 (0.24 pct)
>> 2-groups:       4.70 (0.00 pct)      4.69 (0.21 pct)
>> 4-groups:       5.05 (0.00 pct)      4.92 (2.57 pct)
>> 8-groups:       5.35 (0.00 pct)      5.26 (1.68 pct)
>> 16-groups:      6.37 (0.00 pct)      6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test:           tip                  sis_core
>> 1-groups:       4.07 (0.00 pct)      3.99 (1.96 pct)
>> 2-groups:       4.65 (0.00 pct)      4.59 (1.29 pct)
>> 4-groups:       5.13 (0.00 pct)      5.00 (2.53 pct)
>> 8-groups:       5.47 (0.00 pct)      5.43 (0.73 pct)
>> 16-groups:      6.82 (0.00 pct)      6.56 (3.81 pct)
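
For anyone reading the "pct" columns: hackbench reports elapsed time, so lower is better, and a positive percentage means sis_core was faster than tip. A minimal Python sketch of how such a figure can be derived, assuming the convention (tip - sis_core) / tip * 100, which is inferred from the table and not taken from the reporting script (`pct_improvement` is a hypothetical helper name):

```python
def pct_improvement(tip, sis_core):
    """Percent improvement of sis_core over tip for a lower-is-better metric.

    Positive means sis_core finished faster; negative means it regressed.
    The sign convention is an assumption inferred from the table.
    """
    return (tip - sis_core) / tip * 100.0


# (test name, tip, sis_core) taken from the NPS1 rows of the table
nps1 = [
    ("1-groups", 4.06, 4.26),
    ("16-groups", 7.21, 6.80),
]

for name, tip, sis_core in nps1:
    print(f"{name}: {pct_improvement(tip, sis_core):+.3f} pct")
```

The last digit may differ from the table depending on whether the reporting script rounds or truncates.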
>
> Although each CPU will get 2.5 tasks with 16-groups, which can
> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
> the total CPU usage was ~82% (with some older kernel version),
> so there is still a lot of idle time.
>
> I guess the cutoff at 16-groups is because the system is already
> loaded enough compared to real workloads, so testing more groups
> might just be a waste of time?
The machine has 16 LLCs so I capped the results at 16-groups.
Previously I had seen some run-to-run variance with larger group counts
so I limited the reports to 16-groups. I'll run hackbench with a larger
number of groups (32, 64, 128, 256) and get back to you with the
results along with results for a couple of long running workloads.
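
To put those group counts in perspective, here is a rough Python sketch of the resulting tasks-per-CPU ratio. It assumes 40 tasks per hackbench group (the default of 20 senders plus 20 receivers) and 256 CPUs, which matches the 2.5 tasks per CPU quoted above for 16-groups; `tasks_per_cpu` is just an illustrative helper:

```python
CPUS = 256            # 2 sockets x 64 cores x 2 threads
TASKS_PER_GROUP = 40  # assumed hackbench default: 20 senders + 20 receivers


def tasks_per_cpu(groups, cpus=CPUS, tasks_per_group=TASKS_PER_GROUP):
    """Average number of hackbench tasks per CPU for a given group count."""
    return groups * tasks_per_group / cpus


for groups in (16, 32, 64, 128, 256):
    print(f"{groups:>3}-groups: {tasks_per_cpu(groups):.1f} tasks per CPU")
```

This is only an average. The CPU utilization actually observed depends on how often the tasks block on their sockets.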
>
> Thanks & Best Regards,
> Abel
>
> [..snip..]
>
--
Thanks and Regards,
Prateek