Message-ID: <fc6a2cd3-1425-40de-99a3-605d3215c0cd@amd.com>
Date: Wed, 25 Jun 2025 10:00:41 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Chen Yu <yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Hello Tim,
On 6/25/2025 6:00 AM, Tim Chen wrote:
>> o Benchmarks that prefer co-location and run in threaded mode see
>>    a benefit, including hackbench at high utilization and schbench
>>    at low utilization.
>>
>> o schbench (both new and old, but particularly the old) regresses
>>    quite a bit on the tail latency metric when #workers cross the
>>    LLC size.
>
> Will take closer look at the cases where #workers just exceed LLC size.
> Perhaps adjusting the threshold to spread the load earlier at a
> lower LLC utilization will help.
I too will test with different numbers of fd pairs to see if I can
spot a trend.
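
Just to make the "spread the load earlier" idea concrete, here is a
tiny stand-alone user-space model of the heuristic being discussed:
honor the preferred LLC only while that LLC's utilization is below a
tunable threshold, and fall back to the usual spread placement
otherwise. The names and the 75% knob below are purely illustrative
and are not taken from this series:

/* Stand-alone sketch; not the kernel code from this series. */
#include <stdbool.h>
#include <stdio.h>

struct llc_stats {
        unsigned int nr_running;        /* runnable tasks in this LLC */
        unsigned int nr_cpus;           /* CPUs in this LLC domain    */
};

/* Hypothetical knob: aggregate only while the LLC is < 75% busy. */
#define LLC_AGGR_THRESH_PCT     75

static bool keep_preferred_llc(const struct llc_stats *llc)
{
        return llc->nr_running * 100 < llc->nr_cpus * LLC_AGGR_THRESH_PCT;
}

int main(void)
{
        struct llc_stats llc = { .nr_running = 10, .nr_cpus = 16 };

        printf("stick to preferred LLC: %s\n",
               keep_preferred_llc(&llc) ? "yes" : "no");
        return 0;
}

Lowering such a threshold would start spreading the load before the
LLC is saturated, which is roughly where schbench starts to hurt in
my runs.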
>
>>
>> o client-server benchmarks where clients and servers are threads
>> from different processes (netserver-netperf, tbench_srv-tbench,
>> services of DeathStarBench) seem to noticeably regress due to
>> lack of co-location between the communicating client and server.
>>
>> Not sure if WF_SYNC can be an indicator to temporarily ignore
>> the preferred LLC hint.
>
> Currently we do not aggregate tasks from different processes.
> The case where client and server actually reside on the same
> system I think is the exception rather than the rule for real
> workloads where clients and servers reside on different systems.
>
> But I do see tasks from different processes talking to each
> other via pipe/socket in real workloads. Do you know of good
> use cases for such a scenario that would justify extending task
> aggregation to multiple processes?
We've seen cases with Kubernetes deployments where co-locating
processes of different services from the same pod can help with
throughput and latency. Perhaps it can happen indirectly where
co-location on WF_SYNC can actually help increase the cache
occupancy for the other process and they both arrive at the
same preferred LLC. I'll see if I can get my hands on a setup
which is closer to these real-world deployments.
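
To make the WF_SYNC idea a bit more concrete, below is a small
stand-alone model (again, not the actual code from this series; the
names, values, and the decision rule are my assumptions) of what
"temporarily ignore the preferred LLC hint on a synchronous wakeup"
could look like:

/* Stand-alone sketch; WF_SYNC here is a stand-in for the kernel flag. */
#include <stdio.h>

#define WF_SYNC         0x1     /* waker sleeps right after the wakeup */

struct task_hint {
        int preferred_llc;      /* LLC suggested by occupancy tracking */
};

static int pick_llc(const struct task_hint *wakee, int waker_llc,
                    int wake_flags)
{
        /* Synchronous wakeup: follow the waker, ignore the hint for now. */
        if (wake_flags & WF_SYNC)
                return waker_llc;

        return wakee->preferred_llc;
}

int main(void)
{
        struct task_hint server = { .preferred_llc = 3 };

        printf("async wake -> LLC %d\n", pick_llc(&server, 0, 0));
        printf("sync  wake -> LLC %d\n", pick_llc(&server, 0, WF_SYNC));
        return 0;
}

For the netperf/tbench style pairs, something along these lines would
pull the wakee onto the waker's LLC for the sync wakeup instead of
bouncing it back to its own preferred LLC.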
>
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads, bringing
>>    down performance in ~50% of the runs.
>>
>
> Yes, stream does not have a cache benefit from co-locating threads, and
> gets hurt from sharing common resources like the memory controller.
>
>
>> Full data from my testing is as follows:
>>
>> o Machine details
>>
>> - 3rd Generation EPYC System
>> - 2 sockets each with 64C/128T
>> - NPS1 (Each socket is a NUMA node)
>> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>>
>>
>> ==================================================================
>> Test : Various longer running benchmarks
>> Units : %diff in throughput reported
>> Interpretation: Higher is better
>> Statistic : Median
>> ==================================================================
>> Benchmarks: %diff
>> ycsb-cassandra -0.99%
>> ycsb-mongodb -0.96%
>> deathstarbench-1x -2.09%
>> deathstarbench-2x -0.26%
>> deathstarbench-3x -3.34%
>> deathstarbench-6x -3.03%
>> hammerdb+mysql 16VU -2.15%
>> hammerdb+mysql 64VU -3.77%
>>
>
> The clients and servers of the benchmarks are co-located on the same
> system, right?
Yes, that is correct. I'm using a 2P system and our runner scripts
pin the workload to the first socket while the workload driver runs
from the second socket. One side effect of this is that changes can
influence the placement of the workload driver, which can lead to
some inconsistencies. I'll check if the stats for the workload
driver are way off between the baseline and this series.
--
Thanks and Regards,
Prateek