Message-ID: <2c72e2ada1bcc86053c01c67ba4a03cf1b4f132d.camel@linux.intel.com>
Date: Tue, 24 Jun 2025 17:30:37 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "Gautham R .
Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
Schneider <vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent
Guittot <vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>, Madadi Vineeth Reddy
<vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, Len Brown
<len.brown@...el.com>, linux-kernel@...r.kernel.org, Chen Yu
<yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
On Tue, 2025-06-24 at 10:30 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> I have similar observations from my testing.
>
>
Prateek,
Thanks for the testing that you did. Much appreciated.
Some follow-ups to Chen, Yu's comments.
>
> o Benchmarks that prefer co-location and run in threaded mode see
> a benefit, including hackbench at high utilization and schbench
> at low utilization.
>
> o schbench (both new and old but particularly the old) regresses
> quite a bit on the tail latency metric when #workers crosses the
> LLC size.
Will take a closer look at the cases where #workers just exceeds the
LLC size. Perhaps adjusting the threshold to spread the load earlier,
at a lower LLC utilization, will help.
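As a rough illustration of what "spreading earlier" would mean, here is
a minimal sketch. The helper name and the spread_pct knob are
hypothetical and are not taken from the patch set; the point is only
that lowering the percentage stops aggregation before the LLC is full.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not code from this series: report whether the
 * preferred LLC may still take another thread of the process.
 *
 * nr_running: threads of the process already running in the LLC
 * llc_cpus:   number of CPUs sharing the LLC
 * spread_pct: utilization (in percent) at which we start spreading
 */
static bool llc_can_aggregate(unsigned int nr_running,
                              unsigned int llc_cpus,
                              unsigned int spread_pct)
{
	/* nr_running / llc_cpus < spread_pct / 100, without the division */
	return nr_running * 100 < llc_cpus * spread_pct;
}
```

With a 16-CPU LLC, spread_pct = 100 keeps aggregating up to 15 running
threads, while spread_pct = 75 stops once 12 are running.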
>
> o client-server benchmarks where client and servers are threads
> from different processes (netserver-netperf, tbench_srv-tbench,
> services of DeathStarBench) seem to noticeably regress due to
> lack of co-location between the communicating client and server.
>
> Not sure if WF_SYNC can be an indicator to temporarily ignore
> the preferred LLC hint.
Currently we do not aggregate tasks from different processes.
I think the case where client and server reside on the same
system is the exception rather than the rule; in real workloads
clients and servers usually reside on different systems.
But I do see tasks from different processes talking to each
other via pipe/socket in real workloads. Do you know of good
use cases for such a scenario that would justify extending task
aggregation to multiple processes?
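To make sure we are talking about the same thing, a rough sketch of the
WF_SYNC idea might look like the following. All names are hypothetical,
SKETCH_WF_SYNC merely stands in for the kernel's WF_SYNC wake flag, and
none of this is code from the series.

```c
/* Stand-in for the kernel's WF_SYNC wake flag; the value is illustrative. */
#define SKETCH_WF_SYNC 0x10

/*
 * Hypothetical wakeup-time choice between the preferred LLC hint and
 * the waker's LLC. A negative LLC id means "unknown / not set".
 */
static int pick_target_llc(int wake_flags, int preferred_llc, int waker_llc)
{
	/* Sync wakeup: the waker is about to sleep, so stay close to it. */
	if ((wake_flags & SKETCH_WF_SYNC) && waker_llc >= 0)
		return waker_llc;

	/* Otherwise honor the preferred LLC hint when there is one. */
	return preferred_llc >= 0 ? preferred_llc : waker_llc;
}
```

The hint is only bypassed for that one wakeup; the preferred LLC itself
is left untouched.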
>
> o stream regresses in some runs where the occupancy metrics trip
> and assign a preferred LLC for all the stream threads bringing
> down performance in ~50% of the runs.
>
Yes, stream gets no cache benefit from co-locating threads, and
it gets hurt by sharing common resources like the memory controller.
> Full data from my testing is as follows:
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
>
> ==================================================================
> Test          : Various longer running benchmarks
> Units         : %diff in throughput reported
> Interpretation: Higher is better
> Statistic     : Median
> ==================================================================
> Benchmarks:              %diff
> ycsb-cassandra          -0.99%
> ycsb-mongodb            -0.96%
> deathstarbench-1x       -2.09%
> deathstarbench-2x       -0.26%
> deathstarbench-3x       -3.34%
> deathstarbench-6x       -3.03%
> hammerdb+mysql 16VU     -2.15%
> hammerdb+mysql 64VU     -3.77%
>
The clients and servers of these benchmarks are co-located on the
same system, right?
> >
> > This patch set is applied on v6.15 kernel.
> >
> > There is some further work needed for future versions of this
> > patch set. We will need to align NUMA balancing with LLC aggregation
> > such that LLC aggregation will align with the preferred NUMA node.
> >
> > Comments and tests are much appreciated.
>
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
>
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
>
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
>
Will do. Thanks.
Tim