Message-ID: <8c2f4839-20d5-4ac6-a52a-b0a8986781cb@intel.com>
Date: Fri, 4 Jul 2025 18:09:49 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, "Libo
Chen" <libo.chen@...cle.com>, Abel Wu <wuyun.abel@...edance.com>, "Madadi
Vineeth Reddy" <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>,
"Len Brown" <len.brown@...el.com>, <linux-kernel@...r.kernel.org>, Peter
Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "Gautham R .
Shenoy" <gautham.shenoy@....com>, K Prateek Nayak <kprateek.nayak@....com>,
Tim Chen <tim.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
On 7/4/2025 4:00 AM, Shrikanth Hegde wrote:
>
>>
>> tl;dr
>>
>> o Benchmarks that prefer co-location and run in threaded mode see
>> a benefit including hackbench at high utilization and schbench
>> at low utilization.
>>
>> o schbench (both new and old but particularly the old) regresses
>> quite a bit on the tail latency metric when #workers cross the
>> LLC size.
>>
>> o client-server benchmarks where client and servers are threads
>> from different processes (netserver-netperf, tbench_srv-tbench,
>> services of DeathStarBench) seem to noticeably regress due to
>> lack of co-location between the communicating client and server.
>>
>> Not sure if WF_SYNC can be an indicator to temporarily ignore
>> the preferred LLC hint.
>>
>> o stream regresses in some runs where the occupancy metrics trip
>> and assign a preferred LLC for all the stream threads bringing
>> down performance in ~50% of the runs.
>>
>
> - When you have SMT systems, threads will go faster if they run in ST mode.
> If aggregation happens in an LLC, they might end up with lower IPC.
>
OK, the number of SMT threads per core should also be taken into
account when deciding how aggressively to aggregate tasks.

Regarding the stream regression: it is caused by the working set size.
With the 2.9G working set in Prateek's test scenario, task aggregation
causes a regression. If we reduce the working set to a smaller value,
say 512MB, the regression disappears. We are therefore trying to tune
the aggregation by comparing the process's RSS with the L3 cache size.
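
To make the two ideas above concrete, here is a rough sketch. It is not
code from the patch set: the function names, the plain "RSS fits in L3"
comparison, and the "one thread per physical core" SMT policy are all
assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch set): allow aggregation onto a
 * preferred LLC only when the process's resident set fits in the L3.
 */
static bool llc_aggregation_allowed(uint64_t rss_bytes, uint64_t l3_bytes)
{
	return rss_bytes <= l3_bytes;
}

/*
 * Hypothetical helper: cap on how many threads to pack into one LLC.
 * Assumed policy: one thread per physical core, so that SMT siblings
 * stay idle and the packed threads keep running in ST mode.
 */
static unsigned int llc_aggregation_limit(unsigned int llc_cpus,
					  unsigned int smt_per_core)
{
	if (smt_per_core == 0)
		smt_per_core = 1;	/* treat unknown topology as non-SMT */
	return llc_cpus / smt_per_core;
}
```

With an assumed 32MB L3, a 2.9G RSS would fail the check and the stream
threads would not be given a preferred LLC. The right threshold is still
open; a plain comparison is only the simplest possible form.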
thanks,
Chenyu