linux-kernel - Re: [RFC patch v3 00/20] Cache aware scheduling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2c72e2ada1bcc86053c01c67ba4a03cf1b4f132d.camel@linux.intel.com>
Date: Tue, 24 Jun 2025 17:30:37 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
 <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "Gautham R .
 Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider <vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent
 Guittot <vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
 Abel Wu <wuyun.abel@...edance.com>, Madadi Vineeth Reddy
 <vineethr@...ux.ibm.com>,  Hillf Danton <hdanton@...a.com>, Len Brown
 <len.brown@...el.com>, linux-kernel@...r.kernel.org, Chen Yu
 <yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling

On Tue, 2025-06-24 at 10:30 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> I have similar observation from my testing.
> 
> 
Prateek,

Thanks for the testing that you did. Much appreciated.
Some follow up to Chen, Yu's comments.

> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 
> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.

Will take closer look at the cases where #workers just exceed LLC size.
Perhaps adjusting the threshold to spread the load earlier at a
lower LLC utilization will help.

> 
> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.

Currently we do not aggregate tasks from different processes.
The case where client and server actually reside on the same
system I think is the exception rather than the rule for real
workloads where clients and servers reside on different systems.

But I do see tasks from different processes talking to each
other via pipe/socket in real workload.  Do you know of good
use cases for such scenario that would justify extending task
aggregation to multi-processes?

> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

Yes, stream does not have cache benefit from co-locating threads, and
get hurt from sharing common resource like memory controller.

> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra              -0.99%
>      ycsb-mongodb                -0.96%
>      deathstarbench-1x           -2.09%
>      deathstarbench-2x           -0.26%
>      deathstarbench-3x           -3.34%
>      deathstarbench-6x           -3.03%
>      hammerdb+mysql 16VU         -2.15%
>      hammerdb+mysql 64VU         -3.77%
> 

The clients and server of the benchmarks are co-located on the same
system, right?

> > 
> > This patch set is applied on v6.15 kernel.
> >   
> > There are some further work needed for future versions in this
> > patch set.  We will need to align NUMA balancing with LLC aggregations
> > such that LLC aggregation will align with the preferred NUMA node.
> > 
> > Comments and tests are much appreciated.
> 
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
> 
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
> 
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
> 

Will do.  Thanks.

Tim