Message-ID: <dcb83693-d57e-40d8-baeb-fb49b61d7750@amd.com>
Date: Wed, 25 Jun 2025 09:49:00 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
 Abel Wu <wuyun.abel@...edance.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
 linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...ux.intel.com>,
 Peter Zijlstra <peterz@...radead.org>,
 "Gautham R . Shenoy" <gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling

Hello Chenyu,

On 6/24/2025 5:46 PM, Chen, Yu C wrote:

[..snip..]

>> tl;dr
>>
>> o Benchmarks that prefer co-location and run in threaded mode see
>>    a benefit, including hackbench at high utilization and schbench
>>    at low utilization.
>>
> 
> Previously, we tested hackbench with one group using different
> fd pairs. The number of fds (1–6) was lower than the number
> of CPUs (8) within one CCX. If I understand correctly, the
> default number of fd pairs in hackbench is 20.

Yes, that is correct. I'm using the default configuration with
20 messengers and 20 receivers over 20 fd pairs. I'll check
if changing this to nr_llc and nr_llc / 2 makes a difference.
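
For reference, each fd pair boils down to roughly the sender/receiver
pattern below (a minimal standalone sketch, not hackbench itself;
MSG_SIZE and LOOPS are only illustrative values). With the default 20
pairs per group that is 40 communicating tasks, well above the 8 CPUs
of a single CCX:

/*
 * One sender thread and one receiver thread exchanging small messages
 * over a socketpair; hackbench's threaded mode runs many of these
 * concurrently. Build with: gcc -O2 -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MSG_SIZE 100
#define LOOPS    10000

static int sv[2];	/* sv[0]: sender end, sv[1]: receiver end */

static void *sender(void *arg)
{
	char buf[MSG_SIZE];

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < LOOPS; i++)
		if (write(sv[0], buf, sizeof(buf)) < 0)
			perror("write");
	return NULL;
}

static void *receiver(void *arg)
{
	char buf[MSG_SIZE];
	long total = 0;

	/* Consume everything the sender wrote, tolerating short reads. */
	while (total < (long)MSG_SIZE * LOOPS) {
		ssize_t n = read(sv[1], buf, sizeof(buf));
		if (n <= 0)
			break;
		total += n;
	}
	return NULL;
}

int main(void)
{
	pthread_t s, r;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv)) {
		perror("socketpair");
		return 1;
	}
	pthread_create(&r, NULL, receiver, NULL);
	pthread_create(&s, NULL, sender, NULL);
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	return 0;
}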

> We might need
> to handle cases where the number of threads (nr_thread)
> exceeds the number of CPUs per LLC—perhaps by
> skipping task aggregation in such scenarios.
> 
>> o schbench (both new and old but particularly the old) regresses
>>    quite a bit on the tail latency metric when #workers crosses the
>>    LLC size.
>>
> 
> As mentioned above, reconsidering nr_thread vs nr_cpus_per_llc
> could mitigate the issue. Besides, introducing a rate limit
> for cache-aware aggregation might help.
> 
>> o client-server benchmarks where client and servers are threads
>>    from different processes (netserver-netperf, tbench_srv-tbench,
>>    services of DeathStarBench) seem to noticeably regress due to
>>    lack of co-location between the communicating client and server.
>>
>>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>>    the preferred LLC hint.
> 
> WF_SYNC is used in the wakeup path, while the current v3 version does
> the task aggregation in the load balance path. We'll look into this
> C/S scenario.
> 
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads, bringing
>>    down performance in !50% of the runs.
>>
> 
> May I know if you tested stream with mmtests under OMP mode,
> and what stream-10 and stream-100 mean?

I'm using STREAM in OMP mode. The "10" and "100" refer to the
"NTIMES" argument. I'm passing this when building the binary:

     gcc -DSTREAM_ARRAY_SIZE=$ARRAY_SIZE -DNTIMES=$NUM_TIMES -fopenmp -O2 stream.c -o stream

This repeats the main loop of the stream benchmark NTIMES times. 10
iterations are used to spot any imbalances during shorter runs of
bandwidth-intensive tasks, and 100 iterations are used to spot trends /
the ability to correct an incorrect placement over a longer run.
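
To make what NTIMES controls concrete, the timed region looks roughly
like the self-contained sketch below (simplified from stream.c, not the
verbatim benchmark; only the four kernels are kept):

/* Build with: gcc -DSTREAM_ARRAY_SIZE=... -DNTIMES=... -fopenmp -O2 */
#include <stdio.h>

#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE 10000000
#endif
#ifndef NTIMES
#define NTIMES 10
#endif

static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

int main(void)
{
	const double scalar = 3.0;
	long j;
	int k;

	/* Initialise the arrays; with the usual static schedule each thread
	 * ends up working on its own slice, no data is shared between threads. */
#pragma omp parallel for
	for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
		a[j] = 1.0;
		b[j] = 2.0;
		c[j] = 0.0;
	}

	/* Main timed loop: NTIMES=10 gives a short run that is sensitive to
	 * an initial bad placement, NTIMES=100 a longer run where the
	 * scheduler has time to correct it. */
	for (k = 0; k < NTIMES; k++) {
#pragma omp parallel for
		for (j = 0; j < STREAM_ARRAY_SIZE; j++)	/* Copy  */
			c[j] = a[j];
#pragma omp parallel for
		for (j = 0; j < STREAM_ARRAY_SIZE; j++)	/* Scale */
			b[j] = scalar * c[j];
#pragma omp parallel for
		for (j = 0; j < STREAM_ARRAY_SIZE; j++)	/* Add   */
			c[j] = a[j] + b[j];
#pragma omp parallel for
		for (j = 0; j < STREAM_ARRAY_SIZE; j++)	/* Triad */
			a[j] = b[j] + scalar * c[j];
	}

	printf("done: %d iterations over %ld-element arrays\n", k, (long)STREAM_ARRAY_SIZE);
	return 0;
}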

> Stream is an example
> where all threads have their private memory buffers—no
> interaction with each other. For this benchmark, spreading
> them across different Nodes gets higher memory bandwidth because
> stream allocates the buffer to be at least 4X the L3 cache size.
> We lack a metric that can indicate when threads share a lot of
> data (e.g., both Thread 1 and Thread 2 read from the same
> buffer). In such cases, we should aggregate the threads;
> otherwise, do not aggregate them (as in the stream case).
> On the other hand, stream-omp seems like an unrealistic
> scenario: if threads do not share buffers, why create them
> in the same process?

I'm not very sure why that is the case, but from what I know, HPC
heavily relies on OMP, and I believe using threads can reduce
the overhead of fork + join when the amount of parallelism in
OMP loops varies?
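
As a small aside (my own minimal illustration, not something from the
patch set): consecutive OMP parallel regions within one process are
typically served by the same team of worker threads sharing one address
space, so varying the amount of parallel work per loop does not mean
forking and joining new processes each time:

/* Build with: gcc -O2 -fopenmp */
#include <omp.h>
#include <stdio.h>

int main(void)
{
	/* First parallel region: the runtime spins up its worker team. */
#pragma omp parallel
	{
#pragma omp single
		printf("region 1: %d threads\n", omp_get_num_threads());
	}

	/* Second region: the same thread pool is typically re-used, and all
	 * workers see the same globals/heap; no new processes are created. */
#pragma omp parallel
	{
#pragma omp single
		printf("region 2: %d threads\n", omp_get_num_threads());
	}
	return 0;
}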

[..snip..]

>>
>>>
>>> This patch set is applied on the v6.15 kernel.
>>> There is some further work needed for future versions of this
>>> patch set.  We will need to align NUMA balancing with LLC aggregations
>>> such that LLC aggregation will align with the preferred NUMA node.
>>>
>>> Comments and tests are much appreciated.
>>
>> I'll rerun the test once with the SCHED_FEAT() disabled just to make
>> sure I'm not regressing because of some other factors. For the major
>> regressions, I'll get the "perf sched stats" data to see if anything
>> stands out.
> 
> It seems that tasks migrating and bouncing between their preferred
> and non-preferred LLCs is one symptom that caused the regression.

That could be the case! I'll also include some migration data to see
if it reveals anything.

-- 
Thanks and Regards,
Prateek

