Message-ID: <CAJUgMyLDwyK-WgNFOr7bGmXPG9eAEnG7mNtjfPSTeJnJT8bAVg@mail.gmail.com>
Date: Fri, 18 Apr 2025 16:58:59 -0400
From: ZECHENG LI <zecheng@...gle.com>
To: 20250402212904.8866-1-zecheng@...gle.com
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Xu Liu <xliuprof@...gle.com>, 
	Blake Jones <blakejones@...gle.com>, Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org, 
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to
 reduce cache misses

Hi Madadi Vineeth Reddy,

> This patch is based on reordering optimizations for 64-byte cacheline
> systems. On 128-byte L1 D-cache systems like Power10, this might or
> might not be beneficial. Moreover, a lot of space (almost half) would
> be wasted on the cache line due to APIs like
> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
> that may restrict the size to 64 bytes.
>
> Since this is in generic code, any ideas on how to make sure that
> other architectures with different cache size don't suffer?

We propose to conditionally align to the cacheline boundary only when
the cacheline size is 64 bytes, since this is the most common size.

For architectures with 128-byte cachelines (like PowerPC), this
approach still collocates the hot fields, so they gain some
performance benefit from improved locality, but it does not enforce
alignment to the larger 128-byte boundary. This avoids wasting cache
space on those architectures through alignment padding, while still
keeping the frequently accessed fields together.
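
Roughly, the idea looks like the sketch below (illustrative only, not
the actual patch: ASSUMED_L1_CACHE_BYTES stands in for the kernel's
L1_CACHE_BYTES, and the macro, struct, and field names are made up):

#define ASSUMED_L1_CACHE_BYTES 64  /* stand-in for L1_CACHE_BYTES */

#if ASSUMED_L1_CACHE_BYTES == 64
/* 64-byte lines: start the hot group on its own cacheline */
#define HOT_GROUP_ALIGNED __attribute__((aligned(64)))
#else
/* wider lines (e.g. 128 bytes): keep fields adjacent, skip the padding */
#define HOT_GROUP_ALIGNED
#endif

struct example_cfs_rq {
        /* hot fields kept adjacent so they share at most one 64-byte line */
        unsigned long      load_weight;
        unsigned int       nr_running;
        unsigned long long min_vruntime;
} HOT_GROUP_ALIGNED;

This keeps the frequently accessed fields adjacent on every
architecture, and only pays the alignment padding where the line size
actually matches 64 bytes.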

> Due to the reordering of the fields, some workloads might take a hit.
> Maybe try running workloads of different kinds (latency- and
> throughput-oriented) and make sure the regression is not high.

For workloads running without a cgroup hierarchy, we expect only a
small performance impact: there is only one cfs_rq per CPU in this
scenario, and it is likely already in cache due to frequent access.

For workloads with a cgroup hierarchy, I have tested sysbench threads
and hackbench --thread; there is no obvious regression.

Heavy load on 1024 instances of sysbench:

Latency (ms)    After-patch    Original
avg avg         2133.51        2150.97
avg min         21.9629        20.9413
avg max         5955.8         5966.78

Avg runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
in a cgroup hierarchy:
After-patch: 34.9458s, Original: 36.8647s

We plan to include more benchmark results in the v2 patch. Do you have
suggestions for other benchmarks you would like us to test?

Regards,
Zecheng
