Message-ID: <1d0a5987-1aa9-450e-a37e-97bbefeaa649@linux.ibm.com>
Date: Sun, 20 Apr 2025 09:08:23 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: ZECHENG LI <zecheng@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        Xu Liu <xliuprof@...gle.com>, Blake Jones <blakejones@...gle.com>,
        Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to
 reduce cache misses

On 19/04/25 02:28, ZECHENG LI wrote:
> Hi Madadi Vineeth Reddy,
> 
>> This patch is based on reordering optimizations for 64-byte cacheline
>> systems. In the case of 128-byte L1 D-cache systems like Power10, this
>> might or might not be beneficial. Moreover, a lot of space (almost half
>> of the cache line) would be wasted due to APIs like
>> `__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
>> that may restrict the group size to 64 bytes.
>>
>> Since this is in generic code, any ideas on how to make sure that
>> other architectures with different cache size don't suffer?
> 
> We propose to conditionally align to the cacheline boundary only when
> the cacheline size is 64 bytes, since this is the most common size.
> 
> For architectures with 128-byte cachelines (like PowerPC), this
> approach will still collocate the hot fields, providing some
> performance benefit from improved locality, but it will not enforce
> alignment to the larger 128-byte boundary. This avoids wasting cache

I don't see a check that enforces the alignment only for 64 bytes. IIUC,
the macros apply the alignment unconditionally based on the
arch-specific cacheline size. I might be missing something; could you
clarify this?

> space on those architectures due to padding introduced by the
> alignment, while still gaining benefits from collocating frequently
> accessed fields.
> 
>> Due to the reordering of the fields, there might be some workloads
>> that could take a hit. Maybe try running workloads of different
>> kinds (latency- and throughput-oriented) and make sure that the
>> regression is not high.
> 
> For workloads running without a cgroup hierarchy, we expect a small
> performance impact. This is because there is only one cfs_rq per CPU
> in this scenario, which is likely in cache due to frequent access.
> 
> For workloads with a cgroup hierarchy, I have tested sysbench threads
> and hackbench --thread; there is no obvious regression.
> 
> Heavy load on 1024 instances of sysbench:
> Latency (ms), after-patch, original
> avg avg: 2133.51, 2150.97
> avg min: 21.9629, 20.9413
> avg max: 5955.8, 5966.78
> 
> Avg runtime for 1024 instances of ./hackbench --thread -g 2 -l 1000
> in a cgroup hierarchy:
> After-patch: 34.9458s, Original: 36.8647s
> 
> We plan to include more benchmark results in the v2 patch. Do you have
> suggestions for other benchmarks you would like us to test?
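
For reference, the hackbench numbers above work out to roughly a 5%
runtime reduction; a quick awk check using the values from the mail:

```shell
# Relative runtime improvement from the hackbench averages quoted above
# (original 36.8647 s vs. after-patch 34.9458 s).
awk 'BEGIN { orig = 36.8647; patched = 34.9458;
             printf "improvement: %.2f%%\n", 100 * (orig - patched) / orig }'
```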

Maybe some throughput-oriented workloads like ebizzy and sysbench, and
also some real-life workloads, would be good to include.

Thanks,
Madadi Vineeth Reddy

> 
> Regards,
> Zecheng
