Message-ID: <tencent_6FC67FBE2D41106D17474BDCC318C1909D07@qq.com>
Date: Thu, 19 Jun 2025 22:12:03 +0800
From: Yangyu Chen <cyy@...self.name>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
> On 19 Jun 2025, at 21:21, Chen, Yu C <yu.c.chen@...el.com> wrote:
>
> On 6/19/2025 2:39 PM, Yangyu Chen wrote:
>> Nice work!
>> I've tested your patch based on commit fb4d33ab452e and found it
>> incredibly helpful for Verilator with large RTL simulations like
>> XiangShan [1] on AMD EPYC Genoa.
>> I've created a simple benchmark [2] using a static build of an
>> 8-thread Verilator of XiangShan. Simply clone the repository and
>> run `make run`.
>> In a statically allocated 8-CCX KVM guest (with a total of 128
>> vCPUs) on an EPYC 9T24, before the patch, the simulation time was
>> 49.348ms. This
>> was because each thread was distributed across every CCX, resulting
>> in extremely high core-to-core latency. However, after applying the
>> patch, the entire 8-thread Verilator is allocated to a single CCX.
>> Consequently, the simulation time was reduced to 24.196ms, which
>> is a remarkable 2.03x faster than before. We don't need numactl
>> anymore!
>> [1] https://github.com/OpenXiangShan/XiangShan
>> [2] https://github.com/cyyself/chacha20-xiangshan
>> Tested-by: Yangyu Chen <cyy@...self.name>
>
> Thanks Yangyu for your test. May I know whether these 8 threads
> share data with each other, or each thread has its own dedicated
> data? Or is there 1 main thread, while the other 7 threads do the
> chacha20 rotate and pass the results back to the main thread?
Ah, I forgot to mention the benchmark details. The workload is not
about chacha20 itself. This benchmark uses an RTL-level simulator
[1] that runs an open-source OoO CPU core called XiangShan [2]. The
chacha20 algorithm is executed on the guest CPU within this simulator.
Verilator partitions the large RTL design into multiple blocks of
functions and distributes them across the threads. The signals
require synchronization on every guest cycle, and additional
synchronization is needed whenever a dependency exists between
blocks. Given that the simulation runs at approximately 5K guest
cycles per second, a significant amount of data has to be
transferred between the threads, and where signal dependencies
exist, performance becomes latency-bound.
[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/XiangShan
Thanks,
Yangyu Chen
> Anyway, I tested it on a Xeon EMR with turbo disabled and saw ~20%
> reduction in the total time.
Nice result!
>
> Thanks,
> Chenyu