Message-ID: <tencent_6FC67FBE2D41106D17474BDCC318C1909D07@qq.com>
Date: Thu, 19 Jun 2025 22:12:03 +0800
From: Yangyu Chen <cyy@...self.name>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
> On 19 Jun 2025, at 21:21, Chen, Yu C <yu.c.chen@...el.com> wrote:
>
> On 6/19/2025 2:39 PM, Yangyu Chen wrote:
>> Nice work!
>> I've tested your patch based on commit fb4d33ab452e and found it
>> incredibly helpful for Verilator with large RTL simulations like
>> XiangShan [1] on AMD EPYC Genoa.
>> I've created a simple benchmark [2] using a static build of an
>> 8-thread Verilator of XiangShan. Simply clone the repository and
>> run `make run`.
>> In a statically allocated 8-CCX KVM guest (with a total of 128
>> vCPUs) on an EPYC 9T24, before the patch, the simulation time was
>> 49.348ms. This
>> was because each thread was distributed across every CCX, resulting
>> in extremely high core-to-core latency. However, after applying the
>> patch, the entire 8-thread Verilator is allocated to a single CCX.
>> Consequently, the simulation time was reduced to 24.196ms, which
>> is a remarkable 2.03x faster than before. We don't need numactl
>> anymore!
>> [1] https://github.com/OpenXiangShan/XiangShan
>> [2] https://github.com/cyyself/chacha20-xiangshan
>> Tested-by: Yangyu Chen <cyy@...self.name>
>
> Thanks Yangyu for your test. May I know whether these 8 threads
> share data with each other, or each thread has its own dedicated
> data? Or is there 1 main thread, while the other 7 threads do the
> chacha20 rotate and pass the results back to the main thread?
Ah, I forgot to mention the benchmark details. The workload is not
about chacha20 itself. This benchmark uses an RTL-level simulator
[1] that runs an open-source OoO CPU core called XiangShan [2]. The
chacha20 algorithm is executed on the guest CPU within this simulator.
Verilator partitions the large RTL design into multiple blocks of
functions and distributes them across the threads. The signals
require synchronization on every guest cycle, and additional
synchronization is needed whenever a dependency exists between
blocks. Given that the simulation runs at approximately 5K guest
cycles per second, a significant amount of data has to be
transferred between the threads, and where signal dependencies
exist, performance becomes latency-bound.
[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/XiangShan
Thanks,
Yangyu Chen
> Anyway, I tested it on a Xeon EMR with turbo disabled and saw ~20%
> reduction in the total time.
Nice result!
>
> Thanks,
> Chenyu