Message-ID: <9dbdfec2-0a2d-4110-b8ef-b9d16900ff87@intel.com>
Date: Thu, 19 Jun 2025 21:21:37 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Yangyu Chen <cyy@...self.name>, Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
On 6/19/2025 2:39 PM, Yangyu Chen wrote:
> Nice work!
>
> I've tested your patch based on commit fb4d33ab452e and found it
> incredibly helpful for Verilator with large RTL simulations like
> XiangShan [1] on AMD EPYC Genoa.
>
> I've created a simple benchmark [2] using a static build of an
> 8-thread Verilator of XiangShan. Simply clone the repository and
> run `make run`.
>
> In a statically allocated 8-CCX KVM guest (with a total of 128 vCPUs)
> on EPYC 9T24, the simulation time before the patch was 49.348 ms.
> This was because the threads were spread across all the CCXs,
> resulting in extremely high core-to-core latency. After applying the
> patch, the entire 8-thread Verilator run is placed on a single CCX.
> Consequently, the simulation time drops to 24.196 ms, which is a
> remarkable 2.03x speedup. We don't need numactl anymore!
>
> [1] https://github.com/OpenXiangShan/XiangShan
> [2] https://github.com/cyyself/chacha20-xiangshan
>
> Tested-by: Yangyu Chen <cyy@...self.name>
>
Thanks, Yangyu, for your test. May I know whether these 8 threads share
any data with each other, or whether each thread has its own dedicated
data? Or is there one main thread, while the other 7 threads do the
chacha20 rotations and pass their results back to the main thread?
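
Just to make sure I understand the structure, below is a rough,
hypothetical pthread sketch of the "1 main thread + 7 workers" case I
have in mind; the names and the placeholder computation are my own
assumptions, not taken from the chacha20-xiangshan benchmark:

/*
 * Hypothetical sketch only: 7 workers compute into thread-private
 * buffers and hand the blocks back to the main thread (shared data).
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NR_WORKERS	7
#define BLOCK_SIZE	64

/* written by the workers, read by the main thread -> shared data */
static unsigned char results[NR_WORKERS][BLOCK_SIZE];

static void *worker(void *arg)
{
	long id = (long)arg;
	unsigned char buf[BLOCK_SIZE];		/* thread-private working set */

	memset(buf, (int)id, BLOCK_SIZE);	/* stand-in for the chacha20 rounds */
	memcpy(results[id], buf, BLOCK_SIZE);	/* hand the block back to main */
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_WORKERS];

	for (long i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (long i = 0; i < NR_WORKERS; i++)
		pthread_join(tid[i], NULL);

	printf("first byte from worker 0: %u\n", results[0][0]);
	return 0;
}

If instead each thread works on fully dedicated data with no sharing,
the trade-off for aggregating them onto one LLC looks quite different,
which is why I am curious about this detail.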
Anyway, I tested it on a Xeon EMR with turbo disabled and saw a ~20%
reduction in the total time.
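
Regarding the numactl point above: before this series, the usual
workaround was to confine the whole process to one CCX by hand, either
with numactl or with an affinity call along the lines of the rough
sketch below (treating CPUs 0-7 as one CCX is just an assumption for
illustration; the real CPU numbering depends on the topology):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	for (int cpu = 0; cpu < 8; cpu++)	/* assume CPUs 0-7 share one L3 */
		CPU_SET(cpu, &set);

	/* pin the calling process; threads created later inherit the mask */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* launch the 8-thread Verilator workload from here ... */
	printf("pinned to %d CPUs\n", CPU_COUNT(&set));
	return 0;
}

With cache-aware scheduling the aggregation onto one LLC happens
automatically, which matches what you observed.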
Thanks,
Chenyu