Message-ID: <E120CFCF-8BFF-44BB-96B7-C70E020E2A31@amazon.com>
Date: Fri, 2 May 2025 18:06:05 +0000
From: "Prundeanu, Cristian" <cpru@...zon.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra
<peterz@...radead.org>
CC: "Mohamed Abuelfotoh, Hazem" <abuehaze@...zon.com>, "Saidi, Ali"
<alisaidi@...zon.com>, Benjamin Herrenschmidt <benh@...nel.crashing.org>,
"Blake, Geoff" <blakgeof@...zon.com>, "Csoma, Csaba" <csabac@...zon.com>,
"Doebel, Bjoern" <doebel@...zon.de>, Gautham Shenoy <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>, Joseph Salisbury
<joseph.salisbury@...cle.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-tip-commits@...r.kernel.org"
<linux-tip-commits@...r.kernel.org>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: EEVDF regression still exists
Hi Prateek,
On 2025-05-02, 01:33, "K Prateek Nayak" <kprateek.nayak@....com> wrote:
>> Could you also provide some information on your LDG machine - its
>> configuration and the kernel it is running (although this shouldn't
>> really matter as long as it is same across runs)
>
> So I'm looking at logs at LDG side which is a 4th Generation EPYC system
> with 192CPUs running the repro on baremetal and I see:
>
> [20250502.061627] [INFO] STARTING TEST
> [20250502.061627] [INFO] 768 VU
>
> 768VU each processing 1000000000000 transactions sent to a 16vCPU
> SUT instance seems like a highly overloaded (and unrealistic) scenario
> but perhaps your LDG is also a similar 16vCPU instance which caps the
> VU at 64?
You're right, my LDG is smaller. I'm using a 64 vCPU 128GB RAM Graviton3
instance (this is mentioned in the test results README [1]), resulting
in 256 VUs.
The VU count should really be based on the SUT core count, and be at least
8 * SUT vCPUs to ensure a full load. Currently the reproducer does not
support querying the SUT vCPUs from the LDG side, which is why it defaults
to using the LDG core count instead - but the assumption of those counts
being correlated needs revisiting.
[1] https://github.com/aws/repro-collection/blob/main/repros/repro-mysql-EEVDF-regression/results/20250428/README.md
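As a rough sketch of the VU sizing discussed above: the helper names below are hypothetical and are not part of the reproducer. The factor of 4 VUs per LDG CPU is only inferred from the two data points in this thread (64 vCPUs giving 256 VUs, 192 CPUs giving 768 VUs); the factor of 8 VUs per SUT vCPU is the minimum stated above.

```python
# Illustrative only: hypothetical helpers, not code from the reproducer.

def current_default_vus(ldg_cpus: int, vus_per_ldg_cpu: int = 4) -> int:
    """VU count derived from the LDG CPU count.

    The default factor of 4 is an assumption inferred from the numbers
    quoted in this thread (64 -> 256, 192 -> 768).
    """
    return ldg_cpus * vus_per_ldg_cpu


def min_vus_for_full_load(sut_vcpus: int, vus_per_sut_vcpu: int = 8) -> int:
    """Minimum VU count for a full load: at least 8 * SUT vCPUs."""
    return sut_vcpus * vus_per_sut_vcpu


if __name__ == "__main__":
    sut_vcpus = 16  # SUT size mentioned in the quoted message
    floor = min_vus_for_full_load(sut_vcpus)
    for ldg_cpus in (64, 192):
        vus = current_default_vus(ldg_cpus)
        print(f"LDG CPUs={ldg_cpus}: VUs={vus}, SUT-based minimum={floor}")
```

Both LDG sizes clear the SUT-based minimum, but they land on very different VU counts for the same SUT, which is the mismatch described above.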
> Currently doing a trial run, staring at logs to see what I need to
> adjust based on the errors. I'll adjust the LDG based on your comments
> and try to reproduce the scenario over the weekend.
Your help is much appreciated!
A couple more thoughts on the setup:
The LDG mainly needs to generate enough load that it does not become a
bottleneck. The same goes for the network connection. At the same time, the
SUT needs a fast enough disk that it doesn't become the limiting factor (I've
seen this issue in the past; the results then show only a minimal difference).