Message-ID: <9aa93862-932c-4a17-a3ba-f6335649e555@arm.com>
Date: Tue, 26 Nov 2024 16:12:04 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Cristian Prundeanu <cpru@...zon.com>
Cc: kprateek.nayak@....com, abuehaze@...zon.com, alisaidi@...zon.com,
 benh@...nel.crashing.org, blakgeof@...zon.com, csabac@...zon.com,
 doebel@...zon.com, gautham.shenoy@....com, joseph.salisbury@...cle.com,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-tip-commits@...r.kernel.org, mingo@...hat.com, peterz@...radead.org,
 x86@...nel.org
Subject: Re: [PATCH 0/2] [tip: sched/core] sched: Disable PLACE_LAG and
 RUN_TO_PARITY and move them to sysctl

On 25/11/2024 12:35, Cristian Prundeanu wrote:
> Here are more results with recent 6.12 code, and also using SCHED_BATCH.
> The control tests were run anew on Ubuntu 22.04 with the current pre-built
> kernels 6.5 (baseline) and 6.8 (regression out of the box).
> 
> When updating mysql from 8.0.30 to 8.4.2, the regression grew even larger.
> Disabling PLACE_LAG and RUN_TO_PARITY improved the results more than
> using SCHED_BATCH.
> 
> Kernel   | default  | NO_PLACE_LAG and | SCHED_BATCH | mysql
>          | config   | NO_RUN_TO_PARITY |             | version
> ---------+----------+------------------+-------------+---------
> 6.8      | -15.3%   |                  |             | 8.0.30
> 6.12-rc7 | -11.4%   | -9.2%            | -11.6%      | 8.0.30
>          |          |                  |             |
> 6.8      | -18.1%   |                  |             | 8.4.2
> 6.12-rc7 | -14.0%   | -10.2%           | -12.7%      | 8.4.2
> ---------+----------+------------------+-------------+---------
> 
> Confidence intervals for all tests are smaller than +/- 0.5%.
> 
> I expect to have the repro package ready by the end of the week. Thank you
> for your collective patience and efforts to confirm these results.

The results I got look different:

SUT kernel arm64 (mysql-8.4.0)

(1) 6.5.13                            baseline
(2) 6.12.0-rc4                          -12.9%
(3) 6.12.0-rc4 NO_PLACE_LAG              +6.4%
(4) v6.12-rc4 SCHED_BATCH               +10.8%

5 test runs each; 95% confidence interval <= ±0.56%
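
For anyone who wants to compare numbers on the same footing, here is a
minimal sketch of how such a bound can be computed from a handful of runs.
It assumes a two-sided Student's t interval expressed relative to the mean;
the exact method used for the figures above is not stated, and the sample
values below are made up purely for illustration.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def ci95_half_width_pct(samples):
    """Half-width of the 95% confidence interval, as a percentage of the mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return 100.0 * T_CRIT_95[n - 1] * sem / mean

# Hypothetical NOPM values for 5 runs (illustrative only, not measured data).
runs = [100000, 100400, 99800, 100100, 99700]
print(f"+/-{ci95_half_width_pct(runs):.2f}%")
```

With only 5 runs the t critical value (2.776) is noticeably larger than the
1.96 of a normal approximation, so the choice of method matters when
comparing bounds between setups.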

(2) is still in sync with your results, but (3) and (4) look far better
for me.
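
In case it helps with reproducing (3): a sketch of one way to flip a
scheduler feature at runtime through the debugfs features file. It assumes
debugfs is mounted at /sys/kernel/debug, a kernel built with scheduler debug
support, and root privileges. The helper names are hypothetical, and this is
not necessarily how the runs above were configured.

```python
from pathlib import Path

# Assumed location; needs debugfs mounted at /sys/kernel/debug and root.
FEATURES = Path("/sys/kernel/debug/sched/features")

def parse_features(text):
    """Map feature name -> enabled, from the space-separated features file."""
    state = {}
    for token in text.split():
        if token.startswith("NO_"):
            state[token[3:]] = False  # "NO_X" means feature X is disabled
        else:
            state[token] = True
    return state

def set_feature(name, enabled, path=FEATURES):
    """Write NAME or NO_NAME to the features file to enable or disable it."""
    path.write_text(name if enabled else f"NO_{name}")

# Example (as root): set_feature("PLACE_LAG", False)
# Then check:        parse_features(FEATURES.read_text())["PLACE_LAG"]
```

The setting is not persistent across reboots, so it has to be reapplied
before each test series.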

Maybe a difference in our test setup can explain the different test results:

I use:

HammerDB Load Generator <-> MySQL SUT
192 VCPUs               <-> 16 VCPUs

Virtual users: 256
Warehouse count: 64
3 min rampup
10 min test run time
performance metric: NOPM (New Orders Per Minute)

So I have 256 'connection' tasks running on the 16 SUT VCPUs.

> On 2024-11-01, Peter Zijlstra wrote:
> 
>>> (At the risk of stating the obvious, using SCHED_BATCH only to get back to 
>>> the default CFS performance is still only a workaround,
>>
>> It is not really -- it is impossible to schedule all the various
>> workloads without them telling us what they really like. The quest is to
>> find interfaces that make sense and are implementable. But fundamentally
>> tasks will have to start telling us what they need. We've long since ran
>> out of crystal balls.
> 
> Completely agree that the best performance is obtained when the tasks are
> individually tuned to the scheduler and explicitly set running parameters.
> This isn't different from before.
> 
> But shouldn't our gold standard for default performance be CFS? There is a
> significant regression out of the box when using EEVDF; how is seeking
> additional tuning just to recover the lost performance not a workaround?
> 
> (Not to mention that this additional tuning means shifting the burden on
> many users who may not be familiar enough with scheduler functionality.
> We're essentially asking everyone to spend considerable effort to maintain
> status quo from kernel 6.5.)
> 
> 
> On 2024-11-14, Joseph Salisbury wrote:
> 
>> This is a confirmation that we are also seeing a 9% performance
>> regression with the TPCC benchmark after v6.6-rc1.  We narrowed down the
>> regression was caused due to commit:
>> 86bfbb7ce4f6 ("sched/fair: Add lag based placement")
>>
>> This regression was reported via this thread:
>> https://lore.kernel.org/lkml/1c447727-92ed-416c-bca1-a7ca0974f0df@oracle.com/
>>
>> Phil Auld suggested to try turning off the PLACE_LAG sched feature. We
>> tested with NO_PLACE_LAG and can confirm it brought back 5% of the
>> performance loss.  We do not yet know what effect NO_PLACE_LAG will have
>> on other benchmarks, but it indeed helps TPCC.
> 
> Thank you for confirming the regression. I've been monitoring performance
> on the v6.12-rcX tags since this thread started, and the results have been
> largely constant.
> 
> I've also tested other benchmarks to verify whether (1) the regression
> exists and (2) the patch proposed in this thread negatively affects them.
> On postgresql and wordpress/nginx there is a regression which is improved
> when applying the patch; on mongo and mariadb no regression manifested, and
> the patch did not make their performance worse.
> 
> 
> On 2024-11-19, Dietmar Eggemann wrote:
> 
>> #cat /etc/systemd/system/mysql.service
>>
>> [Service]
>> CPUSchedulingPolicy=batch
>> ExecStart=/usr/local/mysql/bin/mysqld_safe
> 
> This is the approach I used as well to get the results above.

OK.

>> My hunch is that this is due to the 'connection' threads (1 per virtual
>> user) running in SCHED_BATCH. I yet have to confirm this by only
>> changing the 'connection' tasks to SCHED_BATCH.
> 
> Did you have a chance to run with this scenario?

Yeah, I did. The results were worse than running all mysqld threads in
SCHED_BATCH but still better than the baseline.

(5) v6.12-rc4 'connection' tasks in SCHED_BATCH      +6.8%
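
A rough sketch of how only the 'connection' tasks could be moved to
SCHED_BATCH, for anyone who wants to try (5). It assumes the threads can be
identified by a comm name starting with "connection" under
/proc/<pid>/task; the function names are hypothetical and this is not
necessarily the exact procedure used for the number above.

```python
import os

def threads_named(pid, prefix, proc="/proc"):
    """Return the TIDs of PID's threads whose comm starts with PREFIX."""
    tids = []
    task_dir = os.path.join(proc, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "comm")) as f:
                comm = f.read().strip()
        except FileNotFoundError:
            continue  # thread exited while we were scanning
        if comm.startswith(prefix):
            tids.append(int(tid))
    return sorted(tids)

def set_batch(tids):
    """Switch each thread to SCHED_BATCH (Linux only; priority must be 0)."""
    for tid in tids:
        os.sched_setscheduler(tid, os.SCHED_BATCH, os.sched_param(0))

# Example (as root, with mysqld_pid known):
#   set_batch(threads_named(mysqld_pid, "connection"))
```

A single scan only covers threads that exist at that moment, so it would
have to run after all virtual users have connected (or be repeated during
rampup).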