linux-kernel - Re: EEVDF regression still exists

Message-ID: <d6692902-837a-4f30-913b-763f01a5a7ea@arm.com>
Date: Wed, 14 May 2025 23:26:18 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: "Prundeanu, Cristian" <cpru@...zon.com>,
 Peter Zijlstra <peterz@...radead.org>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
 "Mohamed Abuelfotoh, Hazem" <abuehaze@...zon.com>,
 "Saidi, Ali" <alisaidi@...zon.com>,
 Benjamin Herrenschmidt <benh@...nel.crashing.org>,
 "Blake, Geoff" <blakgeof@...zon.com>, "Csoma, Csaba" <csabac@...zon.com>,
 "Doebel, Bjoern" <doebel@...zon.de>, Gautham Shenoy
 <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>,
 Joseph Salisbury <joseph.salisbury@...cle.com>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-tip-commits@...r.kernel.org" <linux-tip-commits@...r.kernel.org>,
 "x86@...nel.org" <x86@...nel.org>, Chris Redpath <Chris.Redpath@....com>
Subject: Re: EEVDF regression still exists

+ Chris Redpath <Chris.Redpath@....com>

On 02/05/2025 18:52, Prundeanu, Cristian wrote:
> On 2025-05-02, 03:50, "Peter Zijlstra" <peterz@...radead.org> wrote:
> 
>> On Thu, May 01, 2025 at 04:16:07PM +0000, Prundeanu, Cristian wrote:
>>
>>> (Please keep in mind that the target isn't to get SCHED_BATCH to the same
>>> level as 6.5-default; it's to resolve the regression from 6.5-default to
>>> 6.6+ default, and from 6.5-SCHED_BATCH to 6.6+ SCHED_BATCH).
>>
>> No, the target definitely is not to make 6.6+ default match 6.5 default.
>>
>> The target very much is getting you performance similar to the 6.5
>> default that you were happy with with knobs we can live with.
> 
> If we're talking about new knobs in 6.6+, absolutely.
> 
> For this particular case, SCHED_BATCH existed before 6.6. Users who already
> enable SCHED_BATCH now have no recourse. We can't, with a straight face,
> claim that this is a sufficient fix, or that there is no regression.
> 
> I am, of course, interested to discuss any knob tweaks as a stop-gap measure.
> (That is also why I proposed moving NO_PLACE_LAG and NO_RUN_TO_PARITY to sysctl
> a few months back: to give users, including distro maintainers, a reasonable
> way to preconfigure their systems in a standard, persistent way, while this is
> being worked on).
> None of this should be considered a permanent solution though. It's not a fix,
> and was never meant to be anything but a short-term relief while debugging the
> regression is ongoing.

I've been running those tests as well, in an environment pretty close to
yours. I use c7g.16xlarge and m7gd.16xlarge ('maxcpus=16 nr_cpus=16') AWS
instances for LoadGen (hammerdb) and SUT (mysqld).

We tried to figure out whether only changing the mysql (SUT) 'connection'
tasks to SCHED_BATCH is sufficient to see a performance uplift. There is
one of those tasks per virtual user.

I ran (1)-(3) (like you) plus (4):


(1) default

(2) NO_PL NO_RTP ... run w/ NO_PLACE_LAG and NO_RUN_TO_PARITY

(3) SCHED_BATCH  ... launch mysqld.service with 'CPUSchedulingPolicy=batch'
                     [/lib/systemd/system/mysql.service]

(4) mysql patch  ... run 'connection' threads as SCHED_BATCH


Kernel   | Runtime      | mysql  | Throughput | P50 latency
aarch64  | parameters   | patch* | (NOPM)     | (larger is worse)
---------+--------------+--------+------------+------------------
6.5      | default      |        |  baseline  |  baseline
         | SCHED_BATCH  |        |  +10.9%    |  -42.9%
         | default      |   x    |   +9.5%    |  -33.0%
---------+--------------+--------+------------+------------------
6.6      | default      |        |   -2.7%    |  -23.7%
         | NO_PL NO_RTP |        |   +4.4%    |   +8.8%
         | SCHED_BATCH  |        |   +4.5%    |    -*
         | default      |   x    |   +4.2%    |  -38.8%
---------+--------------+--------+------------+------------------
6.8      | default      |        |   -3.7%    |    -
         | NO_PL NO_RTP |        |   +2.5%    |  -24.0%
         | SCHED_BATCH  |        |   +6.2%    |  -38.6%
         | default      |   x    |   +2.7%    |  -37.0%
---------+--------------+--------+------------+------------------
6.12     | default      |        |   -6.3%    |    -
         | NO_PL NO_RTP |        |   -4.0%    |  -34.1%
         | SCHED_BATCH  |        |   -2.3%    |  -35.9%
         | default      |   x    |   -2.1%    |  -33.6%
---------+--------------+--------+------------+------------------
6.13     | default      |        |   -7.3%    |   -9.2%
         | NO_PL NO_RTP |        |   -3.7%    |  -35.0%
         | SCHED_BATCH  |        |      0%    |  -38.2%
         | default      |   x    |   -1.7%    |  -34.3%
---------+--------------+--------+------------+------------------
6.14     | default      |        |   -7.3%    |  -19.3%
         | NO_PL NO_RTP |        |   -5.3%    |  -36.6%
         | SCHED_BATCH  |        |   -2.9%    |  -40.1%
         | default      |   x    |   -2.4%    |  -39.0%
---------+--------------+--------+------------+------------------
6.15-rc5 | default      |        |   -9.6%    |  -19.3%
         | NO_PL NO_RTP |        |   -7.7%    |  -34.7%
         | SCHED_BATCH  |        |   -5.1%    |  -38.6%
         | default      |   x    |   -5.6%    |    -
---------+--------------+--------+------------+------------------

'-'* 'repro-regression' didn't provide latency numbers

Looks like (4) is almost as good as (3). And we see this uplift also on
CFS (6.5). The patch below is trivial and easy to apply.
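
Outside the MySQL tree, the per-thread approach in (4) can be sketched as
follows. This is a standalone illustration under stated assumptions, not
part of the patch: it assumes Linux with glibc (g++ defines _GNU_SOURCE,
which exposes SCHED_BATCH), and the helper names set_current_thread_batch()
and policy_name() are hypothetical. On Linux, passing pid 0 to
sched_setscheduler() affects only the calling thread, which is why calling
it from the connection handler thread is sufficient.

```cpp
#include <sched.h>

#include <cerrno>
#include <string>

// Switch the calling thread to SCHED_BATCH.
// Returns 0 on success or the errno value on failure.
int set_current_thread_batch() {
  sched_param param{};  // sched_priority must be 0 for SCHED_BATCH
  if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) return errno;
  return 0;
}

// Map a policy value, as returned by sched_getscheduler(), to a name.
std::string policy_name(int policy) {
  switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
  }
}
```

A thread would call set_current_thread_batch() once at start-up and could
log policy_name(sched_getscheduler(0)) to confirm the change. Unlike the
patch, this sketch leaves the decision on how to react to a failure to the
caller, so a connection need not be dropped if the policy change is refused.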

That said, I also see the policy-unrelated regression you're describing
(especially from '6.8 -> 6.12' and then from '6.14 -> 6.15-rc5').

I will have time over the next couple of days to also look into these
issues using our setup.

---

Patch applied to mysql-8.0-8.0.42 source package (Ubuntu 22.04):

-->8--

From: Chris Redpath <chris.redpath@....com>
Date: Thu, 13 Mar 2025 16:30:13 +0000
Subject: [PATCH] Make sure we use SCHED_BATCH for thread-per-connection mode

Hack in a small change in the thread init code for the handlers to choose
the correct scheduler policy.

Signed-off-by: "Chris Redpath" <chris.redpath@....com>
---
 sql/conn_handler/connection_handler_per_thread.cc | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sql/conn_handler/connection_handler_per_thread.cc b/sql/conn_handler/connection_handler_per_thread.cc
index 68641b55723..10086cd3c6f 100644
--- a/sql/conn_handler/connection_handler_per_thread.cc
+++ b/sql/conn_handler/connection_handler_per_thread.cc
@@ -249,6 +249,7 @@ static void *handle_connection(void *arg) {
       Connection_handler_manager::get_instance();
   Channel_info *channel_info = static_cast<Channel_info *>(arg);
   bool pthread_reused [[maybe_unused]] = false;
+  struct sched_param param = {0};
 
   if (my_thread_init()) {
     connection_errors_internal++;
@@ -260,6 +261,15 @@ static void *handle_connection(void *arg) {
     return nullptr;
   }
 
+  // Set the scheduling policy to SCHED_BATCH
+  if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
+    perror("sched_setscheduler");
+    // Handle the error as needed
+    delete channel_info;
+    my_thread_exit(nullptr);
+    return nullptr;
+  }
+
   for (;;) {
     THD *thd = init_new_thd(channel_info);
     if (thd == nullptr) {
-- 
2.34.1