Message-ID: <ru2hghsvnlke2iqwpdepogwxvh25lfyo7ygvzzfbgmbaii6ann@l6r7nwvd5iqs>
Date: Thu, 15 Jan 2026 10:16:56 +0000
From: Mel Gorman <mgorman@...hsingularity.net>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org, linux-kernel@...r.kernel.org,
Aishwarya TCV <Aishwarya.TCV@....com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> On 08/01/2026 13:15, Ryan Roberts wrote:
> > On 08/01/2026 08:50, Mel Gorman wrote:
> >> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> >>> On 05.01.26 12:45, Ryan Roberts wrote:
> >>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >>>>> On 02.01.26 13:38, Ryan Roberts wrote:
> >>>
> >>> [...]
> >>>
> >>
> >> Sorry for slow responses. I'm not back from holidays yet and unfortunately
> >> do not have access to test machines right now, so I cannot revalidate any
> >> of the results against 6.19-rc*.
> >
> > No problem, thanks for getting back to me!
> >
> >>
> >>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> >>>>>>> statistically significant regression/improvement, where "statistically
> >>>>>>> significant" means the 95% confidence intervals do not overlap".
> >>>>>
> >>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>>>>
> >>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >>>>
> >>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
> >>>
> >>> Well, I assume this would be more valuable.
> >>
> >> Agreed, because we need to know whether it's NEXT_BUDDY that is conceptually
> >> an issue with EEVDF in these cases or just the specific implementation. The
> >> useful comparison would be between
> >>
> >> 6.18A (baseline)
> >> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> >> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> >> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
> >
> > OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
> > tomorrow. Then we can take it from there.
>
> Hi Mel, Dietmar,
>
> Here are the updated results, now including a column for "revert #1 & #2".
>
> 6-18-0 (base) (baseline)
> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
> revert #1 & #2 (NEXT_BUDDY disabled)
> revert #2 (Old NEXT_BUDDY implementation enabled)
>
Thanks.
>
> The regressions that are fixed by "revert #2" (as originally reported) are still
> fixed in "revert #1 & #2". Interestingly, performance actually improves further
> for the latter in the multi-node mysql benchmark (which is our VIP workload).
It suggests that NEXT_BUDDY in general is harmful to this workload. In an
ideal world, this would also be checked against the NEXT_BUDDY implementation
in CFS, but that would be a waste of time for many reasons. I find it
particularly interesting that it is only measurable in the 2-machine test, as
that suggests, but does not prove, that the problem may be related to WF_SYNC
wakeups from the network layer.
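For anyone following along without the scheduler code to hand, a crude
userspace sketch of what a sync-hinted wakeup changes. All names below are
invented for illustration; in the kernel the decision lives around
wake_affine()/select_task_rq_fair() and, if I remember the plumbing
correctly, the hint is the WF_SYNC flag passed down by sync wakeups such as
the socket data-ready path:

#include <stdbool.h>
#include <stdio.h>

struct toy_cpu {
	int id;
	int nr_running;
};

/* If the waker hints "I'm about to sleep" (sync), discount it and prefer its CPU. */
static int toy_select_cpu(const struct toy_cpu *waker_cpu,
			  const struct toy_cpu *prev_cpu, bool sync)
{
	int waker_load = waker_cpu->nr_running - (sync ? 1 : 0);

	if (waker_load <= prev_cpu->nr_running)
		return waker_cpu->id;
	return prev_cpu->id;
}

int main(void)
{
	struct toy_cpu waker = { .id = 0, .nr_running = 2 };
	struct toy_cpu prev  = { .id = 3, .nr_running = 1 };

	printf("non-sync wakeup -> CPU %d\n", toy_select_cpu(&waker, &prev, false));
	printf("sync wakeup     -> CPU %d\n", toy_select_cpu(&waker, &prev, true));
	return 0;
}

With a network-driven client/server split across machines, essentially every
request arrives via that sync-flavoured path, which would make the 2-machine
test the one to expose it, if that theory holds.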
> There are a couple of hackbench cases (sockets with high thread counts) that
> showed an improvement with "revert #2" which is gone with "revert #1 & #2".
>
> Let me know if I can usefully do anything else.
>
> Multi-node SUT (workload running across 2 machines):
>
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% |
> | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
OK, fairly clear there.
> Single-node SUT (workload running on single machine):
>
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% |
> | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
I assume this is specjbb2015. I'm a little cautious of these results as
specjbb2015 focuses on peak performance. It starts with low CPU usage and
scales up to find the point where performance peaks. The metric can be gamed,
and what works for specjbb, particularly as the machine approaches being
heavily utilised and transitions to overloaded, can be problematic. Can you
look at the detailed results for specjbb2015 and determine whether the peak
was picked from different load points?
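To illustrate what I mean by the peak moving (toy numbers below, invented
purely for illustration and nothing to do with your runs): the reported
figure is effectively the best of a series of load points, so a small
per-point shift can move which load level "wins" without the underlying
curve changing much.

#include <stdio.h>

static void report_peak(const char *label, const double jops[], int n)
{
	int best = 0;

	for (int i = 1; i < n; i++)
		if (jops[i] > jops[best])
			best = i;
	printf("%s: peak %.0f jOPS at load point %d\n", label, jops[best], best);
}

int main(void)
{
	/* Invented curves: similar shape, but the peak shifts one load point. */
	const double before[] = { 60000, 90000, 113000, 114000, 109000 };
	const double after[]  = { 60000, 91000, 114500, 112000, 105000 };

	report_peak("before", before, 5);
	report_peak("after", after, 5);
	return 0;
}

If the detailed reports show the peak being taken from a different injection
rate before and after, that would explain much of the headline delta without
a real throughput loss at matched load.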
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% |
A few points of concern, but nothing as severe as the multi-node mysql SUT.
The worst regressions are when the number of clients exceeds the number of
CPUs, and at that point any wakeup preemption is potentially harmful.
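For reference, a stripped-down userspace model of what the buddy mechanism
does at each of those wakeups. All names are invented and I'm glossing over
where exactly the sched_feat is checked; the real state is cfs_rq->next in
fair.c, nominated from the wakeup path and preferred at pick time while the
wakee remains eligible:

#include <stdbool.h>
#include <stdio.h>

struct toy_task {
	const char *name;
	bool eligible;			/* stand-in for the EEVDF eligibility check */
};

struct toy_rq {
	struct toy_task *next_buddy;	/* nominated by the wakeup path */
	struct toy_task *eevdf_pick;	/* what the pick would otherwise be */
	bool feat_next_buddy;		/* the NEXT_BUDDY sched_feat */
};

static struct toy_task *toy_pick_next(struct toy_rq *rq)
{
	/* Prefer the buddy only when the feature is on and it is still eligible. */
	if (rq->feat_next_buddy && rq->next_buddy && rq->next_buddy->eligible)
		return rq->next_buddy;
	return rq->eevdf_pick;
}

int main(void)
{
	struct toy_task wakee = { .name = "wakee", .eligible = true };
	struct toy_task other = { .name = "other", .eligible = true };
	struct toy_rq rq = {
		.next_buddy = &wakee,
		.eevdf_pick = &other,
		.feat_next_buddy = true,
	};

	printf("NEXT_BUDDY on:  pick %s\n", toy_pick_next(&rq)->name);
	rq.feat_next_buddy = false;
	printf("NEXT_BUDDY off: pick %s\n", toy_pick_next(&rq)->name);
	return 0;
}

When runnable clients outnumber CPUs, every buddy-favoured pick displaces a
task that was itself making progress, which is roughly why the oversubscribed
points are the ones that hurt.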
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% |
Hackbench is all over the place with a mix of gains and losses, so there is
no clear winner.
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | -0.61% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | 0.57% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
>
And this is the main winner. The results confirm that NEXT_BUDDY is not a
universal win, but the mysql results and the Daytrader results from Madadi
are a concern.
I still don't have access to test machines to investigate this properly
and may not have access for 1-2 weeks. I think the best approach for now
is to disable NEXT_BUDDY by default again until it's determined exactly
why mysql multi-host and daytrader suffered.
Can you test this to be sure please?
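For a quicker check that does not need a rebuild, writing NO_NEXT_BUDDY to
/sys/kernel/debug/sched/features should have the same effect at runtime on a
kernel with the sched debugfs interface enabled (path from memory, so please
double-check it on your config).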
--8<--
sched/fair: Disable scheduler feature NEXT_BUDDY
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win in the absence of a crystal ball
instruction, but the reported regressions are a concern [1][2] even if gains
were also reported. Specifically:
o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses
The mysql case is realistic and a concern. It needs to be confirmed whether
specjbb is simply shifting the point where peak performance is measured, but
it is still a concern. daytrader is considered to be representative of a
real workload.
Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY by default for now until the root causes
are addressed.
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..136a6584be79 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, true)
+SCHED_FEAT(NEXT_BUDDY, false)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various