Message-ID: <kne2v4fhsql5fltb6q5wbsohh224fzs6mov4l324q4wppzpayk@4ayzckp6bs2p>
Date: Thu, 8 Jan 2026 08:50:15 +0000
From: Mel Gorman <mgorman@...hsingularity.net>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Ryan Roberts <ryan.roberts@....com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org, linux-kernel@...r.kernel.org,
Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> On 05.01.26 12:45, Ryan Roberts wrote:
> > On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >> On 02.01.26 13:38, Ryan Roberts wrote:
>
> [...]
>
Sorry for the slow responses. I'm not back from holidays yet and
unfortunately do not have access to test machines right now, so I cannot
revalidate any of the results against 6.19-rc*.
> >>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> >>>> statistically significant regression/improvement, where "statistically
> >>>> significant" means the 95% confidence intervals do not overlap".
> >>
> >> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>
> >> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >
> > Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>
> Well, I assume this would be more valuable.
Agreed, because we need to know whether it is NEXT_BUDDY that is
conceptually at odds with EEVDF in these cases or just the specific
implementation. The comparison that would tell us is between:

  6.18 (baseline)
  6.19-rcN vanilla (new NEXT_BUDDY implementation enabled)
  6.19-rcN with patches 1+2 reverted (NEXT_BUDDY disabled)
  6.19-rcN with only patch 2 reverted (old NEXT_BUDDY implementation enabled)
NEXT_BUDDY was always known to be a tradeoff, but one that depends on the
workload, the architecture and the specific arch implementation. If it
cannot be sanely reconciled then it may be best to remove NEXT_BUDDY from
EEVDF completely, or to disable it by default again for now. I don't think
NEXT_BUDDY as it existed in CFS can be sanely implemented on top of EEVDF,
so it will never be equivalent. Relatedly, I doubt anyone has good data on
NEXT_BUDDY vs !NEXT_BUDDY even on CFS, as it was enabled for so long.
> >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
> >> domain? I guess topology has influence in benchmark numbers here as well.
> >
> > I can't easily enable scheduler debugging right now (which I think is needed to
> > get this info directly?). But that's what I'd expect, yes. lscpu confirms there
> > is a single NUMA node and topology for cpu0 gives this if it helps:
> >
> > /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
> > ./cluster_cpus:ffffffff,ffffffff
> > ./cluster_cpus_list:0-63
> > ./physical_package_id:0
> > ./core_cpus_list:0
> > ./core_siblings:ffffffff,ffffffff
> > ./cluster_id:0
> > ./core_siblings_list:0-63
> > ./package_cpus:ffffffff,ffffffff
> > ./package_cpus_list:0-63
>
> [...]
>
> OK, so single (flat) MC domain with 64 CPUs.
>
That is what the OS sees, but does it reflect reality? For example, does
Graviton3 have multiple caches that are simply not advertised to the OS?
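As a quick sanity check from userspace (a sketch; `list_cache_sharing` is
just an illustrative helper, though the sysfs cacheinfo paths are standard
on Linux), the advertised cache sharing could be dumped with:

```shell
# Sketch: print which CPUs share each cache level reported for cpu0.
# A cache that is physically split but advertised as shared by every
# CPU would show up here as one wide shared_cpu_list. The sysfs root
# is a parameter only so the function can be pointed at a test tree.
list_cache_sharing() {
    local root=${1:-/sys/devices/system/cpu}
    local idx
    for idx in "$root"/cpu0/cache/index*; do
        [ -d "$idx" ] || continue
        printf '%s L%s: %s\n' "$(cat "$idx/type")" \
            "$(cat "$idx/level")" "$(cat "$idx/shared_cpu_list")"
    done
}

list_cache_sharing
```

On a flat 64-CPU LLC this would print something like "Unified L3: 0-63"
for the last-level entry, which of course only confirms what the firmware
advertises, not the physical layout.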
> >> There was also a lot of improvement on schbench (wakeup latency) on
> >> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
> >> patches. I guess you haven't seen those on Grav3?
> >>
> >
> > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
> > revert-next-buddy. The means have moved a bit but there are only a couple of
> > cases that we consider statistically significant (marked (R)egression /
> > (I)mprovement):
> >
> > +----------------------------+------------------------------------------------------+-------------+-------------------+
> > | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
> > +============================+======================================================+=============+===================+
> > | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
> > | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
> > | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> > | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
> > | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
> > | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
> > | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
> > | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
> > | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
> > | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
> > | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
> > | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
> > | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
> > | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
> > | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> > | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
> > | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
> > | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
> > | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
> > | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
> > | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
> > | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
> > | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
> > | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
> > | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
> > | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
> > | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
> > | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
> > | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
> > | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
> > | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
> > | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
> > | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
> > | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
> > | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
> > | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
> > +----------------------------+------------------------------------------------------+-------------+-------------------+
> >
> > I could get the results for 6.18 if useful, but I think what I have probably
> > shows enough of the picture: This patch has not impacted schbench much on
> > this HW.
>
> I see. IMHO, task scheduler tests are all about putting the right amount
> of stress onto the system, not too little and not too much.
>
> I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system.
Agreed. It's not the full picture but it's a valuable part.
> Not sure which parameter set Mel was using on his 2 socket machine. And
> I still assume he tested w/o (base) against with these 2 patches.
>
He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
is fine.
> The other test Mel was using is this modified dbench4 (prefers
> throughput (less preemption)). Not sure if this is part of the MmTests
> suite?
>
It is. The modifications are not extensive. By default, dbench reports
overall throughput over time, which masks the actual throughput at a point
in time. The new metric tracks the time taken to process "loadfiles" over
time, which is more sensible to analyse. Other metrics, such as loadfiles
processed per client, could easily be extracted but are not at the moment,
as dbench itself is not designed for measuring fairness of forward
progress as such.
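To illustrate why the cumulative number masks behaviour at a point in
time, here is a minimal sketch (not the actual MMTests dbench parser) that
turns cumulative loadfile completion timestamps into per-loadfile
processing times:

```shell
# Sketch: given one cumulative completion timestamp per line, emit
# the time each loadfile took, i.e. the difference between adjacent
# timestamps. Illustrative only, not the MMTests implementation.
per_loadfile_times() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

printf '0.0\n2.5\n4.0\n9.0\n' | per_loadfile_times
# prints:
# 2.5
# 1.5
# 5
```

An average over the whole run would hide that the last loadfile took
twice as long as the first two combined; the per-interval view does not.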
> It would be nice to be able to run the same tests on different machines
> (with a parameter set adapted to the number of CPUs), so we have only
> the arch and the topology as variables). But there is definitely more
> variety (e.g. used filesystem, etc) ... so this is not trivial.
>
From a topology perspective it is fairly trivial though. For example,
MMTests has a schbench configuration that runs one message thread per NUMA
node communicating with nr_cpus/nr_nodes workers to evaluate placement. A
similar configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are
packed properly, or nr_llcs could also be used fairly trivially. You're
right that once filesystems are involved it all gets more interesting.
ext4 and xfs use kernel threads differently (jbd vs kworkers), the
underlying storage is a factor, workset size vs RAM impacts dirty
throttling and reclaim, and NUMA sizes all play a part. dbench is useful
in this regard because, while it interacts with the filesystem and wakeups
between userspace and kernel threads get exercised, the amount of IO is
relatively small.
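As a sketch of how such a configuration could derive its parameters from
the topology (the helper name is illustrative, not an actual MMTests
option):

```shell
# Sketch: derive schbench message/worker thread counts from topology.
# schbench_params is a hypothetical helper, not an MMTests option:
# one message thread per NUMA node, nr_cpus/nr_nodes workers each.
schbench_params() {
    local nr_cpus=$1 nr_nodes=$2
    local workers=$(( nr_cpus / nr_nodes ))
    echo "-m $nr_nodes -t $workers"
}

# The Graviton3 under discussion: 64 CPUs, single NUMA node
schbench_params 64 1    # prints "-m 1 -t 64"
```

The packed variant would simply use $(( nr_cpus / nr_nodes - 1 )) workers
instead, so one CPU per node stays free.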
Let's start with getting figures for 6.18, new-NEXT, old-NEXT and no-NEXT
side-by-side. Ideally I'd do the same across a range of x86-64 machines
but I can't start that yet.
--
Mel Gorman
SUSE Labs