Message-ID: <051285c7-56c0-455d-ab3e-9959edb5d13f@arm.com>
Date: Wed, 7 Jan 2026 16:30:09 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Ryan Roberts <ryan.roberts@....com>,
 Mel Gorman <mgorman@...hsingularity.net>,
 "Peter Zijlstra (Intel)" <peterz@...radead.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
 Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
 EEVDF goals

On 05.01.26 12:45, Ryan Roberts wrote:
> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>> On 02.01.26 13:38, Ryan Roberts wrote:

[...]

>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>>>> statistically significant regression/improvement, where "statistically 
>>>> significant" means the 95% confidence intervals do not overlap".
>>
>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>
>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> 
> Yes, that's correct; patch 1 is still present. I could revert that as well and rerun if useful?

Well, I assume this would be more valuable. Before this patch set (e.g.
v6.18), NEXT_BUDDY was disabled, and that is what people are running.

Now (>= v6.19-rc1) we have NEXT_BUDDY=true (1/2) and 'NEXT_BUDDY aligned
to EEVDF' (2/2). This is what people will run when they switch to v6.19
later.

But patch 2/2 changes more than the 'if (sched_feat(NEXT_BUDDY) ...)'
condition. So testing 'w/o 2/2' vs. 'w/ 2/2', each with
'NEXT_BUDDY=false', could be helpful as well.
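
For the 'NEXT_BUDDY=false' runs a rebuild shouldn't even be needed;
sched features can be flipped at runtime via debugfs. A minimal sketch,
assuming CONFIG_SCHED_DEBUG=y and debugfs mounted at /sys/kernel/debug
(before v5.13 the file is /sys/kernel/debug/sched_features):

  # Show the current scheduler feature flags (NEXT_BUDDY vs. NO_NEXT_BUDDY):
  cat /sys/kernel/debug/sched/features

  # Disable NEXT_BUDDY for the 'NEXT_BUDDY=false' leg of the comparison:
  echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features

  # Re-enable it afterwards:
  echo NEXT_BUDDY > /sys/kernel/debug/sched/features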

>> ---
>>
>> Mel mentioned that he tested on a 2-socket machine. So I guess something
>> like my Intel Xeon Silver 4314:
>>
>> cpu0 0 0
>> domain0 SMT 00000001,00000001
>> domain1 MC 55555555,55555555
>> domain2 NUMA ffffffff,ffffffff
>>
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10
>>
>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>> domain? I guess topology has influence in benchmark numbers here as well.
> 
> I can't easily enable scheduler debugging right now (which I think is needed to 
> get this info directly?). But that's what I'd expect, yes. lscpu confirms there 
> is a single NUMA node, and the topology for cpu0 gives this, if it helps:
> 
> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
> ./cluster_cpus:ffffffff,ffffffff
> ./cluster_cpus_list:0-63
> ./physical_package_id:0
> ./core_cpus_list:0
> ./core_siblings:ffffffff,ffffffff
> ./cluster_id:0
> ./core_siblings_list:0-63
> ./package_cpus:ffffffff,ffffffff
> ./package_cpus_list:0-63

[...]

OK, so single (flat) MC domain with 64 CPUs.
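
FWIW, with CONFIG_SCHED_DEBUG enabled the domain hierarchy can be read
directly from debugfs (a sketch, using the v5.13+ debugfs layout), and
the NUMA side is visible without any scheduler debug options:

  # Per-CPU sched-domain names and flags (needs CONFIG_SCHED_DEBUG):
  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name
  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags

  # NUMA nodes and distances, independent of scheduler debugging:
  numactl --hardware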

>> There was also a lot of improvement on schbench (wakeup latency) on
>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>> patches. I guess you haven't seen those on Grav3?
>>
> 
> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for 
> revert-next-buddy. The means have moved a bit but there are only a couple of 
> cases that we consider statistically significant (marked (R)egression / 
> (I)mprovement):
> 
> +----------------------------+------------------------------------------------------+-------------+-------------------+
> | Benchmark                  | Result Class                                         |  6-19-0-rc1 | revert-next-buddy |
> +============================+======================================================+=============+===================+
> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     1263.97 |            -6.43% |
> |                            | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.28% |
> |                            | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
> |                            | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     6433.07 |           -10.99% |
> |                            | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.39% |
> |                            | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        4.17 |       (R) -16.67% |
> |                            | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |     1458.33 |            -1.57% |
> |                            | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |   813056.00 |            15.46% |
> |                            | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    14240.00 |            -5.97% |
> |                            | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      434.22 |             3.21% |
> |                            | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 11354112.00 |             2.92% |
> |                            | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63168.00 |            -2.87% |
> |                            | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     2828.63 |             2.58% |
> |                            | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |             0.00% |
> |                            | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
> |                            | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     3182.15 |             5.18% |
> |                            | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   116266.67 |             8.22% |
> |                            | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |     6186.67 |        (R) -5.34% |
> |                            | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      749.20 |             2.91% |
> |                            | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |  3702784.00 |        (I) 13.76% |
> |                            | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    33514.67 |             0.24% |
> |                            | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      392.23 |             3.42% |
> |                            | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 16695296.00 |         (I) 5.82% |
> |                            | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   120618.67 |            -3.22% |
> |                            | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     5951.15 |             5.02% |
> |                            | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15157.33 |             0.42% |
> |                            | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.67 |            -4.35% |
> |                            | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     1510.23 |            -1.38% |
> |                            | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   802816.00 |            13.73% |
> |                            | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |    14890.67 |           -10.44% |
> |                            | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      458.87 |             4.60% |
> |                            | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 11348650.67 |         (I) 2.67% |
> |                            | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63445.33 |        (R) -5.48% |
> |                            | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      541.33 |             2.65% |
> |                            | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 36743850.67 |        (I) 10.95% |
> |                            | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   211370.67 |            -1.94% |
> +----------------------------+------------------------------------------------------+-------------+-------------------+
> 
> I could get the results for 6.18 if useful, but I think what I have probably 
> shows enough of the picture: This patch has not impacted schbench much on 
> this HW.

I see. IMHO, task scheduler tests are all about putting the right amount
of stress onto the system, not too little and not too much.

I guess these `-m 64 -t 4` tests should be fine for a 64-CPU system. I'm
not sure which parameter set Mel was using on his 2-socket machine. And
I still assume he tested w/o the patches (base) against w/ these 2 patches.

The other test Mel was using is this modified dbench4 (which prefers
throughput, i.e. less preemption). I'm not sure whether this is part of
the MMTests suite?
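
If it is the MMTests variant, something along these lines should
reproduce it (a sketch; the exact config file name is an assumption,
MMTests ships several dbench4 configs under configs/):

  git clone https://github.com/gormanm/mmtests.git
  cd mmtests
  ls configs/ | grep dbench4        # pick the matching dbench4 config
  ./run-mmtests.sh --config configs/config-io-dbench4-async dbench4-run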

It would be nice to be able to run the same tests on different machines
(with a parameter set adapted to the number of CPUs), so that we only
have the arch and the topology as variables. But there is definitely
more variety (e.g. the filesystem used, etc.), so this is not trivial.
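
As a starting point, the schbench parameter set could be derived from
the CPU count rather than hard-coded. A rough sketch (assuming -m/-t/-r
keep their documented meanings of messenger threads, workers per
messenger and runtime in seconds; the -s semantics depend on the
schbench version):

  #!/bin/sh
  # Sweep messenger threads relative to the machine size, keeping the
  # same worker-thread sweep as in the table above.
  NCPUS=$(nproc)
  for m in $((NCPUS / 4)) $((NCPUS / 2)) "$NCPUS"; do
      for t in 1 4 16 64; do
          schbench -m "$m" -t "$t" -r 10 -s 1000
      done
  done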

[...]
