Message-ID: <6c137b30-d5f7-4dd9-9699-d7e22c174285@arm.com>
Date: Thu, 8 Jan 2026 13:15:18 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Mel Gorman <mgorman@...hsingularity.net>,
Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org,
linux-kernel@...r.kernel.org, Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
On 08/01/2026 08:50, Mel Gorman wrote:
> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>> On 05.01.26 12:45, Ryan Roberts wrote:
>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>
>> [...]
>>
>
> Sorry for the slow responses. I'm not fully back from holidays yet and
> unfortunately do not have access to test machines right now, so I cannot
> revalidate any of the results against 6.19-rc*.
No problem, thanks for getting back to me!
>
>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>>>>> statistically significant regression/improvement, where "statistically
>>>>>> significant" means the 95% confidence intervals do not overlap".
>>>>
>>>> You mentioned that you reverted patch 2/2 'sched/fair: Reimplement
>>>> NEXT_BUDDY to align with EEVDF goals'.
>>>>
>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>
>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>
>> Well, I assume this would be more valuable.
>
> Agreed, because we need to know whether it is NEXT_BUDDY that is conceptually
> at odds with EEVDF in these cases, or just this specific implementation. The
> useful comparison is between:
>
> 6.18A (baseline)
> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
tomorrow. Then we can take it from there.
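(As an aside, if rebuilding for the revert turns out to be a pain, I think
configuration 3 can be approximated at runtime by clearing the feature bit,
assuming debugfs is mounted and sched debug is built in - though toggling at
runtime may not be exactly equivalent to never enabling it:

  # list features; disabled ones are shown with a NO_ prefix
  cat /sys/kernel/debug/sched/features
  # disable NEXT_BUDDY for the next benchmark run
  echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features

I'll do the proper revert for the numbers, though.)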
I appreciate your time on this!
Thanks,
Ryan
>
> It was known that NEXT_BUDDY was always a tradeoff but one that is workload,
> architecture and specific arch implementation dependent. If it cannot be
> sanely reconciled then it may be best to completely remove NEXT_BUDDY from
> EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
> it existed in CFS can be sanely implemented against EEVDF so it'll never
> be equivalent. Related to that, I doubt anyone has good data on NEXT_BUDDY
> vs !NEXT_BUDDY even on CFS, as it was disabled for so long.
>
>>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>>> domain? I guess topology has an influence on the benchmark numbers here
>>>> as well.
>>>
>>> I can't easily enable scheduler debugging right now (which I think is needed to
>>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
>>> is a single NUMA node, and the topology for cpu0 gives this, if it helps:
>>>
>>> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
>>> ./cluster_cpus:ffffffff,ffffffff
>>> ./cluster_cpus_list:0-63
>>> ./physical_package_id:0
>>> ./core_cpus_list:0
>>> ./core_siblings:ffffffff,ffffffff
>>> ./cluster_id:0
>>> ./core_siblings_list:0-63
>>> ./package_cpus:ffffffff,ffffffff
>>> ./package_cpus_list:0-63
>>
>> [...]
>>
>> OK, so single (flat) MC domain with 64 CPUs.
>>
>
> That is what the OS sees, but does it reflect reality? E.g. does Graviton3
> have multiple caches that are simply not advertised to the OS?
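(Good question; I don't know offhand. A minimal way to check what the OS is
actually told, at least - the second part assumes sched debug, which I don't
currently have enabled:

  # cache levels and which CPUs share each cache, as advertised to the OS
  grep . /sys/devices/system/cpu/cpu0/cache/index*/level \
         /sys/devices/system/cpu/cpu0/cache/index*/shared_cpu_list
  # the scheduler domains actually built (needs sched debug / debugfs)
  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name

Whether the hardware has further cache structure that simply isn't advertised
is something I'd have to find out separately.)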
>
>>>> There was also a lot of improvement on schbench (wakeup latency) on
>>>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>>>> patches. I guess you haven't seen those on Grav3?
>>>>
>>>
>>> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
>>> revert-next-buddy. The means have moved a bit but there are only a couple of
>>> cases that we consider statistically significant (marked (R)egression /
>>> (I)mprovement):
>>>
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>> | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
>>> +============================+======================================================+=============+===================+
>>> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
>>> | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
>>> | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
>>> | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
>>> | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
>>> | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
>>> | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
>>> | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
>>> | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
>>> | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
>>> | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
>>> | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
>>> | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
>>> | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
>>> | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
>>> | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
>>> | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
>>> | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
>>> | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
>>> | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
>>> | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
>>> | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
>>> | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
>>> | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
>>> | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
>>> | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
>>> | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
>>> | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
>>> | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
>>> | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
>>> | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
>>> | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
>>> | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
>>> | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>>
>>> I could get the results for 6.18 if useful, but I think what I have probably
>>> shows enough of the picture: This patch has not impacted schbench much on
>>> this HW.
>>
>> I see. IMHO, task scheduler tests are all about putting the right amount
>> of stress onto the system, not too little and not too much.
>>
>> I guess these ~ `-m 64 -t 4` tests should be fine for a 64-CPU system.
>
> Agreed. It's not the full picture but it's a valuable part.
>
>> Not sure which parameter set Mel was using on his 2-socket machine. And
>> I still assume he tested w/o (base) against w/ these 2 patches applied.
>>
>
> He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
> is ok.
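(Noted - for future runs I can scale the message threads to the CPU count
rather than hard-coding them, reusing the flags from the table above; a rough
sketch, assuming the flags behave the same when driven this way:

  NCPUS=$(nproc)
  schbench -m "$NCPUS" -t 4 -r 10 -s 1000

That should keep the "-m nr_cpus -t 4" shape comparable across machines.)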
>
>> The other test Mel was using is the modified dbench4 (which prefers
>> throughput, i.e. less preemption). Not sure if this is part of the MmTests
>> suite?
>>
>
> It is. The modifications are not extensive. dbench by default reports overall
> throughput over time which masks actual throughput at a point in time. The
> new metric tracks the time taken to process "loadfiles" over time, which is
> more sensible to analyse. Other metrics, such as loadfiles processed per
> client, could easily be extracted but aren't at the moment, as dbench itself
> is not designed for measuring fairness of forward progress as such.
>
>> It would be nice to be able to run the same tests on different machines
>> (with a parameter set adapted to the number of CPUs), so we have only
>> the arch and the topology as variables. But there is definitely more
>> variety (e.g. the filesystem used, etc.) ... so this is not trivial.
>>
>
> From a topology perspective it is fairly trivial though. For example,
> MMTESTS has a schbench configuration that runs one message thread per NUMA
> node communicating with nr_cpus/nr_nodes worker threads to evaluate
> placement. A similar configuration could use (nr_cpus/nr_nodes)-1 workers to
> ensure tasks are packed properly, or nr_llcs could also be used fairly
> trivially. You're right that
> once filesystems are involved then it all gets more interesting. ext4 and
> xfs use kernel threads differently (jbd vs kworkers), the underlying storage
> is a factor, workset size vs RAM impacts dirty throttling and reclaim, and
> NUMA sizes all play a part. dbench is useful in this regard because, while
> it interacts with the filesystem and wakeups between userspace and kernel
> threads get exercised, the amount of IO is relatively small.
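(If it would help to keep runs comparable across machines, I'm happy to drive
both workloads through MMTESTS on this box; a rough sketch - the config file
names below are illustrative guesses on my part, not necessarily real ones:

  cd mmtests
  ./run-mmtests.sh --config configs/config-scheduler-schbench grav3-6.19-rc1
  ./run-mmtests.sh --config configs/config-io-dbench4-async   grav3-6.19-rc1

Point me at the exact configs you used and I'll match them.)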
>
> Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
> no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
> machines but I can't start that yet.
>