Message-ID: <6c137b30-d5f7-4dd9-9699-d7e22c174285@arm.com>
Date: Thu, 8 Jan 2026 13:15:18 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Mel Gorman <mgorman@...hsingularity.net>,
Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org,
linux-kernel@...r.kernel.org, Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
On 08/01/2026 08:50, Mel Gorman wrote:
> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>> On 05.01.26 12:45, Ryan Roberts wrote:
>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>
>> [...]
>>
>
> Sorry for the slow responses. I'm not fully back from holidays yet and
> unfortunately do not have access to test machines right now, so I cannot
> revalidate any of the results against 6.19-rc*.
No problem, thanks for getting back to me!
>
>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>>>>> statistically significant regression/improvement, where "statistically
>>>>>> significant" means the 95% confidence intervals do not overlap".
>>>>
>>>> You mentioned that you reverted patch 2/2 'sched/fair: Reimplement
>>>> NEXT_BUDDY to align with EEVDF goals'.
>>>>
>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>
>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>
>> Well, I assume this would be more valuable.
>
> Agreed, because we need to know whether it is NEXT_BUDDY that is conceptually
> at odds with EEVDF in these cases, or just this specific implementation. The
> useful comparison is between:
>
> 6.18A (baseline)
> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
tomorrow. Then we can take it from there.
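(As an aside, if rebuilding for the revert turns out to be a pain, I think
configuration 3 can be approximated at runtime by clearing the feature bit,
assuming debugfs is mounted and sched debug is built in - though toggling at
runtime may not be exactly equivalent to never enabling it:

  # list features; disabled ones are shown with a NO_ prefix
  cat /sys/kernel/debug/sched/features
  # disable NEXT_BUDDY for the next benchmark run
  echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features

I'll do the proper revert for the numbers, though.)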
I appreciate your time on this!
Thanks,
Ryan
>
> It was known that NEXT_BUDDY was always a tradeoff but one that is workload,
> architecture and specific arch implementation dependent. If it cannot be
> sanely reconciled then it may be best to completely remove NEXT_BUDDY from
> EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
> it existed in CFS can be sanely implemented against EEVDF so it'll never
> be equivalent. Related to that, I doubt anyone has good data on NEXT_BUDDY
> vs !NEXT_BUDDY even on CFS, as it was disabled for so long.
>
>>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>>> domain? I guess topology has an influence on the benchmark numbers here
>>>> as well.
>>>
>>> I can't easily enable scheduler debugging right now (which I think is needed to
>>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
>>> is a single NUMA node, and the topology for cpu0 gives this, if it helps:
>>>
>>> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
>>> ./cluster_cpus:ffffffff,ffffffff
>>> ./cluster_cpus_list:0-63
>>> ./physical_package_id:0
>>> ./core_cpus_list:0
>>> ./core_siblings:ffffffff,ffffffff
>>> ./cluster_id:0
>>> ./core_siblings_list:0-63
>>> ./package_cpus:ffffffff,ffffffff
>>> ./package_cpus_list:0-63
>>
>> [...]
>>
>> OK, so single (flat) MC domain with 64 CPUs.
>>
>
> That is what the OS sees, but does it reflect reality? E.g. does Graviton3
> have multiple caches that are simply not advertised to the OS?
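(Good question; I don't know offhand. A minimal way to check what the OS is
actually told, at least - the second part assumes sched debug, which I don't
currently have enabled:

  # cache levels and which CPUs share each cache, as advertised to the OS
  grep . /sys/devices/system/cpu/cpu0/cache/index*/level \
         /sys/devices/system/cpu/cpu0/cache/index*/shared_cpu_list
  # the scheduler domains actually built (needs sched debug / debugfs)
  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name

Whether the hardware has further cache structure that simply isn't advertised
is something I'd have to find out separately.)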
>
>>>> There was also a lot of improvement on schbench (wakeup latency) on
>>>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>>>> patches. I guess you haven't seen those on Grav3?
>>>>
>>>
>>> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
>>> revert-next-buddy. The means have moved a bit but there are only a couple of
>>> cases that we consider statistically significant (marked (R)egression /
>>> (I)mprovement):
>>>
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>> | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
>>> +============================+======================================================+=============+===================+
>>> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
>>> | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
>>> | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
>>> | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
>>> | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
>>> | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
>>> | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
>>> | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
>>> | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
>>> | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
>>> | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
>>> | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
>>> | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
>>> | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
>>> | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
>>> | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
>>> | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
>>> | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
>>> | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
>>> | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
>>> | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
>>> | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
>>> | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
>>> | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
>>> | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
>>> | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
>>> | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
>>> | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
>>> | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
>>> | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
>>> | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
>>> | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
>>> | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
>>> | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>>
>>> I could get the results for 6.18 if useful, but I think what I have probably
>>> shows enough of the picture: This patch has not impacted schbench much on
>>> this HW.
>>
>> I see. IMHO, task scheduler tests are all about putting the right amount
>> of stress onto the system, not too little and not too much.
>>
>> I guess these ~ `-m 64 -t 4` tests should be fine for a 64-CPU system.
>
> Agreed. It's not the full picture but it's a valuable part.
>
>> Not sure which parameter set Mel was using on his 2-socket machine. And
>> I still assume he tested w/o (base) against w/ these 2 patches applied.
>>
>
> He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
> is ok.
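(Noted - for future runs I can scale the message threads to the CPU count
rather than hard-coding them, reusing the flags from the table above; a rough
sketch, assuming the flags behave the same when driven this way:

  NCPUS=$(nproc)
  schbench -m "$NCPUS" -t 4 -r 10 -s 1000

That should keep the "-m nr_cpus -t 4" shape comparable across machines.)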
>
>> The other test Mel was using is the modified dbench4 (which prefers
>> throughput, i.e. less preemption). Not sure if this is part of the MmTests
>> suite?
>>
>
> It is. The modifications are not extensive. dbench by default reports overall
> throughput over time which masks actual throughput at a point in time. The
> new metric tracks the time taken to process "loadfiles" over time, which is
> more sensible to analyse. Other metrics, such as loadfiles processed per
> client, could easily be extracted but aren't at the moment, as dbench itself
> is not designed for measuring fairness of forward progress as such.
>
>> It would be nice to be able to run the same tests on different machines
>> (with a parameter set adapted to the number of CPUs), so we have only
>> the arch and the topology as variables. But there is definitely more
>> variety (e.g. the filesystem used, etc.) ... so this is not trivial.
>>
>
> From a topology perspective it is fairly trivial though. For example,
> MMTESTS has a schbench configuration that runs one message thread per NUMA
> node communicating with nr_cpus/nr_nodes worker threads to evaluate
> placement. A similar configuration could use (nr_cpus/nr_nodes)-1 workers to
> ensure tasks are packed properly, or nr_llcs could also be used fairly
> trivially. You're right that
> once filesystems are involved then it all gets more interesting. ext4 and
> xfs use kernel threads differently (jbd vs kworkers), the underlying storage
> is a factor, workset size vs RAM impacts dirty throttling and reclaim, and
> NUMA sizes all play a part. dbench is useful in this regard because, while
> it interacts with the filesystem and wakeups between userspace and kernel
> threads get exercised, the amount of IO is relatively small.
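(If it would help to keep runs comparable across machines, I'm happy to drive
both workloads through MMTESTS on this box; a rough sketch - the config file
names below are illustrative guesses on my part, not necessarily real ones:

  cd mmtests
  ./run-mmtests.sh --config configs/config-scheduler-schbench grav3-6.19-rc1
  ./run-mmtests.sh --config configs/config-io-dbench4-async   grav3-6.19-rc1

Point me at the exact configs you used and I'll match them.)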
>
> Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
> no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
> machines but I can't start that yet.
>