Message-ID: <a33dc104-cfd3-4c29-976b-ea370f45e24d@arm.com>
Date: Fri, 2 Jan 2026 12:38:58 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Mel Gorman <mgorman@...hsingularity.net>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
Hi, I appreciate that I sent this report just before Xmas, so most likely you
haven't had a chance to look at it, but I wanted to bring it back to the top of
your mailbox in case it was missed.
Happy new year!
Thanks,
Ryan
On 22/12/2025 10:57, Ryan Roberts wrote:
> Hi Mel, Peter,
>
> We are building out a kernel performance regression monitoring lab at Arm, and
> I've noticed some fairly large performance regressions in real-world workloads,
> for which bisection has fingered this patch.
>
> We are looking at performance changes between v6.18 and v6.19-rc1, and by
> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
> to move the testing to linux-next over the next couple of quarters, so hopefully
> we will be able to deliver this sort of news before patches are merged in future).
>
> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> statistically significant regression/improvement, where "statistically
> significant" means the 95% confidence intervals do not overlap".
>
> Below is a large-scale mysql workload, running across 2 AWS instances (a
> load generator and the mysql server). We have a partner for whom this is a very
> important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1
> (where the patch is added). By reverting the patch, the regression is not only
> fixed but performance is now nearly 6% better than v6.18:
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% |
> | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
>
> Next are a bunch of benchmarks all running on a single system. specjbb is the
> SPEC Java Business Benchmark. The mysql one is the same as above but this time
> both loadgen and server are on the same system. pgbench is the PostgreSQL
> benchmark.
>
> I'm showing hackbench for completeness, but I don't consider it a high priority
> issue.
>
> Interestingly, nginx improves significantly with the patch.
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% |
> | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% |
> | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% |
> | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% |
> | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% |
> | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% |
> | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% |
> | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% |
> | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% |
> | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% |
> | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% |
> | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% |
> | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% |
> | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% |
> | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% |
> | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% |
> | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% |
> | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% |
> | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% |
> | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% |
> | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% |
> | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% |
> | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% |
> | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% |
> | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% |
> | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% |
> | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% |
> | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% |
> | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% |
> | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% |
> | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% |
> | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% |
> | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% |
> | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% |
> | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% |
> | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% |
> | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% |
> | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% |
> | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% |
> | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% |
> | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% |
> | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% |
> | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% |
> | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% |
> | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% |
> | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% |
> | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% |
> | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% |
> | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% |
> | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% |
> | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% |
> | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% |
> | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% |
> | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% |
> | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
> All of the benchmarks have been run multiple times and I have high confidence in
> the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.
>
> I'm not providing the data, but we also see similar regressions on AmpereOne
> (another arm64 server system). And we have seen a few functional tests (kvm
> selftests) that have started to time out due to this patch slowing things down on
> arm64.
>
> I'm hoping you can advise on the best way to proceed. We have a bigger library
> than what I'm showing here, but the only improvement I see due to this patch is
> in nginx. So based on that, my preference would be to revert the patch upstream
> until the issues can be worked out. I'm guessing the story is quite different
> for x86 though?
>
> Thanks,
> Ryan
>
>
>
> On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Author: Mel Gorman <mgorman@...hsingularity.net>
>> AuthorDate: Wed, 12 Nov 2025 12:25:21
>> Committer: Peter Zijlstra <peterz@...radead.org>
>> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
>>
>> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>>
>> Reimplement NEXT_BUDDY preemption to take into account the deadline and
>> eligibility of the wakee with respect to the waker. In the event
>> multiple buddies could be considered, the one with the earliest deadline
>> is selected.
>>
>> Sync wakeups are treated differently to every other type of wakeup. The
>> WF_SYNC assumption is that the waker promises to sleep in the very near
>> future. This is violated in enough cases that WF_SYNC should be treated
>> as a suggestion instead of a contract. If a waker does go to sleep almost
>> immediately then the delay in wakeup is negligible. In other cases,
>> preemption is throttled based on the accumulated runtime of the waker so
>> there is a chance that some batched wakeups have been issued first.
>>
>> For all other wakeups, preemption happens if the wakee has an earlier
>> deadline than the waker and is eligible to run.
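>>
>> Condensed into a standalone toy model, the policy reads roughly as
>> follows. This is a sketch only, not the patch (see the diff below): it
>> ignores the next-buddy bookkeeping, eligibility checks and slice
>> protection; waker_runtime_ns stands in for rq_clock_task(rq) -
>> se->exec_start, and 500000ns is the usual sysctl_sched_migration_cost
>> default.
>>
>> #include <stdbool.h>
>> #include <stdint.h>
>>
>> enum action { NONE, PICK, RESCHED };
>>
>> struct ent { uint64_t deadline; bool sched_delayed; };
>>
>> static enum action wakeup_action(const struct ent *wakee,
>> 				 const struct ent *waker,
>> 				 bool wf_fork, bool wf_sync,
>> 				 bool wf_rq_selected,
>> 				 uint64_t waker_runtime_ns)
>> {
>> 	uint64_t threshold = 500000;	/* sched_migration_cost default, ns */
>>
>> 	/* No wakee preemption on fork or for delayed-dequeue tasks. */
>> 	if (wf_fork || wakee->sched_delayed)
>> 		return NONE;
>>
>> 	if (wf_sync) {
>> 		/* Wakees stacked on the waker's CPU may preempt sooner. */
>> 		if (wf_rq_selected)
>> 			threshold >>= 2;
>> 		/*
>> 		 * WF_SYNC is a hint, not a contract: require an earlier
>> 		 * deadline and enough waker runtime for any batched
>> 		 * wakeups to have been issued.
>> 		 */
>> 		if (wakee->deadline < waker->deadline &&
>> 		    waker_runtime_ns >= threshold)
>> 			return RESCHED;
>> 		return NONE;
>> 	}
>>
>> 	/* All other wakeups: preempt iff the wakee wins the EEVDF pick. */
>> 	return PICK;
>> }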
>>
>> While many workloads were tested, the two main targets were a modified
>> dbench4 benchmark and hackbench because they are on opposite ends of the
>> spectrum -- one prefers throughput by avoiding preemption and the other
>> relies on preemption.
>>
>> First is the dbench throughput data; it is a poor metric, but it is the
>> default one. The test machine is a 2-socket machine and the
>> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
>> threads. It's important to note that these results are not representative
>> across all machines, especially Zen machines, as different bottlenecks
>> are exposed on different machines and filesystems.
>>
>> dbench4 Throughput (misleading but traditional)
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
>> Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
>> Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
>> Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
>> Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
>> Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
>> Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
>> Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
>> Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
>> Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
>> Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
>>
>> As throughput is misleading, the benchmark is modified to use a short
>> loadfile and report the completion time in milliseconds.
>>
>> dbench4 Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
>> Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
>> Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
>> Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
>> Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
>> Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
>> Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
>> Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
>> Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
>> Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
>> Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
>> Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
>> Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
>> Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
>> Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
>> Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
>> Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
>> Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
>> Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
>> Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
>> Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
>> Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
>>
>> That is still looking good and the variance is reduced quite a bit.
>> Finally, fairness is a concern so the next report tracks how many
>> milliseconds it takes for all clients to complete a workfile. This
>> one is tricky because dbench makes no effort to synchronise clients so
>> the durations at benchmark start time differ substantially from typical
>> runtimes. This problem could be mitigated by warming up the benchmark
>> for a number of minutes but it's a matter of opinion whether that
>> counts as an evasion of inconvenient results.
>>
>> dbench4 All Clients Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
>> Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
>> Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
>> Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
>> Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
>> Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
>> Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
>> Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
>> Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
>> Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
>> Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
>> Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
>> Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
>> Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
>> Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
>> Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
>> Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
>> Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
>> Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
>> Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
>> Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
>> Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
>>
>> This is more of a mixed bag but it at least shows that fairness
>> is not crippled.
>>
>> The hackbench results are more neutral but this is still important.
>> It's possible to boost the dbench figures by a large amount but only by
>> crippling the performance of a workload like hackbench. The WF_SYNC
>> behaviour is important for these workloads, which is why the WF_SYNC
>> changes are not a separate patch.
>>
>> hackbench-process-pipes
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
>> Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
>> Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
>> Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
>> Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
>> Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
>> Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
>> Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
>> Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
>> Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
>> Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
>> Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
>> Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
>> Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
>> Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
>>
>> Processes using pipes are impacted but the variance (not presented) indicates
>> it's close to noise and the results are not always reproducible. If executed
>> across multiple reboots, it may show neutral or small gains so the worst
>> measured results are presented.
>>
>> Hackbench using sockets is more reliably neutral as the wakeup
>> mechanisms are different between sockets and pipes.
>>
>> hackbench-process-sockets
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v2
>> Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
>> Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
>> Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
>> Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
>> Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
>> Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
>> Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
>> Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
>> Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
>> Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
>> Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
>> Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
>> Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
>> Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
>> Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
>>
>> As schbench has been mentioned in numerous bugs recently, the results
>> are interesting. A test case that represents the default schbench
>> behaviour is
>>
>> schbench Wakeup Latency (usec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
>> Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
>> Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
>> Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
>>
>> schbench Requests Per Second (ops/sec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
>> Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
>> Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
>> Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
>>
>> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
>> ---
>> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 130 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 071e07f..c6e5c64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
>> if (cfs_rq->nr_queued == 1)
>> return curr && curr->on_rq ? curr : se;
>>
>> + /*
>> + * Picking the ->next buddy will affect latency but not fairness.
>> + */
>> + if (sched_feat(PICK_BUDDY) &&
>> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> + /* ->next will never be delayed */
>> + WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> + return cfs_rq->next;
>> + }
>> +
>> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> curr = NULL;
>>
>> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> return delta_exec;
>> }
>>
>> +static void set_next_buddy(struct sched_entity *se);
>> +
>> /*
>> * Used by other classes to account runtime.
>> */
>> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>> {
>> struct sched_entity *se;
>>
>> - /*
>> - * Picking the ->next buddy will affect latency but not fairness.
>> - */
>> - if (sched_feat(PICK_BUDDY) &&
>> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> - /* ->next will never be delayed */
>> - WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> - return cfs_rq->next;
>> - }
>> -
>> se = pick_eevdf(cfs_rq);
>> if (se->sched_delayed) {
>> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> hrtick_update(rq);
>> }
>>
>> -static void set_next_buddy(struct sched_entity *se);
>> -
>> /*
>> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>> * failing half-way through and resume the dequeue later.
>> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
>> }
>> }
>>
>> +enum preempt_wakeup_action {
>> + PREEMPT_WAKEUP_NONE, /* No preemption. */
>> + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
>> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
>> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
>> +};
>> +
>> +static inline bool
>> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + /*
>> + * Keep existing buddy if the deadline is sooner than pse.
>> + * The older buddy may be cache cold and completely unrelated
>> + * to the current wakeup but that is unpredictable where as
>> +	 * to the current wakeup but that is unpredictable, whereas
>> + */
>> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
>> + return false;
>> +
>> + set_next_buddy(pse);
>> + return true;
>> +}
>> +
>> +/*
>> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
>> + * strictly enforced because the hint is either misunderstood or
>> + * multiple tasks must be woken up.
>> + */
>> +static inline enum preempt_wakeup_action
>> +preempt_sync(struct rq *rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + u64 threshold, delta;
>> +
>> + /*
>> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even
>> + * though it is likely harmless.
>> + */
>> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
>> +
>> + threshold = sysctl_sched_migration_cost;
>> + delta = rq_clock_task(rq) - se->exec_start;
>> + if ((s64)delta < 0)
>> + delta = 0;
>> +
>> + /*
>> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
>> + * could run on other CPUs. Reduce the threshold before preemption is
>> + * allowed to an arbitrary lower value as it is more likely (but not
>> + * guaranteed) the waker requires the wakee to finish.
>> + */
>> + if (wake_flags & WF_RQ_SELECTED)
>> + threshold >>= 2;
>> +
>> + /*
>> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch
>> + * wakeups to be issued.
>> + */
>> + if (entity_before(pse, se) && delta >= threshold)
>> + return PREEMPT_WAKEUP_RESCHED;
>> +
>> + return PREEMPT_WAKEUP_NONE;
>> +}
>> +
>> /*
>> * Preempt the current task with a newly woken task if needed:
>> */
>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>> {
>> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>> struct task_struct *donor = rq->donor;
>> struct sched_entity *se = &donor->se, *pse = &p->se;
>> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>> int cse_is_idle, pse_is_idle;
>> - bool do_preempt_short = false;
>>
>> if (unlikely(se == pse))
>> return;
>> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> if (task_is_throttled(p))
>> return;
>>
>> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>> - set_next_buddy(pse);
>> - }
>> -
>> /*
>> * We can come here with TIF_NEED_RESCHED already set from new task
>> * wake up path.
>> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * When non-idle entity preempt an idle entity,
>> * don't give idle entity slice protection.
>> */
>> - do_preempt_short = true;
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> goto preempt;
>> }
>>
>> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * If @p has a shorter slice than current and @p is eligible, override
>> * current's slice protection in order to allow preemption.
>> */
>> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
>> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> + goto pick;
>> + }
>>
>> /*
>> + * Ignore wakee preemption on WF_FORK as it is less likely that
>> +	 * there is shared data as exec often follows fork. Do not
>> + * preempt for tasks that are sched_delayed as it would violate
>> + * EEVDF to forcibly queue an ineligible task.
>> + */
>> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
>> + return;
>> +
>> + /*
>> + * If @p potentially is completing work required by current then
>> + * consider preemption.
>> + *
>> + * Reschedule if waker is no longer eligible. */
>> + if (in_task() && !entity_eligible(cfs_rq, se)) {
>> + preempt_action = PREEMPT_WAKEUP_RESCHED;
>> + goto preempt;
>> + }
>> +
>> + /* Prefer picking wakee soon if appropriate. */
>> + if (sched_feat(NEXT_BUDDY) &&
>> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> + /*
>> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> + * buddies are ignored as they may not be relevant to the
>> + * waker and less likely to be cache hot.
>> + */
>> + if (wake_flags & WF_SYNC)
>> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> + }
>> +
>> + switch (preempt_action) {
>> + case PREEMPT_WAKEUP_NONE:
>> + return;
>> + case PREEMPT_WAKEUP_RESCHED:
>> + goto preempt;
>> + case PREEMPT_WAKEUP_SHORT:
>> + fallthrough;
>> + case PREEMPT_WAKEUP_PICK:
>> + break;
>> + }
>> +
>> +pick:
>> + /*
>> * If @p has become the most eligible task, force preemption.
>> */
>> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>> goto preempt;
>>
>> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
>> + if (sched_feat(RUN_TO_PARITY))
>> update_protect_slice(cfs_rq, se);
>>
>> return;
>>
>> preempt:
>> - if (do_preempt_short)
>> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
>> cancel_protect_slice(se);
>>
>> resched_curr_lazy(rq);
>>
>