Message-ID: <a33dc104-cfd3-4c29-976b-ea370f45e24d@arm.com>
Date: Fri, 2 Jan 2026 12:38:58 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Mel Gorman <mgorman@...hsingularity.net>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
Aishwarya TCV <Aishwarya.TCV@....com>
Subject: Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with
EEVDF goals
Hi, I appreciate that I sent this report just before Xmas, so most likely you
haven't had a chance to look at it, but I wanted to bring it back to the top of
your mailbox in case it was missed.
Happy new year!
Thanks,
Ryan
On 22/12/2025 10:57, Ryan Roberts wrote:
> Hi Mel, Peter,
>
> We are building out a kernel performance regression monitoring lab at Arm, and
> I've noticed some fairly large performance regressions in real-world workloads,
> for which bisection has fingered this patch.
>
> We are looking at performance changes between v6.18 and v6.19-rc1, and by
> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
> to move the testing to linux-next over the next couple of quarters, so hopefully
> we will be able to deliver this sort of news before patches are merged in future).
>
> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> statistically significant regression/improvement, where "statistically
> significant" means the 95% confidence intervals do not overlap".
>
> Below is a large-scale mysql workload, running across 2 AWS instances (a
> load generator and the mysql server). We have a partner for whom this is a very
> important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1
> (where the patch is added). By reverting the patch, the regression is not only
> fixed but performance is now nearly 6% better than v6.18:
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% |
> | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
>
> Next are a bunch of benchmarks all running on a single system. specjbb is the
> SPEC Java Business Benchmark. The mysql one is the same as above but this time
> both loadgen and server are on the same system. pgbench is the PostgreSQL
> benchmark.
>
> I'm showing hackbench for completeness, but I don't consider it a high priority
> issue.
>
> Interestingly, nginx improves significantly with the patch.
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% |
> | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% |
> | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% |
> | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% |
> | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% |
> | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% |
> | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% |
> | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% |
> | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% |
> | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% |
> | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% |
> | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% |
> | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% |
> | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% |
> | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% |
> | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% |
> | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% |
> | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% |
> | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% |
> | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% |
> | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% |
> | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% |
> | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% |
> | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% |
> | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% |
> | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% |
> | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% |
> | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% |
> | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% |
> | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% |
> | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% |
> | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% |
> | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% |
> | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% |
> | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% |
> | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% |
> | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% |
> | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% |
> | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% |
> | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% |
> | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% |
> | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% |
> | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% |
> | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% |
> | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% |
> | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% |
> | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% |
> | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% |
> | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% |
> | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% |
> | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% |
> | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% |
> | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% |
> | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% |
> | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
> All of the benchmarks have been run multiple times and I have high confidence in
> the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.
>
> I'm not providing the data, but we also see similar regressions on AmpereOne
> (another arm64 server system). And we have seen a few functional tests (kvm
> selftests) that have started to time out due to this patch slowing things down on
> arm64.
>
> I'm hoping you can advise on the best way to proceed. We have a bigger library
> than what I'm showing here, but the only improvement I see due to this patch is
> in nginx. So based on that, my preference would be to revert the patch upstream
> until the issues can be worked out. I'm guessing the story is quite different
> for x86 though?
>
> Thanks,
> Ryan
>
>
>
> On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Author: Mel Gorman <mgorman@...hsingularity.net>
>> AuthorDate: Wed, 12 Nov 2025 12:25:21
>> Committer: Peter Zijlstra <peterz@...radead.org>
>> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
>>
>> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>>
>> Reimplement NEXT_BUDDY preemption to take into account the deadline and
>> eligibility of the wakee with respect to the waker. In the event
>> multiple buddies could be considered, the one with the earliest deadline
>> is selected.
>>
>> Sync wakeups are treated differently to every other type of wakeup. The
>> WF_SYNC assumption is that the waker promises to sleep in the very near
>> future. This is violated in enough cases that WF_SYNC should be treated
>> as a suggestion instead of a contract. If a waker does go to sleep almost
>> immediately then the delay in wakeup is negligible. In other cases,
>> preemption is throttled based on the accumulated runtime of the waker so
>> there is a chance that some batched wakeups have been issued first.
>>
>> For all other wakeups, preemption happens if the wakee has an earlier
>> deadline than the waker and is eligible to run.
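>>
>> Condensed into a standalone toy model, the policy reads roughly as
>> follows. This is a sketch only, not the patch (see the diff below): it
>> ignores the next-buddy bookkeeping, eligibility checks and slice
>> protection; waker_runtime_ns stands in for rq_clock_task(rq) -
>> se->exec_start, and 500000ns is the usual sysctl_sched_migration_cost
>> default.
>>
>> #include <stdbool.h>
>> #include <stdint.h>
>>
>> enum action { NONE, PICK, RESCHED };
>>
>> struct ent { uint64_t deadline; bool sched_delayed; };
>>
>> static enum action wakeup_action(const struct ent *wakee,
>> 				 const struct ent *waker,
>> 				 bool wf_fork, bool wf_sync,
>> 				 bool wf_rq_selected,
>> 				 uint64_t waker_runtime_ns)
>> {
>> 	uint64_t threshold = 500000;	/* sched_migration_cost default, ns */
>>
>> 	/* No wakee preemption on fork or for delayed-dequeue tasks. */
>> 	if (wf_fork || wakee->sched_delayed)
>> 		return NONE;
>>
>> 	if (wf_sync) {
>> 		/* Wakees stacked on the waker's CPU may preempt sooner. */
>> 		if (wf_rq_selected)
>> 			threshold >>= 2;
>> 		/*
>> 		 * WF_SYNC is a hint, not a contract: require an earlier
>> 		 * deadline and enough waker runtime for any batched
>> 		 * wakeups to have been issued.
>> 		 */
>> 		if (wakee->deadline < waker->deadline &&
>> 		    waker_runtime_ns >= threshold)
>> 			return RESCHED;
>> 		return NONE;
>> 	}
>>
>> 	/* All other wakeups: preempt iff the wakee wins the EEVDF pick. */
>> 	return PICK;
>> }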
>>
>> While many workloads were tested, the two main targets were a modified
>> dbench4 benchmark and hackbench because they are on opposite ends of the
>> spectrum -- one prefers throughput by avoiding preemption and the other
>> relies on preemption.
>>
>> First is the dbench throughput data; it is a poor metric, but it is the
>> default one. The test machine is a 2-socket machine and the
>> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
>> threads. It's important to note that these results are not representative
>> across all machines, especially Zen machines, as different bottlenecks
>> are exposed on different machines and filesystems.
>>
>> dbench4 Throughput (misleading but traditional)
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
>> Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
>> Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
>> Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
>> Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
>> Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
>> Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
>> Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
>> Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
>> Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
>> Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
>>
>> As throughput is misleading, the benchmark is modified to use a short
>> loadfile and report the completion time in milliseconds.
>>
>> dbench4 Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
>> Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
>> Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
>> Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
>> Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
>> Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
>> Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
>> Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
>> Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
>> Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
>> Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
>> Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
>> Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
>> Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
>> Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
>> Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
>> Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
>> Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
>> Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
>> Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
>> Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
>> Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
>>
>> That is still looking good and the variance is reduced quite a bit.
>> Finally, fairness is a concern so the next report tracks how many
>> milliseconds it takes for all clients to complete a workfile. This
>> one is tricky because dbench makes no effort to synchronise clients so
>> the durations at benchmark start time differ substantially from typical
>> runtimes. This problem could be mitigated by warming up the benchmark
>> for a number of minutes but it's a matter of opinion whether that
>> counts as an evasion of inconvenient results.
>>
>> dbench4 All Clients Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
>> Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
>> Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
>> Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
>> Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
>> Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
>> Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
>> Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
>> Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
>> Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
>> Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
>> Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
>> Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
>> Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
>> Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
>> Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
>> Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
>> Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
>> Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
>> Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
>> Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
>> Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
>>
>> This is more of a mixed bag but it at least shows that fairness
>> is not crippled.
>>
>> The hackbench results are more neutral but this is still important.
>> It's possible to boost the dbench figures by a large amount but only by
>> crippling the performance of a workload like hackbench. The WF_SYNC
>> behaviour is important for these workloads, which is why the WF_SYNC
>> changes are not a separate patch.
>>
>> hackbench-process-pipes
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
>> Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
>> Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
>> Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
>> Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
>> Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
>> Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
>> Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
>> Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
>> Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
>> Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
>> Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
>> Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
>> Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
>> Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
>>
>> Processes using pipes are impacted but the variance (not presented) indicates
>> it's close to noise and the results are not always reproducible. If executed
>> across multiple reboots, it may show neutral or small gains so the worst
>> measured results are presented.
>>
>> Hackbench using sockets is more reliably neutral as the wakeup
>> mechanisms are different between sockets and pipes.
>>
>> hackbench-process-sockets
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v2
>> Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
>> Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
>> Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
>> Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
>> Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
>> Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
>> Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
>> Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
>> Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
>> Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
>> Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
>> Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
>> Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
>> Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
>> Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
>>
>> As schbench has been mentioned in numerous bugs recently, the results
>> are interesting. A test case that represents the default schbench
>> behaviour is
>>
>> schbench Wakeup Latency (usec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
>> Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
>> Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
>> Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
>>
>> schbench Requests Per Second (ops/sec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
>> Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
>> Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
>> Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
>>
>> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
>> ---
>> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 130 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 071e07f..c6e5c64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
>> if (cfs_rq->nr_queued == 1)
>> return curr && curr->on_rq ? curr : se;
>>
>> + /*
>> + * Picking the ->next buddy will affect latency but not fairness.
>> + */
>> + if (sched_feat(PICK_BUDDY) &&
>> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> + /* ->next will never be delayed */
>> + WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> + return cfs_rq->next;
>> + }
>> +
>> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> curr = NULL;
>>
>> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> return delta_exec;
>> }
>>
>> +static void set_next_buddy(struct sched_entity *se);
>> +
>> /*
>> * Used by other classes to account runtime.
>> */
>> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>> {
>> struct sched_entity *se;
>>
>> - /*
>> - * Picking the ->next buddy will affect latency but not fairness.
>> - */
>> - if (sched_feat(PICK_BUDDY) &&
>> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> - /* ->next will never be delayed */
>> - WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> - return cfs_rq->next;
>> - }
>> -
>> se = pick_eevdf(cfs_rq);
>> if (se->sched_delayed) {
>> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> hrtick_update(rq);
>> }
>>
>> -static void set_next_buddy(struct sched_entity *se);
>> -
>> /*
>> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>> * failing half-way through and resume the dequeue later.
>> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
>> }
>> }
>>
>> +enum preempt_wakeup_action {
>> + PREEMPT_WAKEUP_NONE, /* No preemption. */
>> + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
>> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
>> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
>> +};
>> +
>> +static inline bool
>> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + /*
>> + * Keep existing buddy if the deadline is sooner than pse.
>> + * The older buddy may be cache cold and completely unrelated
>> + * to the current wakeup but that is unpredictable where as
>> +	 * to the current wakeup but that is unpredictable, whereas
>> + */
>> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
>> + return false;
>> +
>> + set_next_buddy(pse);
>> + return true;
>> +}
>> +
>> +/*
>> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
>> + * strictly enforced because the hint is either misunderstood or
>> + * multiple tasks must be woken up.
>> + */
>> +static inline enum preempt_wakeup_action
>> +preempt_sync(struct rq *rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + u64 threshold, delta;
>> +
>> + /*
>> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even
>> + * though it is likely harmless.
>> + */
>> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
>> +
>> + threshold = sysctl_sched_migration_cost;
>> + delta = rq_clock_task(rq) - se->exec_start;
>> + if ((s64)delta < 0)
>> + delta = 0;
>> +
>> + /*
>> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
>> + * could run on other CPUs. Reduce the threshold before preemption is
>> + * allowed to an arbitrary lower value as it is more likely (but not
>> + * guaranteed) the waker requires the wakee to finish.
>> + */
>> + if (wake_flags & WF_RQ_SELECTED)
>> + threshold >>= 2;
>> +
>> + /*
>> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch
>> + * wakeups to be issued.
>> + */
>> + if (entity_before(pse, se) && delta >= threshold)
>> + return PREEMPT_WAKEUP_RESCHED;
>> +
>> + return PREEMPT_WAKEUP_NONE;
>> +}
>> +
>> /*
>> * Preempt the current task with a newly woken task if needed:
>> */
>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>> {
>> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>> struct task_struct *donor = rq->donor;
>> struct sched_entity *se = &donor->se, *pse = &p->se;
>> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>> int cse_is_idle, pse_is_idle;
>> - bool do_preempt_short = false;
>>
>> if (unlikely(se == pse))
>> return;
>> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> if (task_is_throttled(p))
>> return;
>>
>> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>> - set_next_buddy(pse);
>> - }
>> -
>> /*
>> * We can come here with TIF_NEED_RESCHED already set from new task
>> * wake up path.
>> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * When non-idle entity preempt an idle entity,
>> * don't give idle entity slice protection.
>> */
>> - do_preempt_short = true;
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> goto preempt;
>> }
>>
>> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * If @p has a shorter slice than current and @p is eligible, override
>> * current's slice protection in order to allow preemption.
>> */
>> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
>> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> + goto pick;
>> + }
>>
>> /*
>> + * Ignore wakee preemption on WF_FORK as it is less likely that
>> +	 * there is shared data as exec often follows fork. Do not
>> + * preempt for tasks that are sched_delayed as it would violate
>> + * EEVDF to forcibly queue an ineligible task.
>> + */
>> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
>> + return;
>> +
>> + /*
>> + * If @p potentially is completing work required by current then
>> + * consider preemption.
>> + *
>> + * Reschedule if waker is no longer eligible. */
>> + if (in_task() && !entity_eligible(cfs_rq, se)) {
>> + preempt_action = PREEMPT_WAKEUP_RESCHED;
>> + goto preempt;
>> + }
>> +
>> + /* Prefer picking wakee soon if appropriate. */
>> + if (sched_feat(NEXT_BUDDY) &&
>> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> + /*
>> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> + * buddies are ignored as they may not be relevant to the
>> + * waker and less likely to be cache hot.
>> + */
>> + if (wake_flags & WF_SYNC)
>> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> + }
>> +
>> + switch (preempt_action) {
>> + case PREEMPT_WAKEUP_NONE:
>> + return;
>> + case PREEMPT_WAKEUP_RESCHED:
>> + goto preempt;
>> + case PREEMPT_WAKEUP_SHORT:
>> + fallthrough;
>> + case PREEMPT_WAKEUP_PICK:
>> + break;
>> + }
>> +
>> +pick:
>> + /*
>> * If @p has become the most eligible task, force preemption.
>> */
>> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>> goto preempt;
>>
>> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
>> + if (sched_feat(RUN_TO_PARITY))
>> update_protect_slice(cfs_rq, se);
>>
>> return;
>>
>> preempt:
>> - if (do_preempt_short)
>> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
>> cancel_protect_slice(se);
>>
>> resched_curr_lazy(rq);
>>
>