Message-ID: <8368868e-48aa-4a90-95d1-1be4de9879e8@linux.ibm.com>
Date: Fri, 5 Dec 2025 11:00:18 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ilya Leoshkevich <iii@...ux.ibm.com>, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
maddy@...ux.ibm.com, srikar@...ux.ibm.com, gregkh@...uxfoundation.org,
pbonzini@...hat.com, seanjc@...gle.com, kprateek.nayak@....com,
vschneid@...hat.com, huschle@...ux.ibm.com, rostedt@...dmis.org,
dietmar.eggemann@....com, christophe.leroy@...roup.eu,
linux-s390@...r.kernel.org
Subject: Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU
preemption
On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
>> The detailed problem statement and some of the implementation choices
>> were discussed earlier[1].
>>
>> [1]:
>> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which will be used for the LPC2025
>> discussion on this topic. Feel free to provide your suggestions;
>> hoping for a solution that works for different architectures and
>> their use cases.
>>
>> All the existing alternatives, such as CPU hotplug and creating
>> isolated partitions, break user affinity. Since the number of CPUs to
>> use changes depending on the steal time, it is not driven by the user.
>> Hence it would be wrong to break the affinity. This series allows a
>> task that is pinned only to paravirt CPUs to continue running there.
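
To illustrate the affinity rule above: conceptually the check is along
these lines (untested sketch, not the exact patch code; the helper name
is made up, cpu_paravirt_mask is what patch 2 introduces):

/*
 * Sketch: a task whose allowed mask lies entirely within
 * cpu_paravirt_mask keeps running on paravirt CPUs; otherwise
 * paravirt CPUs are filtered out of placement decisions.
 */
static inline bool task_pinned_to_paravirt(struct task_struct *p)
{
        return cpumask_subset(p->cpus_ptr, cpu_paravirt_mask);
}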
>>
>> Changes compared to v3[1]:
>>
>> - Introduced computation of steal time in powerpc code.
>> - Derive the number of CPUs to use and mark the remaining as paravirt
>> based on steal values.
>> - Provide debugfs knobs to alter how steal time values are used.
>> - Removed static key check for paravirt CPUs (Yury)
>> - Removed preempt_disable/enable while calling stopper (Prateek)
>> - Made select_idle_sibling and friends aware of paravirt CPUs.
>> - Removed 3 unused schedstat fields and introduced 2 related to
>> paravirt handling.
>> - Handled the nohz_full case by enabling the tick on such a CPU when
>> there is a CFS/RT task on it.
>> - Updated the helper patch to override arch behaviour for easier
>> debugging during development.
>> - Kept
>>
>> Changes compared to v4[2]:
>> - The last two patches were sent out separately instead of with the
>> series. That created confusion. Those two patches are debug patches
>> one can use to check functionality across architectures. Sorry about
>> that.
>> - Use DEVICE_ATTR_RW instead (Greg)
>> - Made it a PATCH since the arch specific handling completes the
>> functionality.
>>
>> [2]:
>> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>>
>> TODO:
>>
>> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>> week. Didn't want to hold the series till then.
>>
>> - The logic for choosing which CPUs to mark as paravirt is very
>> simple and doesn't work when vCPUs aren't spread out uniformly across
>> NUMA nodes. Ideally the count would be spliced based on how many CPUs
>> each NUMA node has. It is quite tricky to do, especially since the
>> cpumask can be on the stack too, given NR_CPUS can be 8192 and
>> nr_possible_nodes 32. Haven't got my head into solving it yet. Maybe
>> there is an easier way. (rough sketch below)
>>
>> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
>> specific)
>>
>> - Userspace tools awareness such as irqbalance.
>>
>> - Delve into the design of a hint from the Hypervisor (HW hint), i.e.
>> the host informs the guest which/how many CPUs it has to use at this
>> moment. This interface should work across archs with each arch doing
>> its specific handling.
>>
>> - Determine the default values for steal time related knobs
>> empirically and document them.
>>
>> - Need to check safety against CPU hotplug, especially in
>> process_steal. (see sketch below)
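
On the NUMA splicing TODO above, the proportional split could look
roughly like this (untested sketch; the helper name and the wiring are
made up, and it uses an allocated mask to avoid a large on-stack
cpumask):

static int mark_paravirt_per_node(unsigned int nr_cpus_to_use)
{
        cpumask_var_t paravirt;
        unsigned int total = num_online_cpus();
        int node, cpu;

        if (!zalloc_cpumask_var(&paravirt, GFP_KERNEL))
                return -ENOMEM;

        for_each_online_node(node) {
                const struct cpumask *nodemask = cpumask_of_node(node);
                /* keep a share of this node proportional to its size */
                unsigned int keep = nr_cpus_to_use *
                                    cpumask_weight(nodemask) / total;

                for_each_cpu(cpu, nodemask) {
                        if (keep) {
                                keep--;
                                continue;
                        }
                        /* the rest of the node becomes paravirt */
                        cpumask_set_cpu(cpu, paravirt);
                }
        }
        /* hand "paravirt" over to the update path here */
        free_cpumask_var(paravirt);
        return 0;
}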
>>
>>
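
And on the hotplug TODO, the usual pattern would be to bracket the walk
with cpus_read_lock()/cpus_read_unlock(), something like this (sketch,
assuming process_steal walks the online CPUs):

static void process_steal(void)
{
        int cpu;

        cpus_read_lock();       /* hold off CPU hotplug during the walk */
        for_each_online_cpu(cpu) {
                /* read per-CPU steal time, decide paravirt marking */
        }
        cpus_read_unlock();
}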
>> Applies cleanly on tip/master:
>> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
>>
>>
>> Thanks to Srikar for providing the initial code around powerpc steal
>> time handling. Thanks to all who went through the series and provided
>> reviews.
>>
>> PS: I haven't found a better name. Please suggest if you have any.
>>
>> Shrikanth Hegde (17):
>> sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>> cpumask: Introduce cpu_paravirt_mask
>> sched/core: Dont allow to use CPU marked as paravirt
>> sched/debug: Remove unused schedstats
>> sched/fair: Add paravirt movements for proc sched file
>> sched/fair: Pass current cpu in select_idle_sibling
>> sched/fair: Don't consider paravirt CPUs for wakeup and load
>> balance
>> sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
>> task
>> sched/core: Add support for nohz_full CPUs
>> sched/core: Push current task from paravirt CPU
>> sysfs: Add paravirt CPU file
>> powerpc: method to initialize ec and vp cores
>> powerpc: enable/disable paravirt CPUs based on steal time
>> powerpc: process steal values at fixed intervals
>> powerpc: add debugfs file for controlling handling on steal values
>> sysfs: Provide write method for paravirt
>> sysfs: disable arch handling if paravirt file being written
>>
>> .../ABI/testing/sysfs-devices-system-cpu | 9 +
>> Documentation/scheduler/sched-arch.rst | 37 +++
>> arch/powerpc/include/asm/smp.h | 1 +
>> arch/powerpc/kernel/smp.c | 1 +
>> arch/powerpc/platforms/pseries/lpar.c | 223 ++++++++++++++++++
>> arch/powerpc/platforms/pseries/pseries.h | 1 +
>> drivers/base/cpu.c | 59 +++++
>> include/linux/cpumask.h | 20 ++
>> include/linux/sched.h | 9 +-
>> kernel/sched/core.c | 106 ++++++++-
>> kernel/sched/debug.c | 5 +-
>> kernel/sched/fair.c | 42 +++-
>> kernel/sched/rt.c | 11 +-
>> kernel/sched/sched.h | 9 +
>> 14 files changed, 519 insertions(+), 14 deletions(-)
>
> The capability to temporarily exclude CPUs from scheduling might be
> beneficial for s390x, where users often run Linux under a proprietary
> hypervisor called PR/SM, with high overcommit. In these circumstances
> virtual CPUs may not be scheduled by the hypervisor for a very long
> time.
>
> Today we have an upstream feature called "Hiperdispatch", which
> determines that this is about to happen and uses Capacity Aware
> Scheduling to prevent processes from being placed on the affected CPUs.
> However, at least when used for this purpose, Capacity Aware Scheduling
> is best effort and fails to move tasks away from the affected CPUs
> under high load.
>
> Therefore I have decided to smoke test this series.
>
> For the purposes of smoke testing, I set up a number of KVM virtual
> machines and start the same benchmark inside each one. Then I collect
> and compare the aggregate throughput numbers. I have not done testing
> with PR/SM yet, but I plan to do this and report back. I also have not
> tested this with VMs that are not 100% utilized yet.
>
Best results would come when this works as a HW hint from the hypervisor.
> Benchmark parameters:
>
> $ sysbench cpu run --threads=$(nproc) --time=10
> $ schbench -r 10 --json --no-locking
> $ hackbench --groups 10 --process --loops 5000
> $ pgbench -h $WORKDIR --client=$(nproc) --time=10
>
> Figures:
>
> s390x (16 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 16 4 60.58%
> pgbench 16 4 50.01%
> hackbench 8 8 46.18%
> hackbench 4 8 43.54%
> hackbench 2 16 43.23%
> hackbench 12 4 42.92%
> hackbench 8 4 35.53%
> hackbench 4 16 30.98%
> pgbench 12 4 18.41%
> hackbench 2 24 7.32%
> pgbench 8 4 6.84%
> pgbench 2 24 3.38%
> pgbench 2 16 3.02%
> pgbench 4 16 2.08%
> hackbench 2 32 1.46%
> pgbench 4 8 1.30%
> schbench 2 16 0.72%
> schbench 4 8 -0.09%
> schbench 4 4 -0.20%
> schbench 8 8 -0.41%
> sysbench 8 4 -0.46%
> sysbench 4 8 -0.53%
> schbench 8 4 -0.65%
> sysbench 2 16 -0.76%
> schbench 2 8 -0.77%
> sysbench 8 8 -1.72%
> schbench 2 24 -1.98%
> schbench 12 4 -2.03%
> sysbench 12 4 -2.13%
> pgbench 2 32 -3.15%
> sysbench 16 4 -3.17%
> schbench 16 4 -3.50%
> sysbench 2 8 -4.01%
> pgbench 8 8 -4.10%
> schbench 4 16 -5.93%
> sysbench 4 4 -5.94%
> pgbench 2 4 -6.40%
> hackbench 2 8 -10.04%
> hackbench 4 4 -10.91%
> pgbench 4 4 -11.05%
> sysbench 2 24 -13.07%
> sysbench 4 16 -13.59%
> hackbench 2 4 -13.96%
> pgbench 2 8 -16.16%
> schbench 2 4 -24.14%
> schbench 2 32 -24.25%
> sysbench 2 4 -24.98%
> sysbench 2 32 -32.84%
>
> x86_64 (32 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 4 32 87.02%
> hackbench 8 16 48.45%
> hackbench 4 24 47.95%
> hackbench 2 8 42.74%
> hackbench 2 32 34.90%
> pgbench 16 8 27.87%
> pgbench 12 8 25.17%
> hackbench 8 8 24.92%
> hackbench 16 8 22.41%
> hackbench 16 4 20.83%
> pgbench 8 16 20.40%
> hackbench 12 8 20.37%
> hackbench 4 16 20.36%
> pgbench 16 4 16.60%
> pgbench 8 8 14.92%
> hackbench 12 4 14.49%
> pgbench 4 32 9.49%
> pgbench 2 32 7.26%
> hackbench 2 24 6.54%
> pgbench 4 4 4.67%
> pgbench 8 4 3.24%
> pgbench 12 4 2.66%
> hackbench 4 8 2.53%
> pgbench 4 8 1.96%
> hackbench 2 16 1.93%
> schbench 4 32 1.24%
> pgbench 2 8 0.82%
> schbench 4 4 0.69%
> schbench 2 32 0.44%
> schbench 2 16 0.25%
> schbench 12 8 -0.02%
> sysbench 2 4 -0.02%
> schbench 4 24 -0.12%
> sysbench 2 16 -0.17%
> schbench 12 4 -0.18%
> schbench 2 4 -0.19%
> sysbench 4 8 -0.23%
> schbench 8 4 -0.24%
> sysbench 2 8 -0.24%
> schbench 4 8 -0.28%
> sysbench 8 4 -0.30%
> schbench 4 16 -0.37%
> schbench 2 24 -0.39%
> schbench 8 16 -0.49%
> schbench 2 8 -0.67%
> pgbench 4 16 -0.68%
> schbench 8 8 -0.83%
> sysbench 4 4 -0.92%
> schbench 16 4 -0.94%
> sysbench 12 4 -0.98%
> sysbench 8 16 -1.52%
> sysbench 16 4 -1.57%
> pgbench 2 4 -1.62%
> sysbench 12 8 -1.69%
> schbench 16 8 -1.97%
> sysbench 8 8 -2.08%
> hackbench 8 4 -2.11%
> pgbench 4 24 -3.20%
> pgbench 2 24 -3.35%
> sysbench 2 24 -3.81%
> pgbench 2 16 -4.55%
> sysbench 4 16 -5.10%
> sysbench 16 8 -6.56%
> sysbench 2 32 -8.24%
> sysbench 4 32 -13.54%
> sysbench 4 24 -13.62%
> hackbench 2 4 -15.40%
> hackbench 4 4 -17.71%
>
> There are some huge wins, especially for hackbench, which corresponds
> to Shrikanth's findings. There are some significant degradations too,
> which I plan to debug. This may simply have to do with the simplistic
> heuristic I am using for testing [1].
>
Thank you very much for running these numbers!
> sysbench, for example, is not supposed to benefit from this series,
> because it is not affected by overcommit. However, it definitely should
> not degrade by 30%. Interestingly enough, this happens only with
> certain combinations of VM and CPU counts, and this is reproducible.
>
Is the host bare metal? In those cases the cpufreq governor ramping up
or down might play a role. (speculating)
> Initially I saw degradations as bad as -80% with schbench. It turned
> out this was caused by the userspace per-CPU locking it implements;
> turning it off made the degradation go away. To me this looks like
> something synthetic and not something used by real-world applications,
> but please correct me if I am wrong - then this will have to be
> resolved.
>
That's nice to hear. I was concerned about the schbench RPS numbers; now
I am a bit relieved. Is this with the schbench -L option? I ran with it,
and the regression I was seeing earlier is gone now.
>
> One note regarding the PARAVIRT Kconfig gating: s390x does not select
> PARAVIRT today. For example, we determine steal time based on CPU
> timers and clocks, not hypervisor hints. For now I had to add dummy
> paravirt headers to test this series. But I would appreciate it if the
> Kconfig gating were removed.
>
Keeping the PARAVIRT checks is probably the right thing. I will wait to
see if anyone objects.
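
For reference, the gating in question is of this form (illustrative
sketch only; the accessor name is made up, the point being that it
degrades to a no-op without CONFIG_PARAVIRT):

#ifdef CONFIG_PARAVIRT
extern struct cpumask __cpu_paravirt_mask;
#define cpu_paravirt(cpu) cpumask_test_cpu((cpu), &__cpu_paravirt_mask)
#else
#define cpu_paravirt(cpu) false
#endif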
> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous "cpu-
> avoid" one was perfect, but it was much better.
>
>
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
Will look into it. One thing to be careful about is the CPU numbers.