Message-ID: <8368868e-48aa-4a90-95d1-1be4de9879e8@linux.ibm.com>
Date: Fri, 5 Dec 2025 11:00:18 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ilya Leoshkevich <iii@...ux.ibm.com>, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
maddy@...ux.ibm.com, srikar@...ux.ibm.com, gregkh@...uxfoundation.org,
pbonzini@...hat.com, seanjc@...gle.com, kprateek.nayak@....com,
vschneid@...hat.com, huschle@...ux.ibm.com, rostedt@...dmis.org,
dietmar.eggemann@....com, christophe.leroy@...roup.eu,
linux-s390@...r.kernel.org
Subject: Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU
preemption
On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
>> The detailed problem statement and some of the implementation choices
>> were discussed earlier[1].
>>
>> [1]:
>> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which will be used for the LPC2025
>> discussion on this topic. Feel free to provide your suggestions;
>> hoping for a solution that works for different architectures and
>> their use cases.
>>
>> All the existing alternatives, such as CPU hotplug and creating
>> isolated partitions, break user affinity. Since the number of CPUs to
>> use changes depending on the steal time, it is not driven by the user.
>> Hence it would be wrong to break the affinity. This series allows a
>> task that is pinned only to paravirt CPUs to continue running there.
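
To illustrate the affinity rule above: conceptually the check is along
these lines (untested sketch, not the exact patch code; the helper name
is made up, cpu_paravirt_mask is what patch 2 introduces):

/*
 * Sketch: a task whose allowed mask lies entirely within
 * cpu_paravirt_mask keeps running on paravirt CPUs; otherwise
 * paravirt CPUs are filtered out of placement decisions.
 */
static inline bool task_pinned_to_paravirt(struct task_struct *p)
{
        return cpumask_subset(p->cpus_ptr, cpu_paravirt_mask);
}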
>>
>> Changes compared to v3[1]:
>>
>> - Introduced computation of steal time in powerpc code.
>> - Derive the number of CPUs to use and mark the remaining as paravirt
>> based on steal values.
>> - Provide debugfs knobs to alter how steal time values are used.
>> - Removed static key check for paravirt CPUs (Yury)
>> - Removed preempt_disable/enable while calling stopper (Prateek)
>> - Made select_idle_sibling and friends aware of paravirt CPUs.
>> - Removed 3 unused schedstat fields and introduced 2 related to
>> paravirt handling.
>> - Handled the nohz_full case by enabling the tick on such a CPU when
>> there is a CFS/RT task on it.
>> - Updated the helper patch to override arch behaviour for easier
>> debugging during development.
>> - Kept
>>
>> Changes compared to v4[2]:
>> - The last two patches were sent out separately instead of with the
>> series. That created confusion. Those two patches are debug patches
>> one can use to check functionality across architectures. Sorry about
>> that.
>> - Use DEVICE_ATTR_RW instead (Greg)
>> - Made it a PATCH since the arch specific handling completes the
>> functionality.
>>
>> [2]:
>> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>>
>> TODO:
>>
>> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>> week. Didn't want to hold the series till then.
>>
>> - The logic for choosing which CPUs to mark as paravirt is very
>> simple and doesn't work when vCPUs aren't spread out uniformly across
>> NUMA nodes. Ideally the count would be spliced based on how many CPUs
>> each NUMA node has. It is quite tricky to do, especially since the
>> cpumask can be on the stack too, given NR_CPUS can be 8192 and
>> nr_possible_nodes 32. Haven't got my head into solving it yet. Maybe
>> there is an easier way. (rough sketch below)
>>
>> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
>> specific)
>>
>> - Userspace tools awareness such as irqbalance.
>>
>> - Delve into the design of a hint from the Hypervisor (HW hint), i.e.
>> the host informs the guest which/how many CPUs it has to use at this
>> moment. This interface should work across archs with each arch doing
>> its specific handling.
>>
>> - Determine the default values for steal time related knobs
>> empirically and document them.
>>
>> - Need to check safety against CPU hotplug, especially in
>> process_steal. (see sketch below)
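
On the NUMA splicing TODO above, the proportional split could look
roughly like this (untested sketch; the helper name and the wiring are
made up, and it uses an allocated mask to avoid a large on-stack
cpumask):

static int mark_paravirt_per_node(unsigned int nr_cpus_to_use)
{
        cpumask_var_t paravirt;
        unsigned int total = num_online_cpus();
        int node, cpu;

        if (!zalloc_cpumask_var(&paravirt, GFP_KERNEL))
                return -ENOMEM;

        for_each_online_node(node) {
                const struct cpumask *nodemask = cpumask_of_node(node);
                /* keep a share of this node proportional to its size */
                unsigned int keep = nr_cpus_to_use *
                                    cpumask_weight(nodemask) / total;

                for_each_cpu(cpu, nodemask) {
                        if (keep) {
                                keep--;
                                continue;
                        }
                        /* the rest of the node becomes paravirt */
                        cpumask_set_cpu(cpu, paravirt);
                }
        }
        /* hand "paravirt" over to the update path here */
        free_cpumask_var(paravirt);
        return 0;
}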
>>
>>
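
And on the hotplug TODO, the usual pattern would be to bracket the walk
with cpus_read_lock()/cpus_read_unlock(), something like this (sketch,
assuming process_steal walks the online CPUs):

static void process_steal(void)
{
        int cpu;

        cpus_read_lock();       /* hold off CPU hotplug during the walk */
        for_each_online_cpu(cpu) {
                /* read per-CPU steal time, decide paravirt marking */
        }
        cpus_read_unlock();
}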
>> Applies cleanly on tip/master:
>> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
>>
>>
>> Thanks to Srikar for providing the initial code around powerpc steal
>> time handling. Thanks to all who went through the series and provided
>> reviews.
>>
>> PS: I haven't found a better name. Please suggest if you have any.
>>
>> Shrikanth Hegde (17):
>> sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>> cpumask: Introduce cpu_paravirt_mask
>> sched/core: Dont allow to use CPU marked as paravirt
>> sched/debug: Remove unused schedstats
>> sched/fair: Add paravirt movements for proc sched file
>> sched/fair: Pass current cpu in select_idle_sibling
>> sched/fair: Don't consider paravirt CPUs for wakeup and load
>> balance
>> sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
>> task
>> sched/core: Add support for nohz_full CPUs
>> sched/core: Push current task from paravirt CPU
>> sysfs: Add paravirt CPU file
>> powerpc: method to initialize ec and vp cores
>> powerpc: enable/disable paravirt CPUs based on steal time
>> powerpc: process steal values at fixed intervals
>> powerpc: add debugfs file for controlling handling on steal values
>> sysfs: Provide write method for paravirt
>> sysfs: disable arch handling if paravirt file being written
>>
>> .../ABI/testing/sysfs-devices-system-cpu | 9 +
>> Documentation/scheduler/sched-arch.rst | 37 +++
>> arch/powerpc/include/asm/smp.h | 1 +
>> arch/powerpc/kernel/smp.c | 1 +
>> arch/powerpc/platforms/pseries/lpar.c | 223 ++++++++++++++++++
>> arch/powerpc/platforms/pseries/pseries.h | 1 +
>> drivers/base/cpu.c | 59 +++++
>> include/linux/cpumask.h | 20 ++
>> include/linux/sched.h | 9 +-
>> kernel/sched/core.c | 106 ++++++++-
>> kernel/sched/debug.c | 5 +-
>> kernel/sched/fair.c | 42 +++-
>> kernel/sched/rt.c | 11 +-
>> kernel/sched/sched.h | 9 +
>> 14 files changed, 519 insertions(+), 14 deletions(-)
>
> The capability to temporarily exclude CPUs from scheduling might be
> beneficial for s390x, where users often run Linux under a proprietary
> hypervisor called PR/SM, with high overcommit. In these circumstances
> virtual CPUs may not be scheduled by the hypervisor for a very long
> time.
>
> Today we have an upstream feature called "Hiperdispatch", which
> determines that this is about to happen and uses Capacity Aware
> Scheduling to prevent processes from being placed on the affected CPUs.
> However, at least when used for this purpose, Capacity Aware Scheduling
> is best effort and fails to move tasks away from the affected CPUs
> under high load.
>
> Therefore I have decided to smoke test this series.
>
> For the purposes of smoke testing, I set up a number of KVM virtual
> machines and start the same benchmark inside each one. Then I collect
> and compare the aggregate throughput numbers. I have not done testing
> with PR/SM yet, but I plan to do this and report back. I also have not
> tested this with VMs that are not 100% utilized yet.
>
Best results would come when this works as a HW hint from the hypervisor.
> Benchmark parameters:
>
> $ sysbench cpu run --threads=$(nproc) --time=10
> $ schbench -r 10 --json --no-locking
> $ hackbench --groups 10 --process --loops 5000
> $ pgbench -h $WORKDIR --client=$(nproc) --time=10
>
> Figures:
>
> s390x (16 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 16 4 60.58%
> pgbench 16 4 50.01%
> hackbench 8 8 46.18%
> hackbench 4 8 43.54%
> hackbench 2 16 43.23%
> hackbench 12 4 42.92%
> hackbench 8 4 35.53%
> hackbench 4 16 30.98%
> pgbench 12 4 18.41%
> hackbench 2 24 7.32%
> pgbench 8 4 6.84%
> pgbench 2 24 3.38%
> pgbench 2 16 3.02%
> pgbench 4 16 2.08%
> hackbench 2 32 1.46%
> pgbench 4 8 1.30%
> schbench 2 16 0.72%
> schbench 4 8 -0.09%
> schbench 4 4 -0.20%
> schbench 8 8 -0.41%
> sysbench 8 4 -0.46%
> sysbench 4 8 -0.53%
> schbench 8 4 -0.65%
> sysbench 2 16 -0.76%
> schbench 2 8 -0.77%
> sysbench 8 8 -1.72%
> schbench 2 24 -1.98%
> schbench 12 4 -2.03%
> sysbench 12 4 -2.13%
> pgbench 2 32 -3.15%
> sysbench 16 4 -3.17%
> schbench 16 4 -3.50%
> sysbench 2 8 -4.01%
> pgbench 8 8 -4.10%
> schbench 4 16 -5.93%
> sysbench 4 4 -5.94%
> pgbench 2 4 -6.40%
> hackbench 2 8 -10.04%
> hackbench 4 4 -10.91%
> pgbench 4 4 -11.05%
> sysbench 2 24 -13.07%
> sysbench 4 16 -13.59%
> hackbench 2 4 -13.96%
> pgbench 2 8 -16.16%
> schbench 2 4 -24.14%
> schbench 2 32 -24.25%
> sysbench 2 4 -24.98%
> sysbench 2 32 -32.84%
>
> x86_64 (32 host CPUs):
>
> Benchmark #VMs #CPUs/VM ΔRPS (%)
> ----------- ------ ---------- ----------
> hackbench 4 32 87.02%
> hackbench 8 16 48.45%
> hackbench 4 24 47.95%
> hackbench 2 8 42.74%
> hackbench 2 32 34.90%
> pgbench 16 8 27.87%
> pgbench 12 8 25.17%
> hackbench 8 8 24.92%
> hackbench 16 8 22.41%
> hackbench 16 4 20.83%
> pgbench 8 16 20.40%
> hackbench 12 8 20.37%
> hackbench 4 16 20.36%
> pgbench 16 4 16.60%
> pgbench 8 8 14.92%
> hackbench 12 4 14.49%
> pgbench 4 32 9.49%
> pgbench 2 32 7.26%
> hackbench 2 24 6.54%
> pgbench 4 4 4.67%
> pgbench 8 4 3.24%
> pgbench 12 4 2.66%
> hackbench 4 8 2.53%
> pgbench 4 8 1.96%
> hackbench 2 16 1.93%
> schbench 4 32 1.24%
> pgbench 2 8 0.82%
> schbench 4 4 0.69%
> schbench 2 32 0.44%
> schbench 2 16 0.25%
> schbench 12 8 -0.02%
> sysbench 2 4 -0.02%
> schbench 4 24 -0.12%
> sysbench 2 16 -0.17%
> schbench 12 4 -0.18%
> schbench 2 4 -0.19%
> sysbench 4 8 -0.23%
> schbench 8 4 -0.24%
> sysbench 2 8 -0.24%
> schbench 4 8 -0.28%
> sysbench 8 4 -0.30%
> schbench 4 16 -0.37%
> schbench 2 24 -0.39%
> schbench 8 16 -0.49%
> schbench 2 8 -0.67%
> pgbench 4 16 -0.68%
> schbench 8 8 -0.83%
> sysbench 4 4 -0.92%
> schbench 16 4 -0.94%
> sysbench 12 4 -0.98%
> sysbench 8 16 -1.52%
> sysbench 16 4 -1.57%
> pgbench 2 4 -1.62%
> sysbench 12 8 -1.69%
> schbench 16 8 -1.97%
> sysbench 8 8 -2.08%
> hackbench 8 4 -2.11%
> pgbench 4 24 -3.20%
> pgbench 2 24 -3.35%
> sysbench 2 24 -3.81%
> pgbench 2 16 -4.55%
> sysbench 4 16 -5.10%
> sysbench 16 8 -6.56%
> sysbench 2 32 -8.24%
> sysbench 4 32 -13.54%
> sysbench 4 24 -13.62%
> hackbench 2 4 -15.40%
> hackbench 4 4 -17.71%
>
> There are some huge wins, especially for hackbench, which corresponds
> to Shrikanth's findings. There are some significant degradations too,
> which I plan to debug. This may simply have to do with the simplistic
> heuristic I am using for testing [1].
>
Thank you very much for running these numbers!
> sysbench, for example, is not supposed to benefit from this series,
> because it is not affected by overcommit. However, it definitely should
> not degrade by 30%. Interestingly enough, this happens only with
> certain combinations of VM and CPU counts, and this is reproducible.
>
Is the host bare metal? In those cases the cpufreq governor ramping up
or down might play a role. (speculating)
> Initially I saw degradations as bad as -80% with schbench. It turned
> out this was caused by the userspace per-CPU locking it implements;
> turning it off made the degradation go away. To me this looks like
> something synthetic and not something used by real-world applications,
> but please correct me if I am wrong - then this will have to be
> resolved.
>
That's nice to hear. I was concerned about the schbench RPS numbers; now
I am a bit relieved. Is this with the schbench -L option? I ran with it,
and the regression I was seeing earlier is gone now.
>
> One note regarding the PARAVIRT Kconfig gating: s390x does not select
> PARAVIRT today. For example, we determine steal time based on CPU
> timers and clocks, not hypervisor hints. For now I had to add dummy
> paravirt headers to test this series. But I would appreciate it if the
> Kconfig gating were removed.
>
Keeping the PARAVIRT checks is probably the right thing. I will wait to
see if anyone objects.
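
For reference, the gating in question is of this form (illustrative
sketch only; the accessor name is made up, the point being that it
degrades to a no-op without CONFIG_PARAVIRT):

#ifdef CONFIG_PARAVIRT
extern struct cpumask __cpu_paravirt_mask;
#define cpu_paravirt(cpu) cpumask_test_cpu((cpu), &__cpu_paravirt_mask)
#else
#define cpu_paravirt(cpu) false
#endif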
> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous "cpu-
> avoid" one was perfect, but it was much better.
>
>
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
Will look into it. One thing to be careful about is the CPU numbers.