Message-ID: <20251204175405.1511340-1-srikar@linux.ibm.com>
Date: Thu, 4 Dec 2025 23:23:48 +0530
From: Srikar Dronamraju <srikar@...ux.ibm.com>
To: linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
Peter Zijlstra <peterz@...radead.org>
Cc: Ben Segall <bsegall@...gle.com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...nel.org>, Juri Lelli <juri.lelli@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Mel Gorman <mgorman@...e.de>, Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
Srikar Dronamraju <srikar@...ux.ibm.com>,
Steven Rostedt <rostedt@...dmis.org>,
Swapnil Sapkal <swapnil.sapkal@....com>,
Thomas Huth <thuth@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
virtualization@...ts.linux.dev, Yicong Yang <yangyicong@...ilicon.com>,
Ilya Leoshkevich <iii@...ux.ibm.com>
Subject: [PATCH 00/17] Steal time based dynamic CPU resource management

VMs or shared LPARs provide flexibility and better, more efficient use of
system resources. To achieve this, most of these setups (VMs or LPARs) have
a guaranteed or entitled share of resources. However, they are allotted more
resources than that, so that a VM can use the unused share of idle VMs.
Hence most of these VMs are configured to be overcommitted, i.e. each VM can
exceed its guaranteed share of resources. Here we are mostly looking at
CPUs/cores as the resource. The other option is pinning, which does provide
flexibility but not better use of system resources.

However, each VM thinks that it has access to all of its allotted resources,
so it spreads its workload across as many CPUs/cores as possible. This leads
to resource contention and hurts performance. Hence the goal of better
system utilization is not actually met.

To overcome this problem, a hint could be provided to the VMs so that the
Linux scheduler knows how many CPUs/cores should be used. In this series,
steal time is used as the hint that tells the Linux scheduler how many and
which CPUs/cores to use. Typically, if the resources are over-utilized by
one or more of the VMs, the steal time will spike. If the resources are
underutilized, the steal time will be low. Currently this series implements
steal-based dynamic CPU resource management on PowerVM shared LPARs only.
However, since steal is a fairly generic VM attribute, this can be extended
to any architecture that has some form of steal accounting.

If a better hinting mechanism/strategy appears in the future, the
infrastructure could be modified to work with it.
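
To make "steal time as a hint" concrete, here is a minimal userspace sketch
(not the in-kernel PowerVM accounting this series uses) that estimates the
steal percentage between two samples of the aggregate "cpu" line of
/proc/stat. The function names and the sample lines are made up for
illustration; only the field order of /proc/stat is taken from proc(5).

```python
# Userspace sketch only: estimate steal% between two samples of the
# aggregate "cpu" line of /proc/stat. This is NOT how the series computes
# steal in the kernel; names and sample values are hypothetical.


def parse_cpu_line(line):
    """Return the first eight counters of a /proc/stat "cpu" line as ints.

    Field order (proc(5)): user nice system idle iowait irq softirq steal.
    guest and guest_nice are skipped because the kernel already includes
    them in user and nice.
    """
    fields = line.split()
    if not fields or not fields[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    if len(fields) < 9:
        raise ValueError("no steal field in: %r" % line)
    return [int(v) for v in fields[1:9]]


def steal_percent(before, after):
    """Percentage of the elapsed CPU time that was accounted as steal."""
    old, new = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [n - o for o, n in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


# Hypothetical samples, invented for illustration only.
SAMPLE_T0 = "cpu  1000 0 500 8000 100 0 50 350 0 0"
SAMPLE_T1 = "cpu  1400 0 700 8500 100 0 50 650 0 0"

if __name__ == "__main__":
    print("steal%% = %.1f" % steal_percent(SAMPLE_T0, SAMPLE_T1))
```

A spike in this percentage corresponds to the over-utilized case described
above, and a value near zero to the underutilized case.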

There has been similar work along these lines. The most recent reference is:
https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

Here is the broad outline of this patch series.
- If the steal time is high, identify CPUs that should not be used, reduce
  their CPU capacity and mark them as inactive. This causes unpinned tasks
  to be migrated away. Pinned tasks that cannot be migrated will continue
  to run there.
- If the steal time is low, identify CPUs that were marked as inactive,
  reset their CPU capacity, and mark them as active and available for the
  scheduler to use.
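
The two steps above can be modelled as a small feedback loop. The sketch
below is only a userspace model under stated assumptions: the threshold
values, the one-core-per-interval step and the rule of never going below the
entitled core count are all hypothetical and are not taken from the patches.

```python
# Toy model of the outline above. Everything here is an assumption made
# for illustration: the thresholds, the step of one core per sampling
# interval, and the floor at the entitled core count.

HIGH_STEAL_PCT = 10.0  # hypothetical "steal is high" threshold
LOW_STEAL_PCT = 2.0    # hypothetical "steal is low" threshold


def next_active_cores(active, total, entitled, steal_pct,
                      high=HIGH_STEAL_PCT, low=LOW_STEAL_PCT):
    """Return how many cores should be active after one sampling interval.

    - steal above `high`: soft-offline one core, but keep at least
      `entitled` cores active.
    - steal below `low`: bring one soft-offlined core back, up to `total`.
    - otherwise: no change (the gap between the thresholds avoids flapping).
    """
    if steal_pct > high and active > entitled:
        return active - 1
    if steal_pct < low and active < total:
        return active + 1
    return active


def simulate(steal_samples, total=72, entitled=24):
    """Feed a sequence of steal percentages through the model."""
    active = total
    history = []
    for steal in steal_samples:
        active = next_active_cores(active, total, entitled, steal)
        history.append(active)
    return history


if __name__ == "__main__":
    # Two intervals of high steal, one in the dead band, then low steal.
    print(simulate([15.0, 15.0, 5.0, 1.0, 1.0, 1.0]))
```

In the series itself the equivalent decision also has to pick which CPUs to
mark inactive and to adjust their capacity; the model only tracks a count.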

For our experiment we have 2 shared LPARs, each configured with 72 cores/576
CPUs and entitled to 24 cores/192 CPUs, with both LPARs sharing 64 cores/512
CPUs. The workload is 10 iterations of ebizzy. The base and +patchset
columns are relative to base. (Higher is better)
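
For reference, the arithmetic implied by this setup is spelled out below. It
assumes that "sharing 64 cores" means a shared pool of 64 physical cores;
the variable names are only for illustration.

```python
# Arithmetic for the experiment setup. Assumption: the two LPARs share a
# pool of 64 physical cores. Variable names are illustrative only.

LPARS = 2
CORES_PER_LPAR = 72
CPUS_PER_LPAR = 576
ENTITLED_CORES_PER_LPAR = 24
POOL_CORES = 64

# Hardware threads (CPUs) per core.
threads_per_core = CPUS_PER_LPAR // CORES_PER_LPAR

# Cores guaranteed to the two LPARs together.
total_entitled_cores = LPARS * ENTITLED_CORES_PER_LPAR

# Cores the two LPARs believe they have, versus what physically exists.
total_virtual_cores = LPARS * CORES_PER_LPAR
overcommit_ratio = total_virtual_cores / POOL_CORES

# Pool cores that are not covered by any entitlement.
unentitled_pool_cores = POOL_CORES - total_entitled_cores

if __name__ == "__main__":
    print(threads_per_core, total_entitled_cores,
          overcommit_ratio, unentitled_pool_cores)
```

So each LPAR can be fully satisfied only while the other one stays well below
its configured 72 cores, which is the overcommit situation described earlier.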

nonoise case, i.e. only ebizzy is running on one LPAR and the other LPAR is idle
threads  base  base-cores-used  +patchset  +patchset-cores-used
8        1     4.82             1.01958    5.025
12       1     6.855            1.01761    7.09
16       1     8.86             0.977243   8.475
24       1     13.47            0.996121   13.085
36       1     20.1             1.01447    19.79
64       1     33.2             0.976135   29.105
72       1     36.05            1.01956    35.775
144      1     55.14            1.01805    54.74
288      1     56.005           1.06081    56.575
576      1     54.945           1.07684    42.42
1152     1     54.65            1.06421    41.625

noise case, i.e. both LPARs run a similar ebizzy workload.
If one LPAR runs ebizzy with x threads, the other (noise) LPAR also runs
ebizzy with x threads, where x is 8, 12, 16, 24, ...
threads  base  base-cores-used  +patchset  +patchset-cores-used
8        1     4.805            0.982148   5.32
12       1     6.865            1.00572    7.405
16       1     8.975            0.972395   9.33
24       1     13.44            0.999339   13.525
36       1     19.95            1.00277    19.24
64       1     26.615           1.05265    26.73
72       1     27.055           0.968465   26.05
144      1     32.84            0.917759   33.23
288      1     30.365           0.957132   29.18
576      1     29.14            0.870245   23.325
1152     1     29.135           0.897712   24.36
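
One way to read the two tables together is throughput per core used. The
sketch below derives that from a few rows copied from the tables; the metric
itself is an illustration and is not something the series reports.

```python
# Derived metric, for illustration only: relative throughput per core used,
# patched versus base. Rows are copied from the tables above as
# (threads, base cores-used, +patchset relative throughput,
#  +patchset cores-used).

NONOISE_ROWS = [
    (64, 33.2, 0.976135, 29.105),
    (576, 54.945, 1.07684, 42.42),
    (1152, 54.65, 1.06421, 41.625),
]

NOISE_ROWS = [
    (576, 29.14, 0.870245, 23.325),
    (1152, 29.135, 0.897712, 24.36),
]


def per_core_gain(base_cores, rel_throughput, patched_cores):
    """Throughput per core with the patchset, relative to base (base == 1).

    Base throughput is 1 by definition, so base per-core throughput is
    1 / base_cores and patched is rel_throughput / patched_cores.
    """
    return rel_throughput * base_cores / patched_cores


if __name__ == "__main__":
    for label, rows in (("nonoise", NONOISE_ROWS), ("noise", NOISE_ROWS)):
        for threads, base_cores, rel, patched_cores in rows:
            gain = per_core_gain(base_cores, rel, patched_cores)
            print("%-8s %5d threads: %.3f" % (label, threads, gain))
```

A value above 1 means each core in use does more work with the patchset,
even in rows where the overall relative throughput is below 1.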

While there are some regressions, the patchset uses noticeably fewer cores
at the high thread counts (for example 576 and 1152 threads). Also, on
average, cache-misses, cycles, instructions and context-switches are reduced
by about 3x with the patchset in both the noise and nonoise cases. The
numbers below are relative to base. (Lower is better)

nonoise
         cache-misses     cs               cycles           instructions
threads  base  +patched   base  +patched   base  +patched   base  +patched
8        1     0.26       1     0.34       1     0.28       1     0.32
12       1     0.42       1     0.50       1     0.41       1     0.51
16       1     0.27       1     0.33       1     0.29       1     0.31
24       1     0.43       1     0.51       1     0.44       1     0.49
36       1     0.29       1     0.33       1     0.31       1     0.32
64       1     0.43       1     0.50       1     0.46       1     0.47
72       1     0.19       1     0.20       1     0.19       1     0.19
144      1     0.48       1     0.50       1     0.47       1     0.48
288      1     0.24       1     0.25       1     0.24       1     0.25
576      1     0.13       1     0.25       1     0.22       1     0.25
1152     1     0.35       1     0.34       1     0.37       1     0.33

noise
         cache-misses     cs               cycles           instructions
threads  base  +patched   base  +patched   base  +patched   base  +patched
8        1     0.26       1     0.33       1     0.28       1     0.33
12       1     0.39       1     0.52       1     0.41       1     0.48
16       1     0.27       1     0.33       1     0.29       1     0.32
24       1     0.42       1     0.51       1     0.44       1     0.48
36       1     0.35       1     0.34       1     0.32       1     0.32
64       1     0.43       1     0.50       1     0.46       1     0.48
72       1     0.20       1     0.19       1     0.19       1     0.19
144      1     0.49       1     0.51       1     0.46       1     0.45
288      1     0.23       1     0.25       1     0.21       1     0.21
576      1     0.26       1     0.25       1     0.21       1     0.20
1152     1     0.29       1     0.34       1     0.35       1     0.26
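
The "about 3x on average" figure can be cross-checked against the tables. The
sketch below does so for one column only (nonoise cache-misses) and takes the
reciprocal of the mean ratio, which is one of several reasonable ways to
average; the other columns can be checked the same way.

```python
# Cross-check of the average reduction for one column, copied from the
# nonoise table above (+patched cache-misses, base is 1 in every row).
# Using 1 / mean(ratio) as "the average reduction" is an assumption about
# how to average; it is not stated in the cover letter.

from statistics import mean

NONOISE_CACHE_MISSES = [0.26, 0.42, 0.27, 0.43, 0.29, 0.43,
                        0.19, 0.48, 0.24, 0.13, 0.35]


def reduction_factor(ratios):
    """Average reduction factor implied by a column of patched/base ratios."""
    return 1.0 / mean(ratios)


if __name__ == "__main__":
    print("mean ratio       = %.3f" % mean(NONOISE_CACHE_MISSES))
    print("reduction factor = %.2fx" % reduction_factor(NONOISE_CACHE_MISSES))
```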

However, there is still more work to be done.
Please share your inputs/feedback on these changes.

The series should apply cleanly on v6.18.
Srikar Dronamraju (17):
  sched/fair: Enable group_asym_packing in find_idlest_group
  powerpc/lpar: Reorder steal accounting calculation
  pseries/lpar: Process steal metrics
  powerpc/smp: Add num_available_cores callback for smp_ops
  pseries/smp: Query and set entitlements
  powerpc/smp: Delay processing steal time at boot
  sched/core: Set balance_callback only if CPU is dying
  sched/core: Implement CPU soft offline/online
  powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs
  powerpc/smp: Define arch_update_cpu_topology for shared LPARs
  pseries/smp: Create soft offline infrastructure for Powerpc shared
    LPARs.
  pseries/smp: Trigger softoffline based on steal metrics
  pseries/smp: Account cores when triggering softoffline
  powerpc/smp: Assume preempt if CPU is inactive.
  pseries/hotplug: Update available_cores on a dlpar event
  pseries/smp: Allow users to override steal thresholds
  pseries/lpar: Add debug interface to set steal interval

 arch/powerpc/include/asm/paravirt.h          |  62 +------
 arch/powerpc/include/asm/smp.h               |   6 +
 arch/powerpc/include/asm/topology.h          |   5 +
 arch/powerpc/kernel/smp.c                    |  38 ++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   6 +
 arch/powerpc/platforms/pseries/lpar.c        |  71 +++++++-
 arch/powerpc/platforms/pseries/pseries.h     |   8 +
 arch/powerpc/platforms/pseries/smp.c         | 173 +++++++++++++++++++
 include/linux/sched/topology.h               |   1 +
 kernel/sched/core.c                          |  50 +++++-
 kernel/sched/fair.c                          |  33 +++-
 11 files changed, 383 insertions(+), 70 deletions(-)
--
2.43.7