linux-kernel - [PATCH 00/17] Steal time based dynamic CPU resource management

Message-ID: <20251204175405.1511340-1-srikar@linux.ibm.com>
Date: Thu,  4 Dec 2025 23:23:48 +0530
From: Srikar Dronamraju <srikar@...ux.ibm.com>
To: linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
        Peter Zijlstra <peterz@...radead.org>
Cc: Ben Segall <bsegall@...gle.com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...nel.org>, Juri Lelli <juri.lelli@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Madhavan Srinivasan <maddy@...ux.ibm.com>,
        Mel Gorman <mgorman@...e.de>, Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>,
        Shrikanth Hegde <sshegde@...ux.ibm.com>,
        Srikar Dronamraju <srikar@...ux.ibm.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Swapnil Sapkal <swapnil.sapkal@....com>,
        Thomas Huth <thuth@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        virtualization@...ts.linux.dev, Yicong Yang <yangyicong@...ilicon.com>,
        Ilya Leoshkevich <iii@...ux.ibm.com>
Subject: [PATCH 00/17] Steal time based dynamic CPU resource management

VMs and shared LPARs offer flexibility and a better, more efficient use of
system resources. To achieve this, most of these setups (VMs or LPARs) have
a guaranteed or entitled share of resources. However, they are allotted more
resources than that, so that a VM can use the unused share of free/idle VMs.
Hence most of these VMs are configured to be overcommitted, i.e. each VM can
exceed its guaranteed share of resources. Here we are mostly looking at
CPUs/cores as a resource. The other option is pinning, which does provide
flexibility but not better use of system resources.

However, each VM thinks that it has access to all the allotted resources.
Hence each VM will spread its workload across as many CPUs/cores as possible.
This leads to resource contention, which hurts performance. Hence the
goal of better system utilization is not actually met.

To overcome this problem, a hint could be provided to the VMs so that the
Linux scheduler knows how many CPUs/cores have to be used. In this series,
steal time is used as that hint, so that the Linux scheduler knows how many
and which CPUs/cores are to be used. Typically, if the resources are
over-utilized by one or more of the VMs, the steal time will spike. If the
resources are underutilized, the steal time will be low. Currently this
series implements this steal based dynamic CPU resource management on
PowerVM Shared LPARs. However, since steal is a pretty generic VM attribute,
this can be extended to any architecture that has some form of steal
accounting.
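
As a rough illustration of the signal involved, the sketch below derives a
steal percentage from two samples of the aggregate "cpu" line in /proc/stat,
where steal is the eighth counter. This is a user-space approximation only,
not what the series does: the patches process steal metrics inside the
pseries code. The helper names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Return the counters on the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time reported as steal between two samples.

    Both arguments are the text of /proc/stat taken at different times.
    """
    prev = parse_cpu_line(before)
    curr = parse_cpu_line(after)
    if len(prev) < 8 or len(curr) < 8:
        raise ValueError("steal counter not present in this /proc/stat")
    deltas = [c - p for p, c in zip(prev, curr)]
    # Sum user..steal (first eight counters). guest and guest_nice are left
    # out because the kernel already accounts them in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

On a live guest the two samples would be read from /proc/stat a fixed
interval apart; a sustained rise in the result is the kind of spike the
paragraph above describes.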

If a better hinting mechanism/strategy emerges in the future, the
infrastructure could be modified to work with it.

There has been similar work along these lines. The most recent reference is
https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

Here is the broad outline of this patch series.
- If the steal time is high, identify CPUs that should not be used, reduce
  their CPU capacity and mark them as inactive. This causes unpinned tasks
  to be migrated away. Pinned tasks that can't be migrated will continue to
  run there.
- If the steal time is low, identify CPUs that were marked as inactive, reset
  their CPU capacity, and mark them as active and available for the scheduler
  to use.
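
The two rules above amount to a simple feedback loop with hysteresis. A
minimal sketch of that idea follows. It is not the code in the series: the
function name, the one-core step, the threshold values, and the choice to
never go below the entitled core count are all assumptions made for
illustration.

```python
def next_active_cores(active, steal_pct, entitled, available,
                      high_pct=10.0, low_pct=2.0):
    """Return the number of cores to keep active for the next interval.

    active     -- cores currently marked active
    steal_pct  -- steal time observed over the last interval, in percent
    entitled   -- entitled cores (assumed lower bound, hypothetical policy)
    available  -- cores the LPAR could use at most
    high_pct / low_pct -- placeholder thresholds, not values from the series
    """
    if steal_pct > high_pct and active > entitled:
        # High steal: contention in the shared pool, so give up one core.
        return active - 1
    if steal_pct < low_pct and active < available:
        # Low steal: spare capacity in the pool, so bring one core back.
        return active + 1
    # Between the thresholds nothing changes, which avoids flapping.
    return active
```

For example, with 72 available and 24 entitled cores, a reading of 25% steal
at 72 active cores yields 71, while the same reading at 24 active cores
leaves the count unchanged.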

For our experiment we have 2 shared LPARs, each having 72 cores/576 CPUs and
each entitled to 24 cores/192 CPUs, both sharing 64 cores/512 CPUs and
running 10 iterations of ebizzy. (Higher is better)
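
To make the degree of overcommit in this setup explicit, the arithmetic can
be worked through as below. This assumes the 64 cores/512 CPUs are the
physical pool shared by both LPARs; the variable names are only for
illustration.

```python
LPARS = 2
CORES_PER_LPAR = 72        # cores configured in each LPAR
CPUS_PER_LPAR = 576        # CPUs (hardware threads) in each LPAR
ENTITLED_CORES = 24        # entitled cores per LPAR
SHARED_POOL_CORES = 64     # cores in the pool both LPARs draw from

# 576 CPUs over 72 cores means 8 threads per core.
threads_per_core = CPUS_PER_LPAR // CORES_PER_LPAR

# Both LPARs together are configured with 144 cores on a 64-core pool.
configured_cores = LPARS * CORES_PER_LPAR
overcommit_ratio = configured_cores / SHARED_POOL_CORES

# Entitlements add up to 48 cores, so 16 pool cores are not entitled to
# either LPAR and are available to whichever one is busier.
entitled_total = LPARS * ENTITLED_CORES
unentitled_pool_cores = SHARED_POOL_CORES - entitled_total

print(threads_per_core, configured_cores, overcommit_ratio,
      entitled_total, unentitled_pool_cores)
```

So each LPAR can see up to 72 cores, but only 24 are guaranteed, and the two
LPARs together are configured for 2.25 times the cores that exist in the
pool.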

nonoise case, i.e. only ebizzy is running on 1 LPAR and the other LPAR is free
threads  base  cores-used  +patchset  cores-used
8        1     4.82        1.01958    5.025
12       1     6.855       1.01761    7.09
16       1     8.86        0.977243   8.475
24       1     13.47       0.996121   13.085
36       1     20.1        1.01447    19.79
64       1     33.2        0.976135   29.105
72       1     36.05       1.01956    35.775
144      1     55.14       1.01805    54.74
288      1     56.005      1.06081    56.575
576      1     54.945      1.07684    42.42
1152     1     54.65       1.06421    41.625

noise case, i.e. both LPARs running a similar ebizzy workload.
In the noise case, if one LPAR runs ebizzy with x threads, the noise/other
LPAR also runs x threads, where x is 8, 12, 16, 24, ...

threads  base  cores-used  +patchset  cores-used
8        1     4.805       0.982148   5.32
12       1     6.865       1.00572    7.405
16       1     8.975       0.972395   9.33
24       1     13.44       0.999339   13.525
36       1     19.95       1.00277    19.24
64       1     26.615      1.05265    26.73
72       1     27.055      0.968465   26.05
144      1     32.84       0.917759   33.23
288      1     30.365      0.957132   29.18
576      1     29.14       0.870245   23.325
1152     1     29.135      0.897712   24.36
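
One possible way to read the two tables above together is to look at
throughput per core used rather than raw throughput. The sketch below does
that for a few rows copied from the tables. The metric itself is an
assumption added here for illustration, not something measured separately.

```python
def per_core_gain(rel_throughput, base_cores, patched_cores):
    """Relative throughput per core used, patched versus base.

    rel_throughput -- '+patchset' column (base throughput is normalized to 1)
    base_cores     -- 'cores-used' for base
    patched_cores  -- 'cores-used' for +patchset
    A value above 1 means more work done per core with the patchset.
    """
    return rel_throughput * base_cores / patched_cores


# (case, threads, +patchset throughput, base cores-used, patched cores-used)
ROWS = [
    ("nonoise", 576, 1.07684, 54.945, 42.42),
    ("nonoise", 1152, 1.06421, 54.65, 41.625),
    ("noise", 576, 0.870245, 29.14, 23.325),
    ("noise", 1152, 0.897712, 29.135, 24.36),
]

for case, threads, rel, base_cores, patched_cores in ROWS:
    gain = per_core_gain(rel, base_cores, patched_cores)
    print(f"{case:8s} {threads:5d} threads: {gain:.3f}")
```

For the 576-thread rows this comes out at roughly 1.4 in the nonoise case
and roughly 1.1 in the noise case, i.e. even where raw throughput regresses,
the work done per core used does not.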

While there are some regressions, it is certainly using fewer cores.
Also, on average, cache-misses, cycles, instructions and context-switches
are reduced by 3x with the patchset, in both the noise and nonoise cases.
(Lower is better)

nonoise
         cache-misses    cs              cycles          instructions
threads  base  +patched  base  +patched  base  +patched  base  +patched
8        1     0.26      1     0.34      1     0.28      1     0.32
12       1     0.42      1     0.50      1     0.41      1     0.51
16       1     0.27      1     0.33      1     0.29      1     0.31
24       1     0.43      1     0.51      1     0.44      1     0.49
36       1     0.29      1     0.33      1     0.31      1     0.32
64       1     0.43      1     0.50      1     0.46      1     0.47
72       1     0.19      1     0.20      1     0.19      1     0.19
144      1     0.48      1     0.50      1     0.47      1     0.48
288      1     0.24      1     0.25      1     0.24      1     0.25
576      1     0.13      1     0.25      1     0.22      1     0.25
1152     1     0.35      1     0.34      1     0.37      1     0.33

noise
         cache-misses    cs              cycles          instructions
threads  base  +patched  base  +patched  base  +patched  base  +patched
8        1     0.26      1     0.33      1     0.28      1     0.33
12       1     0.39      1     0.52      1     0.41      1     0.48
16       1     0.27      1     0.33      1     0.29      1     0.32
24       1     0.42      1     0.51      1     0.44      1     0.48
36       1     0.35      1     0.34      1     0.32      1     0.32
64       1     0.43      1     0.50      1     0.46      1     0.48
72       1     0.20      1     0.19      1     0.19      1     0.19
144      1     0.49      1     0.51      1     0.46      1     0.45
288      1     0.23      1     0.25      1     0.21      1     0.21
576      1     0.26      1     0.25      1     0.21      1     0.20
1152     1     0.29      1     0.34      1     0.35      1     0.26
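
As a quick cross-check of the "3x on average" figure, the +patched columns
of the nonoise table above can be averaged and inverted. The values below
are copied from that table; using the arithmetic mean is a choice made here
for illustration.

```python
from statistics import mean

# +patched values from the nonoise table, one entry per thread count
# (8, 12, 16, 24, 36, 64, 72, 144, 288, 576, 1152); base is 1 throughout.
NONOISE_PATCHED = {
    "cache-misses": [0.26, 0.42, 0.27, 0.43, 0.29, 0.43,
                     0.19, 0.48, 0.24, 0.13, 0.35],
    "cs":           [0.34, 0.50, 0.33, 0.51, 0.33, 0.50,
                     0.20, 0.50, 0.25, 0.25, 0.34],
    "cycles":       [0.28, 0.41, 0.29, 0.44, 0.31, 0.46,
                     0.19, 0.47, 0.24, 0.22, 0.37],
    "instructions": [0.32, 0.51, 0.31, 0.49, 0.32, 0.47,
                     0.19, 0.48, 0.25, 0.25, 0.33],
}


def average_reduction(values):
    """Reduction factor implied by the mean of the normalized values."""
    return 1.0 / mean(values)


REDUCTIONS = {name: average_reduction(vals)
              for name, vals in NONOISE_PATCHED.items()}

for name, factor in REDUCTIONS.items():
    print(f"{name:13s} reduced by about {factor:.2f}x")
```

All four counters land between roughly 2.7x and 3.2x, which is consistent
with the summary above.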

However, there is still more work to be done.
Please let me know your inputs/feedback on these changes.
The series should apply cleanly on v6.18.

Cc: "Ben Segall <bsegall@...gle.com>"
Cc: "Christophe Leroy <christophe.leroy@...roup.eu>"
Cc: "Dietmar Eggemann <dietmar.eggemann@....com>"
Cc: "Ingo Molnar <mingo@...nel.org>"
Cc: "Juri Lelli <juri.lelli@...hat.com>"
Cc: "K Prateek Nayak <kprateek.nayak@....com>"
Cc: "linux-kernel@...r.kernel.org"
Cc: "linuxppc-dev@...ts.ozlabs.org"
Cc: "Madhavan Srinivasan <maddy@...ux.ibm.com>"
Cc: "Mel Gorman <mgorman@...e.de>"
Cc: "Michael Ellerman <mpe@...erman.id.au>"
Cc: "Nicholas Piggin <npiggin@...il.com>"
Cc: "Peter Zijlstra <peterz@...radead.org>"
Cc: "Shrikanth Hegde <sshegde@...ux.ibm.com>"
Cc: "Steven Rostedt <rostedt@...dmis.org>"
Cc: "Swapnil Sapkal <swapnil.sapkal@....com>"
Cc: "Thomas Huth <thuth@...hat.com>"
Cc: "Valentin Schneider <vschneid@...hat.com>"
Cc: "Vincent Guittot <vincent.guittot@...aro.org>"
Cc: "virtualization@...ts.linux.dev"
Cc: "Yicong Yang <yangyicong@...ilicon.com>"
Cc: "Ilya Leoshkevich <iii@...ux.ibm.com>"

Srikar Dronamraju (17):
  sched/fair: Enable group_asym_packing in find_idlest_group
  powerpc/lpar: Reorder steal accounting calculation
  pseries/lpar: Process steal metrics
  powerpc/smp: Add num_available_cores callback for smp_ops
  pseries/smp: Query and set entitlements
  powerpc/smp: Delay processing steal time at boot
  sched/core: Set balance_callback only if CPU is dying
  sched/core: Implement CPU soft offline/online
  powerpc/smp: Implement arch_scale_cpu_capacity for shared LPARs
  powerpc/smp: Define arch_update_cpu_topology for shared LPARs
  pseries/smp: Create soft offline infrastructure for Powerpc shared
    LPARs.
  pseries/smp: Trigger softoffline based on steal metrics
  pseries/smp: Account cores when triggering softoffline
  powerpc/smp: Assume preempt if CPU is inactive.
  pseries/hotplug: Update available_cores on a dlpar event
  pseries/smp: Allow users to override steal thresholds
  pseries/lpar: Add debug interface to set steal interval

 arch/powerpc/include/asm/paravirt.h          |  62 +------
 arch/powerpc/include/asm/smp.h               |   6 +
 arch/powerpc/include/asm/topology.h          |   5 +
 arch/powerpc/kernel/smp.c                    |  38 ++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   6 +
 arch/powerpc/platforms/pseries/lpar.c        |  71 +++++++-
 arch/powerpc/platforms/pseries/pseries.h     |   8 +
 arch/powerpc/platforms/pseries/smp.c         | 173 +++++++++++++++++++
 include/linux/sched/topology.h               |   1 +
 kernel/sched/core.c                          |  50 +++++-
 kernel/sched/fair.c                          |  33 +++-
 11 files changed, 383 insertions(+), 70 deletions(-)

-- 
2.43.7