Message-Id: <20250512115325.30022-1-huschle@linux.ibm.com>
Date: Mon, 12 May 2025 13:53:21 +0200
From: Tobias Huschle <huschle@...ux.ibm.com>
To: linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, sshegde@...ux.ibm.com
Subject: [RFC PATCH v3 0/4] sched/fair: introduce new scheduler group type group_parked
This series introduces the parked_cpu concept. When a CPU is marked
as parked, it is expected that nothing meaningful runs on it. To
achieve this, a new group type called group_parked is used in the
load balance path.
See the cover letter of v2 for an extensive description.
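
To make the intent a bit more concrete, below is a small standalone C
mock of the classification idea. This is not the series' code: the
actual series hooks into the load balance statistics in
kernel/sched/fair.c and queries arch_cpu_parked(), while
mock_arch_cpu_parked() and struct sg_stats are made up for this
sketch. It assumes a group is classified as group_parked when it
contains parked CPUs that still have tasks queued:

/*
 * Standalone illustration of the group_parked idea; not kernel code.
 * Assumption: a group containing parked CPUs with tasks still queued
 * is classified as group_parked, which ranks above the other group
 * types so the load balancer migrates that work to unparked CPUs.
 */
#include <stdbool.h>
#include <stdio.h>

enum group_type {
	group_has_spare,
	group_fully_busy,
	group_overloaded,
	group_parked,		/* new: nothing meaningful should run here */
};

/* Mock of the per-arch hook; an arch would derive this from host hints. */
static bool mock_arch_cpu_parked(int cpu)
{
	return cpu >= 2;	/* pretend vCPUs 2-5 are parked */
}

struct sg_stats {
	int nr_queued;		/* tasks queued on the group's CPUs */
	int nr_parked;		/* parked CPUs in the group */
};

static enum group_type classify(const struct sg_stats *sgs, bool overloaded)
{
	if (sgs->nr_parked && sgs->nr_queued)
		return group_parked;	/* pull the queued work off */
	if (overloaded)
		return group_overloaded;
	return group_has_spare;
}

int main(void)
{
	struct sg_stats sgs = { .nr_queued = 3, .nr_parked = 0 };
	int cpu;

	for (cpu = 2; cpu < 6; cpu++)	/* a group spanning vCPUs 2-5 */
		if (mock_arch_cpu_parked(cpu))
			sgs.nr_parked++;

	printf("group type: %d\n", classify(&sgs, false));
	return 0;
}
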
Adding a use case and performance metrics:
The core goal is to allow Linux systems running as guests under a
hypervisor to run more efficiently. Virtualization usually implies
that the total number of virtual CPUs overcommits the physical CPUs
actually present on the host.
==== Scenario and workload ==========================================
Therefore, the following scenario is used:
- KVM host with 10 cores, SMT 2, yielding 20 CPUs
- 8 KVM guests with 6 vCPUs each, no SMT, yielding 48 vCPUs in total
The following workload is used:
The guests communicate pair-wise with one another via distinct Linux
bridges, so we get 4 bridges, each connecting 2 guests.
Each pair runs an uperf benchmark with one guest sending 200 bytes to
the other guest and receiving 30000 bytes in return. This is done
by 50 parallel workers per pair, all running simultaneously for 400s.
==== Comparison 1 ===================================================
1. no guest vCPUs are parked
   This implies that 48 vCPUs are used, overcommitting the 20 actually
   available host CPUs.
2. guest vCPUs 2-5 are parked, meaning only vCPUs 0-1 are used
   In this case, only 16 vCPUs are explicitly used, leaving 4 host
   CPUs available for virtualization overhead.
Results:
Setup 2 provides a throughput improvement of ~24%.
==== Comparison 2 ===================================================
In addition to the uperf workload, each guest now runs 2 stress-ng
workers: stress-ng --cpu 2 --cpu-load 99 --cpu-method matrixprod
These 2 stress-ng workers are meant to consume the full CPU
entitlement of each guest.
Results:
Setup 2 provides a throughput improvement of ~50%.
As an additional metric, the bogo ops reported by stress-ng can be
considered: here setup 1 outperforms setup 2 by 23%. This is
expected, as stress-ng picks up all computation power left untouched
by uperf, so a better performing uperf consumes more CPU runtime,
taking it away from stress-ng.
This yields a trade-off between improving interrupt/lock-heavy
workloads like uperf and penalizing purely CPU-focused workloads
like stress-ng. With a 50% improvement on the uperf side and only a
23% regression on the stress-ng side, it is probably possible to
find a sweet spot.
==== Notes ==========================================================
This is of course only an initial sniff test. Additional
configurations will need to be tested, but the initial runs look
promising, in the sense that it is possible to find performance
improvements by passing information about the availability of CPU
resources from the host to the guests.
In the presented runs, all values were set statically and not
dynamically modified.
The number of usable CPU resources should be seen as part of the
topology that the system perceives from the underlying layer.
The host providing that underlying layer has to determine that
certain CPUs are currently not usable by the guest.
The guest would receive this information through
architecture-specific means, as it should perceive that it is
interacting with actual hardware rather than being virtualized.
In the case where the host observes that all of its resources are
consumed by the guests, it can pass the necessary information such
that the guests can start parking CPUs. If the host observes, that
the overall pressure on the resources is relieved, it can instruct
the guests that it is safe to unpark CPUs again.
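
As a rough sketch of how that could be wired up on the architecture
side (this is not the code from patch 4; the signature of
arch_cpu_parked() shown here, the CONFIG_SOME_ARCH guard and the
cpu_parked_mask/update_parked_cpus() helpers are assumptions for
illustration): a generic fallback never parks a CPU, and an
architecture overrides it based on state it updates whenever the
host signals a change in resource pressure.

/*
 * Sketch only: an assumed shape for the per-arch hook, not the code
 * from this series. The generic fallback never parks a CPU; an
 * architecture (s390 in patch 4) provides its own implementation
 * based on hypervisor-provided hints.
 */
#include <linux/cpumask.h>
#include <linux/types.h>

#ifdef CONFIG_SOME_ARCH		/* stands in for an arch override */

/* Hypothetical per-arch mask, updated when the host changes its hints. */
static struct cpumask cpu_parked_mask;

bool arch_cpu_parked(int cpu)
{
	return cpumask_test_cpu(cpu, &cpu_parked_mask);
}

/* Called from the arch's event handling when new hints arrive. */
void update_parked_cpus(const struct cpumask *parked)
{
	cpumask_copy(&cpu_parked_mask, parked);
}

#else

/* Generic fallback: no CPU is ever parked. */
bool arch_cpu_parked(int cpu)
{
	return false;
}

#endif
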
==== Open questions =================================================
There are a couple of issues and corner cases which need further
considerations:
- dl:         Deadline scheduling is not covered yet. There is
              probably only little overlap between systems that would
              make use of parked CPUs and systems running deadline
              scheduling.
- ext:        Probably affected as well. Needs some conceptual
              thoughts first.
- raciness:   Right now, there are no synchronization efforts. It
              needs to be considered whether synchronization is
              necessary or whether it is acceptable that the parked
              state of a CPU might change during load balancing.
- taskset:    If a task is pinned to CPUs that are all parked, the
              pinning is discarded (similar to CPU hotplug). How to
              properly notify the user needs further thought; see the
              sketch after this list.
- reporting:  Tools like lsdasd and debugfs should represent the
              parked state of CPUs.
- interrupts: Interrupts should be disabled on parked CPUs as well,
              most likely the responsibility of an implementing arch.
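
For the taskset item, the handling could end up resembling what
select_fallback_rq() already does for hotplug. The snippet below is
only an illustration of that fallback logic under the assumption of
a hypothetical unparked_cpu_mask() helper; it is not code from this
series:

/*
 * Illustration of the affinity fallback for the taskset item above;
 * not the series' code. unparked_cpu_mask() is a hypothetical helper
 * returning the set of CPUs that are currently not parked.
 */
#include <linux/cpumask.h>
#include <linux/sched.h>

const struct cpumask *unparked_cpu_mask(void);	/* hypothetical */

static int pick_unparked_cpu(struct task_struct *p)
{
	int cpu;

	/* Prefer a CPU the task is allowed on that is not parked. */
	cpu = cpumask_any_and(p->cpus_ptr, unparked_cpu_mask());
	if (cpu < nr_cpu_ids)
		return cpu;

	/*
	 * All allowed CPUs are parked: drop the pinning, similar to CPU
	 * hotplug. Notifying the user about this remains an open point.
	 */
	return cpumask_any(cpu_active_mask);
}
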
=====================================================================
Changes to v2
- provide use case and performance measurements
- add support for realtime scheduler
  The adjustments work fine for all kinds of real-time threads.
  Only those which run at 100% CPU utilization are never
  interrupted and therefore never rescheduled. This is a
  limitation for now, although scenarios that would profit from
  having parked CPUs would probably not run such uninterrupted
  real-time processes anyway.
- use h_nr_queued instead of nr_running
- remove unnecessary arch_cpu_parked check
- do not touch the idle load balancer; it seems unnecessary to
  explicitly run it, the idea could be reconsidered later
Patches apply to tip:sched/core
The s390 patch serves as a simplified implementation example.
Tobias Huschle (4):
sched/fair: introduce new scheduler group type group_parked
sched/rt: add support for parked CPUs
sched/fair: adapt scheduler group weight and capacity for parked CPUs
s390/topology: Add initial implementation for selection of parked CPUs
arch/s390/include/asm/smp.h | 2 +
arch/s390/kernel/smp.c | 5 ++
include/linux/sched/topology.h | 19 +++++++
kernel/sched/core.c | 13 ++++-
kernel/sched/fair.c | 95 +++++++++++++++++++++++++++++-----
kernel/sched/rt.c | 25 +++++++--
kernel/sched/syscalls.c | 3 ++
7 files changed, 142 insertions(+), 20 deletions(-)
--
2.34.1