Message-Id: <20250512115325.30022-1-huschle@linux.ibm.com>
Date: Mon, 12 May 2025 13:53:21 +0200
From: Tobias Huschle <huschle@...ux.ibm.com>
To: linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, sshegde@...ux.ibm.com
Subject: [RFC PATCH v3 0/4] sched/fair: introduce new scheduler group type group_parked
This series introduces the parked_cpu concept. When a CPU is marked
as parked, it is expected that nothing meaningful runs on it. To
achieve this, a new group type called group_parked is used in the
load balance path.
See the cover letter of v2 for an extensive description.
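
To make the intent a bit more concrete, below is a small standalone C
mock of the classification idea. This is not the series' code: the
actual series hooks into the load balance statistics in
kernel/sched/fair.c and queries arch_cpu_parked(), while
mock_arch_cpu_parked() and struct sg_stats are made up for this
sketch. It assumes a group is classified as group_parked when it
contains parked CPUs that still have tasks queued:

/*
 * Standalone illustration of the group_parked idea; not kernel code.
 * Assumption: a group containing parked CPUs with tasks still queued
 * is classified as group_parked, which ranks above the other group
 * types so the load balancer migrates that work to unparked CPUs.
 */
#include <stdbool.h>
#include <stdio.h>

enum group_type {
	group_has_spare,
	group_fully_busy,
	group_overloaded,
	group_parked,		/* new: nothing meaningful should run here */
};

/* Mock of the per-arch hook; an arch would derive this from host hints. */
static bool mock_arch_cpu_parked(int cpu)
{
	return cpu >= 2;	/* pretend vCPUs 2-5 are parked */
}

struct sg_stats {
	int nr_queued;		/* tasks queued on the group's CPUs */
	int nr_parked;		/* parked CPUs in the group */
};

static enum group_type classify(const struct sg_stats *sgs, bool overloaded)
{
	if (sgs->nr_parked && sgs->nr_queued)
		return group_parked;	/* pull the queued work off */
	if (overloaded)
		return group_overloaded;
	return group_has_spare;
}

int main(void)
{
	struct sg_stats sgs = { .nr_queued = 3, .nr_parked = 0 };
	int cpu;

	for (cpu = 2; cpu < 6; cpu++)	/* a group spanning vCPUs 2-5 */
		if (mock_arch_cpu_parked(cpu))
			sgs.nr_parked++;

	printf("group type: %d\n", classify(&sgs, false));
	return 0;
}
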
Adding a use case and performance metrics:
The core goal is to allow Linux systems running as guests under a
hypervisor to run more efficiently. Virtualization usually implies
that the total number of virtual CPUs overcommits the physical CPUs
actually present on the host.
==== Scenario and workload ==========================================
Therefore, the following scenario is used:
- KVM host with 10 cores, SMT 2, yielding 20 CPUs
- 8 KVM guests with 6 vCPUs each, no SMT, yielding 48 vCPUs in total
The following workload is used:
The guests communicate pair-wise with one another via distinct Linux
bridges, so we get 4 bridges, each connecting 2 guests.
Each pair runs an uperf benchmark with one guest sending 200 bytes to
the other guest and receiving 30000 bytes in return. This is done
by 50 parallel workers per pair, all running simultaneously for 400s.
==== Comparison 1 ===================================================
1. no guest vCPUs are parked
   This implies that 48 vCPUs are used, overcommitting the 20 actually
   available host CPUs.
2. guest vCPUs 2-5 are parked, meaning only vCPUs 0-1 are used
   In this case, only 16 vCPUs are explicitly used, leaving 4 host
   CPUs available for virtualization overhead.
Results:
Setup 2 provides a throughput improvement of ~24%.
==== Comparison 2 ===================================================
In addition to the uperf workload, each guest now runs 2 stress-ng
workers: stress-ng --cpu 2 --cpu-load 99 --cpu-method matrixprod
These 2 stress-ng workers are meant to consume the full CPU
entitlement of each guest.
Results:
Setup 2 provides a throughput improvement of ~50%.
As an additional metric, the bogo ops reported by stress-ng can be
considered: here setup 1 outperforms setup 2 by 23%. This is
expected, as stress-ng picks up all computation power left untouched
by uperf, so a better performing uperf consumes more CPU runtime,
taking it away from stress-ng.
This yields a trade-off between improving interrupt/lock-heavy
workloads like uperf and penalizing purely CPU-focused workloads
like stress-ng. With a 50% improvement on the uperf side and only a
23% regression on the stress-ng side, it is probably possible to
find a sweet spot.
==== Notes ==========================================================
This is of course only an initial sniff test. Additional
configurations will need to be tested, but the initial runs look
promising, in the sense that it is possible to find performance
improvements by passing information about the availability of CPU
resources from the host to the guests.
In the presented runs, all values were set statically and not
dynamically modified.
The number of usable CPU resources should be seen as part of the
topology that the system perceives from the underlying layer.
The host providing that underlying layer has to determine that
certain CPUs are currently not usable by the guest.
The guest would receive this information through
architecture-specific means, as it should perceive that it is
interacting with actual hardware rather than being virtualized.
In the case where the host observes that all of its resources are
consumed by the guests, it can pass the necessary information such
that the guests can start parking CPUs. If the host observes, that
the overall pressure on the resources is relieved, it can instruct
the guests that it is safe to unpark CPUs again.
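
As a rough sketch of how that could be wired up on the architecture
side (this is not the code from patch 4; the signature of
arch_cpu_parked() shown here, the CONFIG_SOME_ARCH guard and the
cpu_parked_mask/update_parked_cpus() helpers are assumptions for
illustration): a generic fallback never parks a CPU, and an
architecture overrides it based on state it updates whenever the
host signals a change in resource pressure.

/*
 * Sketch only: an assumed shape for the per-arch hook, not the code
 * from this series. The generic fallback never parks a CPU; an
 * architecture (s390 in patch 4) provides its own implementation
 * based on hypervisor-provided hints.
 */
#include <linux/cpumask.h>
#include <linux/types.h>

#ifdef CONFIG_SOME_ARCH		/* stands in for an arch override */

/* Hypothetical per-arch mask, updated when the host changes its hints. */
static struct cpumask cpu_parked_mask;

bool arch_cpu_parked(int cpu)
{
	return cpumask_test_cpu(cpu, &cpu_parked_mask);
}

/* Called from the arch's event handling when new hints arrive. */
void update_parked_cpus(const struct cpumask *parked)
{
	cpumask_copy(&cpu_parked_mask, parked);
}

#else

/* Generic fallback: no CPU is ever parked. */
bool arch_cpu_parked(int cpu)
{
	return false;
}

#endif
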
==== Open questions =================================================
There are a couple of issues and corner cases which need further
considerations:
- dl:         Deadline scheduling is not covered yet. There is
              probably only little overlap between systems that would
              make use of parked CPUs and systems running deadline
              scheduling.
- ext:        Probably affected as well. Needs some conceptual
              thoughts first.
- raciness:   Right now, there are no synchronization efforts. It
              needs to be considered whether synchronization is
              necessary or whether it is acceptable that the parked
              state of a CPU might change during load balancing.
- taskset:    If a task is pinned to CPUs that are all parked, the
              pinning is discarded (similar to CPU hotplug). How to
              properly notify the user needs further thought; see the
              sketch after this list.
- reporting:  Tools like lsdasd and debugfs should represent the
              parked state of CPUs.
- interrupts: Interrupts should be disabled on parked CPUs as well,
              most likely the responsibility of an implementing arch.
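
For the taskset item, the handling could end up resembling what
select_fallback_rq() already does for hotplug. The snippet below is
only an illustration of that fallback logic under the assumption of
a hypothetical unparked_cpu_mask() helper; it is not code from this
series:

/*
 * Illustration of the affinity fallback for the taskset item above;
 * not the series' code. unparked_cpu_mask() is a hypothetical helper
 * returning the set of CPUs that are currently not parked.
 */
#include <linux/cpumask.h>
#include <linux/sched.h>

const struct cpumask *unparked_cpu_mask(void);	/* hypothetical */

static int pick_unparked_cpu(struct task_struct *p)
{
	int cpu;

	/* Prefer a CPU the task is allowed on that is not parked. */
	cpu = cpumask_any_and(p->cpus_ptr, unparked_cpu_mask());
	if (cpu < nr_cpu_ids)
		return cpu;

	/*
	 * All allowed CPUs are parked: drop the pinning, similar to CPU
	 * hotplug. Notifying the user about this remains an open point.
	 */
	return cpumask_any(cpu_active_mask);
}
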
=====================================================================
Changes to v2
- provide use case and performance measurements
- add support for realtime scheduler
  The adjustments work fine for all kinds of real-time threads.
  Only those which run at 100% CPU utilization are never
  interrupted and therefore never rescheduled. This is a
  limitation for now, although scenarios that would profit from
  having parked CPUs would probably not run such uninterrupted
  real-time processes anyway.
- use h_nr_queued instead of nr_running
- remove unnecessary arch_cpu_parked check
- do not touch the idle load balancer; it seems unnecessary to
  explicitly run it, the idea could be reconsidered later
Patches apply to tip:sched/core
The s390 patch serves as a simplified implementation example.
Tobias Huschle (4):
sched/fair: introduce new scheduler group type group_parked
sched/rt: add support for parked CPUs
sched/fair: adapt scheduler group weight and capacity for parked CPUs
s390/topology: Add initial implementation for selection of parked CPUs
arch/s390/include/asm/smp.h | 2 +
arch/s390/kernel/smp.c | 5 ++
include/linux/sched/topology.h | 19 +++++++
kernel/sched/core.c | 13 ++++-
kernel/sched/fair.c | 95 +++++++++++++++++++++++++++++-----
kernel/sched/rt.c | 25 +++++++--
kernel/sched/syscalls.c | 3 ++
7 files changed, 142 insertions(+), 20 deletions(-)
--
2.34.1