Message-ID: <20250625191108.1646208-1-sshegde@linux.ibm.com>
Date: Thu, 26 Jun 2025 00:40:59 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
maddy@...ux.ibm.com
Cc: sshegde@...ux.ibm.com, vschneid@...hat.com, dietmar.eggemann@....com,
rostedt@...dmis.org, kprateek.nayak@....com, huschle@...ux.ibm.com,
srikar@...ux.ibm.com, linux-kernel@...r.kernel.org,
christophe.leroy@...roup.eu, linuxppc-dev@...ts.ozlabs.org,
gregkh@...uxfoundation.org
Subject: [RFC v2 0/9] cpu avoid state and push task mechanism
This is a follow-up to [1] with a few additions. It is still an RFC; I would
like to get feedback on the idea and suggestions for improvement.
v1->v2:
- Renamed cpu_parked_mask to cpu_avoid_mask.
- Used a static key so that there is no impact on the regular case.
- Added a sysfs file to show avoid CPUs.
- Made RT understand avoid CPUs.
- Added a documentation patch.
- Took care of the compile error reported in [1] when NR_CPUS=1.
-----------------
Problem statement
-----------------
vCPU - Virtual CPU - a CPU in the VM world.
pCPU - Physical CPU - a CPU in the bare-metal world.
A hypervisor manages the vCPUs of different VMs. When a vCPU requests CPU
time, the hypervisor does the job of scheduling it on a pCPU.
The issue occurs when there are more vCPUs (combined across all VMs) than
pCPUs. When *all* vCPUs request CPU time, the hypervisor can run only a few
of them and the rest are preempted (waiting for a pCPU).
Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU of
VM2, it has to save/restore the VM context. If instead the VMs co-ordinate
with each other and request only a *limited* number of vCPUs, that overhead
is avoided and the context switching happens within a vCPU (less expensive).
Even when the hypervisor preempts one vCPU to run another within the same VM,
it is still more expensive than task preemption within a vCPU. So the *basic*
aim is to avoid vCPU preemption.
To achieve this, use the "CPU Avoid" concept: a vCPU marked as avoid is one
that the workload should preferably stay off at this moment.
(The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)
Contention is dynamic in nature. Whether there is contention for pCPUs is to
be detected and determined by the architecture. Archs need to update the mask
accordingly.
When there is contention, use the limited set of vCPUs indicated by the arch.
When there is no contention, use all vCPUs.
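To make the intended behaviour concrete, here is a minimal userspace sketch
of the selection rule described above. It is not the code from this series:
the sketch_* names, the single 64-bit word standing in for cpu_avoid_mask and
the plain bool standing in for the static key are all assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_VCPUS 64	/* illustrative limit: one bit per vCPU */

/* Hypothetical stand-in for cpu_avoid_mask: bit n set => avoid vCPU n. */
static uint64_t sketch_avoid_mask;

/* Stand-in for the static key: false means the mask is never consulted. */
static bool sketch_avoid_enabled;

/* What an arch would do when it detects (or stops seeing) contention. */
void sketch_set_avoid(unsigned int cpu, bool avoid)
{
	assert(cpu < SKETCH_NR_VCPUS);
	if (avoid)
		sketch_avoid_mask |= UINT64_C(1) << cpu;
	else
		sketch_avoid_mask &= ~(UINT64_C(1) << cpu);
	/* Only pay for the mask check while at least one vCPU is marked. */
	sketch_avoid_enabled = sketch_avoid_mask != 0;
}

bool sketch_cpu_is_avoided(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_VCPUS);
	return sketch_avoid_enabled &&
	       (sketch_avoid_mask & (UINT64_C(1) << cpu)) != 0;
}

/*
 * Pick the lowest-numbered vCPU that is in 'allowed' and not marked avoid.
 * Returns -1 when every allowed vCPU is currently marked avoid.
 */
int sketch_pick_cpu(uint64_t allowed)
{
	uint64_t usable = allowed;

	if (sketch_avoid_enabled)
		usable &= ~sketch_avoid_mask;

	for (unsigned int cpu = 0; cpu < SKETCH_NR_VCPUS; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			return (int)cpu;
	return -1;
}
```

With vCPUs 0 and 1 marked as avoid, a task allowed on vCPUs 0-3 lands on
vCPU 2. A task allowed only on vCPUs 0-1 gets -1 in this sketch, which is
exactly the case where the right policy is still undecided.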
-------------------------
To be done and Questions:
-------------------------
1. IRQs still don't understand cpu_avoid_mask. Maybe the irqbalance
code could be modified to do the same. With stress-ng --hrtimers, the IRQs
did move out of the avoid CPUs, though. So need to see whether changes to
irqbalance are required or not.
2. If a task is spawned affined only to avoid CPUs, should that fail
or throw a warning to the user?
3. Other classes such as SCHED_EXT and SCHED_DL won't understand this infra
yet.
4. Performance testing is yet to be done. The RFC has only verified the
functional aspect of whether tasks move out of avoid CPUs or not. The move
happens quite fast (around 1-2 seconds even on large systems with very high
utilization).
5. Haven't come up with an infra which could combine all push-task related
changes. They are currently spread across rt, dl and fair. Maybe some
consolidation can be done, but the choice of which tasks to push/pull still
remains in the class.
6. cpu_avoid_mask may need some sort of locking to ensure reads/writes are
correct.
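For item 6, one direction that could be explored is lock-free updates with a
reader-side snapshot. The sketch below is a userspace illustration using C11
atomics, not a proposal for the actual kernel code: the sketch_* names and
the single 64-bit word are assumptions, and a real cpumask spans several
words, so a one-word snapshot like this does not by itself settle the
question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for cpu_avoid_mask (one bit per vCPU). */
static _Atomic uint64_t sketch_avoid_bits;

/* Writer side: mark or unmark a single vCPU with an atomic read-modify-write. */
void sketch_avoid_update(unsigned int cpu, bool avoid)
{
	assert(cpu < 64);
	if (avoid)
		atomic_fetch_or(&sketch_avoid_bits, UINT64_C(1) << cpu);
	else
		atomic_fetch_and(&sketch_avoid_bits, ~(UINT64_C(1) << cpu));
}

/* Reader side: take one snapshot and test bits against that snapshot only. */
uint64_t sketch_avoid_snapshot(void)
{
	return atomic_load(&sketch_avoid_bits);
}
```

A reader that works from one snapshot sees a consistent set of avoid vCPUs
for the duration of its decision, even if the writer changes the mask in the
meantime. Whether slightly stale data is acceptable here is part of the open
question.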
[1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
Shrikanth Hegde (9):
sched/docs: Document avoid_cpu_mask and avoid CPU concept
cpumask: Introduce cpu_avoid_mask
sched/core: Don't allow to use CPU marked as avoid
sched/fair: Don't use CPU marked as avoid for wakeup and load balance
sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
sched/core: Push current task out if CPU is marked as avoid
sched: Add static key check for cpu_avoid
sysfs: Add cpu_avoid file
powerpc: add debug file for set/unset cpu avoid
Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
arch/powerpc/include/asm/paravirt.h | 2 ++
arch/powerpc/kernel/smp.c | 50 ++++++++++++++++++++++++++
drivers/base/cpu.c | 8 +++++
include/linux/cpumask.h | 17 +++++++++
kernel/cpu.c | 3 ++
kernel/sched/core.c | 50 +++++++++++++++++++++++++-
kernel/sched/fair.c | 11 +++++-
kernel/sched/rt.c | 9 +++--
kernel/sched/sched.h | 10 ++++++
10 files changed, 181 insertions(+), 4 deletions(-)
--
2.43.0