linux-kernel - [RFC v2 0/9] cpu avoid state and push task mechanism

Message-ID: <20250625191108.1646208-1-sshegde@linux.ibm.com>
Date: Thu, 26 Jun 2025 00:40:59 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
        maddy@...ux.ibm.com
Cc: sshegde@...ux.ibm.com, vschneid@...hat.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, kprateek.nayak@....com, huschle@...ux.ibm.com,
        srikar@...ux.ibm.com, linux-kernel@...r.kernel.org,
        christophe.leroy@...roup.eu, linuxppc-dev@...ts.ozlabs.org,
        gregkh@...uxfoundation.org
Subject: [RFC v2 0/9] cpu avoid state and push task mechanism

This is a follow-up to [1] with a few additions. It is still an RFC, and I
would like to get feedback on the idea and suggestions for improvement.

v1->v2:
- Renamed cpu_parked_mask to cpu_avoid_mask.
- Used a static key so that there is no impact on the regular case.
- Added a sysfs file to show avoid CPUs.
- Made RT understand avoid CPUs.
- Added a documentation patch.
- Took care of the compile error reported in [1] when NR_CPUS=1.

-----------------
Problem statement
-----------------
vCPU - Virtual CPUs - CPU in VM world.
pCPU - Physical CPUs - CPU in baremetal world.

A hypervisor manages these vCPUs from different VMs. When a vCPU requests
CPU time, the hypervisor does the job of scheduling it on a pCPU.

The issue occurs when there are more vCPUs (combined across all VMs) than
pCPUs. When *all* vCPUs request CPU time, the hypervisor can run only a few
of them and the remaining ones are preempted (waiting for a pCPU).

Take two VMs. When the hypervisor preempts a vCPU from VM1 to run a vCPU
from VM2, it has to save/restore the VM context. If instead the VMs
coordinate with each other and request only a *limited* number of vCPUs, the
above overhead is avoided and context switching happens within a vCPU (less
expensive). Even if the hypervisor preempts one vCPU to run another within
the same VM, that is still more expensive than task preemption within a
vCPU. So the *basic* aim is to avoid vCPU preemption.
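
To put rough numbers on the overcommit case, here is a minimal sketch of the
arithmetic. The helper name and the VM sizes in the note below are
hypothetical and only for illustration; none of this is code from the series.

```c
/*
 * Hypothetical model: number of vCPUs left waiting for a pCPU when
 * `runnable_vcpus` vCPUs (summed across all VMs) want to run at the
 * same time on `pcpus` physical CPUs.
 */
static unsigned int model_waiting_vcpus(unsigned int runnable_vcpus,
					unsigned int pcpus)
{
	return runnable_vcpus > pcpus ? runnable_vcpus - pcpus : 0;
}
```

For example, with two VMs of 8 vCPUs each on 8 pCPUs, all 16 vCPUs being
runnable leaves 8 of them waiting. If each VM restricts itself to 4 vCPUs,
none of them wait and the remaining switching is between tasks inside a vCPU.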

To achieve this, use the "CPU Avoid" concept: a vCPU marked as avoid is one
that the workload is better off not using at this moment.
(The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)

Contention is dynamic in nature. When there is contention for pCPUs is for
the architecture to detect and determine. Archs need to update the mask
accordingly.

When there is contention, use limited vCPUs as indicated by arch.
When there is no contention, use all vCPUs.
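
The rule above can be sketched as a small userspace model. This is not the
kernel implementation: the 64-bit mask type and the model_* names are
assumptions made for illustration, and the real series works on struct
cpumask inside the scheduler classes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per vCPU, at most 64 vCPUs. */
typedef uint64_t cpumask_model_t;

/* Stands in for cpu_avoid_mask; updated by the "arch" side. */
static cpumask_model_t avoid_mask_model;

/* Arch-side update: mark or unmark a vCPU as avoid. */
static void model_set_cpu_avoid(unsigned int cpu, bool avoid)
{
	if (cpu >= 64)
		return;
	if (avoid)
		avoid_mask_model |= UINT64_C(1) << cpu;
	else
		avoid_mask_model &= ~(UINT64_C(1) << cpu);
}

/*
 * Scheduler-side use: pick the lowest-numbered vCPU that the task is
 * allowed on and that is not marked as avoid. Returns -1 when every
 * allowed vCPU is currently marked as avoid.
 */
static int model_select_cpu(cpumask_model_t allowed)
{
	cpumask_model_t usable = allowed & ~avoid_mask_model;

	for (int cpu = 0; cpu < 64; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

With an empty avoid mask, every allowed vCPU is usable. Once some vCPUs are
marked, selection is confined to the rest. The -1 case corresponds to a task
affined only to avoid CPUs, which this sketch deliberately leaves to the
caller.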

-------------------------
To be done and Questions: 
-------------------------
1. IRQs still don't understand cpu_avoid_mask. Maybe the irqbalance code
could be modified to do the same. That said, when running
stress-ng --hrtimers, IRQs did move out of the avoid CPUs. So need to see
whether changes to irqbalance are required or not.

2. If a task is spawned affined only to avoid CPUs, should that fail or
throw a warning to the user?

3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
yet.

4. Performance testing is yet to be done. The RFC has only verified the
functional aspect of whether tasks move out of avoid CPUs or not. The move
happens quite fast (around 1-2 seconds even on large systems with very high
utilization).

5. Haven't come up with an infra which could combine all push task related
changes. They are currently spread across rt, dl, fair. Maybe some
consolidation can be done, but which tasks to push/pull still remains in
the class.

6. cpu_avoid_mask may need some sort of locking to ensure read/write is
correct. 
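
On point 6, one possible direction is sketched below as a userspace model
using C11 atomics. This is an assumption for discussion, not what the series
does: the single 64-bit word and the model_* names are hypothetical, and a
mask spanning several words would need something more to get a consistent
snapshot across words.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: the whole avoid mask fits in one atomic word. */
static _Atomic uint64_t avoid_word_model;

/* Writer (arch side): mark a vCPU as avoid. */
static void model_avoid_set(unsigned int cpu)
{
	if (cpu < 64)
		atomic_fetch_or_explicit(&avoid_word_model,
					 UINT64_C(1) << cpu,
					 memory_order_release);
}

/* Writer (arch side): clear the avoid mark of a vCPU. */
static void model_avoid_clear(unsigned int cpu)
{
	if (cpu < 64)
		atomic_fetch_and_explicit(&avoid_word_model,
					  ~(UINT64_C(1) << cpu),
					  memory_order_release);
}

/* Reader (scheduler side): test one vCPU against a single atomic load. */
static bool model_cpu_is_avoid(unsigned int cpu)
{
	uint64_t snap;

	if (cpu >= 64)
		return false;
	snap = atomic_load_explicit(&avoid_word_model, memory_order_acquire);
	return (snap >> cpu) & 1;
}
```

Each update here is a single read-modify-write, so concurrent set/clear calls
on different vCPUs cannot lose each other's bits, and a reader always sees
either the old or the new word.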

[1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/

Shrikanth Hegde (9):
  sched/docs: Document avoid_cpu_mask and avoid CPU concept
  cpumask: Introduce cpu_avoid_mask
  sched/core: Don't allow to use CPU marked as avoid
  sched/fair: Don't use CPU marked as avoid for wakeup and load balance
  sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
  sched/core: Push current task out if CPU is marked as avoid
  sched: Add static key check for cpu_avoid
  sysfs: Add cpu_avoid file
  powerpc: add debug file for set/unset cpu avoid

 Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
 arch/powerpc/include/asm/paravirt.h    |  2 ++
 arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
 drivers/base/cpu.c                     |  8 +++++
 include/linux/cpumask.h                | 17 +++++++++
 kernel/cpu.c                           |  3 ++
 kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
 kernel/sched/fair.c                    | 11 +++++-
 kernel/sched/rt.c                      |  9 +++--
 kernel/sched/sched.h                   | 10 ++++++
 10 files changed, 181 insertions(+), 4 deletions(-)

-- 
2.43.0