linux-kernel - Re: [RFC v2 0/9] cpu avoid state and push task mechanism

Message-ID: <aFxwaKwBykv5shN4@yury>
Date: Wed, 25 Jun 2025 17:55:52 -0400
From: Yury Norov <yury.norov@...il.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, tglx@...utronix.de, maddy@...ux.ibm.com,
	vschneid@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org,
	kprateek.nayak@....com, huschle@...ux.ibm.com, srikar@...ux.ibm.com,
	linux-kernel@...r.kernel.org, christophe.leroy@...roup.eu,
	linuxppc-dev@...ts.ozlabs.org, gregkh@...uxfoundation.org
Subject: Re: [RFC v2 0/9] cpu avoid state and push task mechanism

On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
> This is a follow-up version of [1] with a few additions. This is still an RFC,
> and I would like to get feedback on the idea and suggestions for improvement.
> 
> v1->v2:
> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.

This one is not any better than the previous one. Why avoid? When to avoid?
I already said that: for objects, positive, self-explanatory noun
names are much better than negative and/or function-style verb
names. I suggested cpu_paravirt_mask, and I still believe it's a much
better option.

> - Used a static key so that there is no impact on the regular case.

Static keys are not free, and they are designed for a different purpose.
You have CONFIG_PARAVIRT, and I don't understand why you're trying to
avoid using it.

I don't mind static keys if you prefer them; I just want to have
feature-specific code under the corresponding config.

Can you please post a bloat-o-meter report for CONFIG_PARAVIRT=n?
Do you have any perf numbers to justify static keys here?
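
A minimal userspace sketch of the config-gated pattern mentioned above.
The helpers cpu_paravirt() and set_cpu_paravirt() and the 64-bit
stand-in for the mask are assumptions made for illustration; none of
them come from the series, and real kernel code would use the cpumask
API instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build with -DCONFIG_PARAVIRT to compile the feature in. */

#ifdef CONFIG_PARAVIRT
/* Hypothetical stand-in for the mask: bit n set means CPU n is flagged. */
static uint64_t cpu_paravirt_bits;

static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	assert(cpu < 64);
	if (on)
		cpu_paravirt_bits |= UINT64_C(1) << cpu;
	else
		cpu_paravirt_bits &= ~(UINT64_C(1) << cpu);
}

static inline bool cpu_paravirt(unsigned int cpu)
{
	assert(cpu < 64);
	return (cpu_paravirt_bits & (UINT64_C(1) << cpu)) != 0;
}
#else
/* Feature compiled out: the stubs fold to constants at every call site. */
static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	(void)cpu;
	(void)on;
}

static inline bool cpu_paravirt(unsigned int cpu)
{
	(void)cpu;
	return false;
}
#endif
```

With the config disabled, a caller such as `if (cpu_paravirt(cpu)) ...`
reduces to a constant-false branch that the compiler can drop, which is
the kind of difference a bloat-o-meter comparison would show.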

> - add sysfs file to show avoid CPUs.
> - Make RT understand avoid CPUs. 
> - Add documentation patch 
> - Took care of reported compile error in [1] when NR_CPUS=1
> 
> -----------------
> Problem statement
> -----------------
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
> 
> A hypervisor manages these vCPUs from different VMs. When a vCPU
> requests a CPU, the hypervisor does the job of scheduling it on a pCPU.
> 
> The issue occurs when there are more vCPUs (combined across all VMs)
> than pCPUs. When *all* vCPUs are requesting CPUs, the hypervisor
> can only run a few of them, and the rest will be preempted (waiting for a pCPU).
> 
> If we take two VMs, when the hypervisor preempts a vCPU from VM1 to run a vCPU from
> VM2, it has to save/restore the VM context. Instead, if the VMs can coordinate among
> each other and request for *limited*  vCPUs, it avoids the above overhead and 
                                       ^
Did this extra whitespace escape from the previous line, or the following?
                                        v
> there is context switching within the vCPU (less expensive). Even if the hypervisor
> is preempting one vCPU to run another within the same VM, it is still more
> expensive than task preemption within the vCPU. So the *basic* aim is to avoid
> vCPU preemption.
> 
> To achieve this, use the "CPU Avoid" concept, where it is better
> if the workload avoids these vCPUs at this moment.
> (vCPUs stay online; we don't want the overhead of a sched domain rebuild.)
> 
> Contention is dynamic in nature. Whether there is contention for pCPUs is to be
> detected and determined by the architecture. Archs need to update the mask
> accordingly.
> 
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
> 
> -------------------------
> To be done and Questions: 
> -------------------------
> 1. IRQs still don't understand cpu_avoid_mask. Maybe the irqbalance
> code could be modified to do the same. Ran stress-ng --hrtimers, and IRQs
> moved out of the avoid CPUs though. So need to see whether changes to
> irqbalance are required or not.
> 
> 2. If a task is spawned with affinity to only avoid CPUs, should that fail
> or throw a warning to the user?

I think it's possible that existing codebases do that. And because
you don't want to break userspace, you should not restrict it.
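
A minimal userspace sketch of one way to honor such an affinity without
restricting it. Assumptions: pick_cpu() is a hypothetical helper, the
masks are plain 64-bit words, and falling back to the task's own
affinity when every allowed CPU is flagged is an illustrative policy,
not something the series is confirmed to do.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the series: choose a CPU for a task whose
 * affinity is 'allowed', given the CPUs the arch currently asks to avoid.
 * Returns the lowest-numbered candidate CPU, or -1 if 'allowed' is empty.
 */
static int pick_cpu(uint64_t allowed, uint64_t avoid)
{
	/* Prefer allowed CPUs that are not flagged. */
	uint64_t preferred = allowed & ~avoid;
	/* If every allowed CPU is flagged, keep honoring the affinity. */
	uint64_t candidates = preferred ? preferred : allowed;

	for (int cpu = 0; cpu < 64; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

A task affined only to flagged CPUs still gets one of them here, so
userspace that pins tasks this way keeps working.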

> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
> yet.
> 
> 4. Performance testing is yet to be done. The RFC only verified the functional
> aspect of whether tasks move out of avoid CPUs or not. The move happens quite
> fast (around 1-2 seconds even on large systems with very high utilization).
> 
> 5. Haven't come up with an infra which could combine all push task related
> changes. It is currently spread across rt, dl, fair. Maybe some
> consolidation can be done, but which tasks to push/pull still remains in
> the class.
> 
> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
> correct. 
> 
> [1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
> 
> Shrikanth Hegde (9):
>   sched/docs: Document avoid_cpu_mask and avoid CPU concept
>   cpumask: Introduce cpu_avoid_mask
>   sched/core: Don't allow to use CPU marked as avoid
>   sched/fair: Don't use CPU marked as avoid for wakeup and load balance
>   sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
>   sched/core: Push current task out if CPU is marked as avoid
>   sched: Add static key check for cpu_avoid
>   sysfs: Add cpu_avoid file
>   powerpc: add debug file for set/unset cpu avoid
> 
>  Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
>  arch/powerpc/include/asm/paravirt.h    |  2 ++
>  arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
>  drivers/base/cpu.c                     |  8 +++++
>  include/linux/cpumask.h                | 17 +++++++++
>  kernel/cpu.c                           |  3 ++
>  kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
>  kernel/sched/fair.c                    | 11 +++++-
>  kernel/sched/rt.c                      |  9 +++--
>  kernel/sched/sched.h                   | 10 ++++++
>  10 files changed, 181 insertions(+), 4 deletions(-)
> 
> -- 
> 2.43.0