Message-ID: <aFxwaKwBykv5shN4@yury>
Date: Wed, 25 Jun 2025 17:55:52 -0400
From: Yury Norov <yury.norov@...il.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tglx@...utronix.de, maddy@...ux.ibm.com,
vschneid@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org,
kprateek.nayak@....com, huschle@...ux.ibm.com, srikar@...ux.ibm.com,
linux-kernel@...r.kernel.org, christophe.leroy@...roup.eu,
linuxppc-dev@...ts.ozlabs.org, gregkh@...uxfoundation.org
Subject: Re: [RFC v2 0/9] cpu avoid state and push task mechanism
On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
> This is a followup version of [1] with a few additions. This is still an RFC
> and I would like to get feedback on the idea and suggestions for improvement.
>
> v1->v2:
> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.
This one is not any better than the previous one. Why avoid? When to avoid?
I already said that: for objects, positive, self-explanatory noun
names are much better than negative and/or function-style verb
names. I suggested cpu_paravirt_mask, and I still believe it's a much
better option.
> - Used a static key such that no impact to regular case.
Static keys are not free, and they are designed for a different purpose.
You have CONFIG_PARAVIRT, and I don't understand why you're trying to
avoid using it.
I don't mind static keys if you prefer them; I just want to have
feature-specific code under the corresponding config.
Can you please post a bloat-o-meter report for CONFIG_PARAVIRT=n?
Do you have any perf numbers to justify static keys here?
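To illustrate what "feature-specific code under the corresponding config" could look like, here is a minimal userspace sketch, not kernel code. The helper names set_cpu_paravirt() and cpu_paravirt() and the 64-bit mask are hypothetical, and CONFIG_PARAVIRT is treated as an ordinary preprocessor macro. With the macro undefined, the stubs are compile-time constants, so callers' branches can be dropped without any runtime patching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#ifdef CONFIG_PARAVIRT
/* Hypothetical mask, one bit per CPU (toy limit of 64 CPUs). */
static uint64_t cpu_paravirt_bits;

static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	assert(cpu < 64);
	if (on)
		cpu_paravirt_bits |= UINT64_C(1) << cpu;
	else
		cpu_paravirt_bits &= ~(UINT64_C(1) << cpu);
}

static inline bool cpu_paravirt(unsigned int cpu)
{
	assert(cpu < 64);
	return cpu_paravirt_bits & (UINT64_C(1) << cpu);
}
#else
/* Feature compiled out: constant stubs, no data, no runtime cost. */
static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	(void)cpu;
	(void)on;
}

static inline bool cpu_paravirt(unsigned int cpu)
{
	(void)cpu;
	return false;
}
#endif
```

Building the same file with and without -DCONFIG_PARAVIRT and comparing the two objects gives the kind of size delta a bloat-o-meter report would show.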
> - add sysfs file to show avoid CPUs.
> - Make RT understand avoid CPUs.
> - Add documentation patch
> - Took care of reported compile error in [1] when NR_CPUS=1
>
> -----------------
> Problem statement
> -----------------
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor manages these vCPUs from different VMs. When a vCPU
> requests a CPU, the hypervisor does the job of scheduling it on a pCPU.
>
> So this issue occurs when there are more vCPUs(combined across all VMs)
> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor
> can only run a few of them and remaining will be preempted(waiting for pCPU).
>
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
> each other and request for *limited* vCPUs, it avoids the above overhead and
^
Did this extra whitespace escape from the previous line, or the following?
v
> there is context switching within vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another within the same VM, it is still more
> expensive than the task preemption within the vCPU. So the *basic* aim is
> to avoid vCPU preemption.
>
> So to achieve this, use "CPU Avoid" concept, where it is better
> if workload avoids these vCPUs at this moment.
> (vCPUs stay online, we don't want the overhead of sched domain rebuild).
>
> Contention is dynamic in nature. When there is contention for pCPUs is to be
> detected and determined by the architecture. Archs need to update the mask
> accordingly.
>
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
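The rule quoted above (limited vCPUs under contention, all vCPUs otherwise) can be sketched as a simple mask operation. This is a toy userspace model with a 64-bit mask standing in for struct cpumask; usable_cpus() is a hypothetical name, and the fallback to the full online mask when every CPU is marked is an assumption, not something the series states.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per vCPU; a toy stand-in for struct cpumask (at most 64 CPUs). */
typedef uint64_t toy_cpumask_t;

/*
 * Hypothetical helper: CPUs the scheduler should prefer right now.
 * 'avoid' is expected to be empty when the arch sees no pCPU contention.
 */
static toy_cpumask_t usable_cpus(toy_cpumask_t online, toy_cpumask_t avoid)
{
	toy_cpumask_t usable;

	assert(online != 0);	/* at least one CPU is always online */

	usable = online & ~avoid;

	/* Assumption: avoid is only a hint, so never return an empty mask. */
	return usable ? usable : online;
}
```

With four online CPUs (0xF) and CPUs 2-3 marked (0xC), the helper returns 0x3; with an empty avoid mask it returns all four.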
>
> -------------------------
> To be done and Questions:
> -------------------------
> 1. IRQs still don't understand this cpu_avoid_mask. Maybe irqbalance
> code could be modified to do the same. Ran stress-ng --hrtimers, irq
> moved out of avoid cpu though. So need to see if changes to irqbalance are
> required or not.
>
> 2. If a task is spawned affined to only avoid CPUs, should that fail
> or throw a warning to the user?
I think it's possible that existing code already does that. And because
you don't want to break userspace, you should not restrict it.
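One way to read "should not restrict" is: treat the avoid mask as a preference and fall back to the task's affinity when nothing else is left. The sketch below is a userspace toy under that assumption; pick_cpu() is a hypothetical name and "lowest-numbered CPU" is only a placeholder policy.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: pick the lowest-numbered allowed CPU, preferring CPUs that
 * are not marked as avoid. Masks are 64-bit, one bit per CPU.
 */
static int pick_cpu(uint64_t allowed, uint64_t avoid)
{
	uint64_t candidates;
	int cpu;

	assert(allowed != 0);	/* callers guarantee a non-empty affinity mask */

	candidates = allowed & ~avoid;
	if (!candidates)
		candidates = allowed;	/* affined only to avoid CPUs: honour affinity */

	for (cpu = 0; cpu < 64; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;

	return -1;	/* not reached when allowed != 0 */
}
```

A task affined to CPUs 2-3 with only CPU 2 marked lands on CPU 3; if both are marked, it still runs on CPU 2 instead of failing.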
> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
> yet.
>
> 4. Performance testing yet to be done. RFC only verified the functional
> aspects of whether tasks move out of avoid CPUs or not. Move happens quite
> fast (around 1-2 seconds even on large systems with very high utilization)
>
> 5. Haven't come up with an infra which could combine all push task related
> changes. It is currently spread across rt, dl, fair. Maybe some
> consolidation can be done, but which tasks to push/pull still remains in
> the class.
>
> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
> correct.
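On the locking question in item 6: if the mask is only ever updated one CPU at a time, atomic per-bit updates may be enough. In the kernel, cpumask_set_cpu() and cpumask_clear_cpu() are already built on the atomic set_bit()/clear_bit(). The userspace sketch below models that with C11 atomics; set_cpu_avoid() and cpu_avoid() are hypothetical names, and a reader that needs a consistent snapshot of several bits at once would still need something stronger.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask, one bit per CPU (toy limit of 64 CPUs). */
static _Atomic uint64_t avoid_bits;

/* Writer side: each update is a single atomic read-modify-write. */
static void set_cpu_avoid(unsigned int cpu, bool avoid)
{
	assert(cpu < 64);
	if (avoid)
		atomic_fetch_or(&avoid_bits, UINT64_C(1) << cpu);
	else
		atomic_fetch_and(&avoid_bits, ~(UINT64_C(1) << cpu));
}

/* Reader side: one atomic load, so a single bit is never seen torn. */
static bool cpu_avoid(unsigned int cpu)
{
	assert(cpu < 64);
	return atomic_load(&avoid_bits) & (UINT64_C(1) << cpu);
}
```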
>
> [1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
>
> Shrikanth Hegde (9):
> sched/docs: Document avoid_cpu_mask and avoid CPU concept
> cpumask: Introduce cpu_avoid_mask
> sched/core: Don't allow to use CPU marked as avoid
> sched/fair: Don't use CPU marked as avoid for wakeup and load balance
> sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
> sched/core: Push current task out if CPU is marked as avoid
> sched: Add static key check for cpu_avoid
> sysfs: Add cpu_avoid file
> powerpc: add debug file for set/unset cpu avoid
>
> Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
> arch/powerpc/include/asm/paravirt.h | 2 ++
> arch/powerpc/kernel/smp.c | 50 ++++++++++++++++++++++++++
> drivers/base/cpu.c | 8 +++++
> include/linux/cpumask.h | 17 +++++++++
> kernel/cpu.c | 3 ++
> kernel/sched/core.c | 50 +++++++++++++++++++++++++-
> kernel/sched/fair.c | 11 +++++-
> kernel/sched/rt.c | 9 +++--
> kernel/sched/sched.h | 10 ++++++
> 10 files changed, 181 insertions(+), 4 deletions(-)
>
> --
> 2.43.0