linux-kernel - Re: [RFC v2 0/9] cpu avoid state and push task mechanism

Message-ID: <b941d8f2-9df1-4b1a-9519-6076cd36ce9d@linux.ibm.com>
Date: Thu, 26 Jun 2025 20:03:04 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Yury Norov <yury.norov@...il.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, tglx@...utronix.de, maddy@...ux.ibm.com,
        vschneid@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org,
        kprateek.nayak@....com, huschle@...ux.ibm.com, srikar@...ux.ibm.com,
        linux-kernel@...r.kernel.org, christophe.leroy@...roup.eu,
        linuxppc-dev@...ts.ozlabs.org, gregkh@...uxfoundation.org
Subject: Re: [RFC v2 0/9] cpu avoid state and push task mechanism



On 6/26/25 03:25, Yury Norov wrote:
> On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
>> This is a follow-up version of [1] with a few additions. This is still an RFC,
>> and I would like to get feedback on the idea and suggestions for improvement.
>>
>> v1->v2:
>> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.
> 
> This one is not any better than the previous one. Why avoid? When avoid?
> I already said that: for objects, having positive self-explaining
> noun names is much better than negative and/or function-style verb
> names. I suggested cpu_paravirt_mask, and I still believe it's a much
> better option.
> 

OK. My only reason was that the CPU is always paravirtualized in those environments, right?
We want to set this mask only when there is contention for pCPUs,
so I thought the name should reflect that.


I can keep cpu_paravirt_mask. Could you please suggest set/get names which could
go with it? cpu_paravirt(cpu)?
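
To make the question concrete, here is a rough userspace sketch of what such a pair could look like if it followed the existing cpu_online(cpu) / set_cpu_online(cpu, bool) naming pattern. The names set_cpu_paravirt() and cpu_paravirt() are only placeholders for discussion, and the plain 64-bit word stands in for a real cpumask; none of this is taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: small fixed size for this sketch */

/* Userspace stand-in for cpu_paravirt_mask; one bit per CPU. */
static uint64_t cpu_paravirt_bits;

/* Hypothetical setter, named after the set_cpu_online(cpu, bool) pattern. */
static inline void set_cpu_paravirt(unsigned int cpu, bool paravirt)
{
	assert(cpu < NR_CPUS);
	if (paravirt)
		cpu_paravirt_bits |= (UINT64_C(1) << cpu);
	else
		cpu_paravirt_bits &= ~(UINT64_C(1) << cpu);
}

/* Hypothetical getter, named after the cpu_online(cpu) pattern. */
static inline bool cpu_paravirt(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return cpu_paravirt_bits & (UINT64_C(1) << cpu);
}
```

With this shape, the arch code would call set_cpu_paravirt(cpu, true) when it detects contention and set_cpu_paravirt(cpu, false) when contention goes away, while the scheduler would only ever use the getter.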

>> - Used a static key such that no impact to regular case.
> 
> Static keys are not free and are designed for a different purpose. You have
> CONFIG_PARAVIRT, and I don't understand why you're trying to avoid
> using it.
> 
> I don't mind about static keys, if you prefer them, I just want to
> have feature-specific code under corresponding config.
> 
> Can you please print bloat-o-meter report for CONFIG_PARAVIRT=n?
> Have you any perf numbers to advocate static keys here?
> 

I wanted to see whether there could be use cases other than paravirt.

One I thought of: on SMT systems under low utilization, it could yield higher IPC by keeping the tasks on
only one thread. If base_slice is kept low, latency could stay relatively low.

Another: workloads or system usage can be dynamic in nature, with peaks and troughs. During a trough, one may not want to use all
the cores (instead use SMT siblings), thereby saving some power.


Using CONFIG_PARAVIRT could end up sprinkling a few ifdefs around. I need to see how I can minimize that.
Let me get back with bloat-o-meter numbers and performance numbers.
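
One option for keeping the ifdefs out of the scheduler paths is the usual stub-in-header pattern. Below is a rough userspace sketch of it. CONFIG_PARAVIRT is modelled as a plain preprocessor define, and cpu_paravirt(), cpu_paravirt_bits and select_cpu() are placeholder names for illustration, not code from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: fixed size for this userspace sketch */

/* Compile with -DCONFIG_PARAVIRT to model CONFIG_PARAVIRT=y. */
#ifdef CONFIG_PARAVIRT
static uint64_t cpu_paravirt_bits;	/* stand-in for cpu_paravirt_mask */

static inline bool cpu_paravirt(unsigned int cpu)
{
	return cpu_paravirt_bits & (UINT64_C(1) << cpu);
}
#else
/* Stub: always false, so the compiler can drop the dependent branch. */
static inline bool cpu_paravirt(unsigned int cpu)
{
	(void)cpu;
	return false;
}
#endif

/* Hypothetical caller; it needs no ifdef of its own. */
static unsigned int select_cpu(unsigned int prev_cpu, unsigned int fallback_cpu)
{
	assert(prev_cpu < NR_CPUS && fallback_cpu < NR_CPUS);
	return cpu_paravirt(prev_cpu) ? fallback_cpu : prev_cpu;
}
```

The only ifdef lives next to the helper definition, so callers such as select_cpu() read the same in both configurations.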

>> - add sysfs file to show avoid CPUs.
>> - Make RT understand avoid CPUs.
>> - Add documentation patch
>> - Took care of reported compile error in [1] when NR_CPUS=1
>>
>> -----------------
>> Problem statement
>> -----------------
>> vCPU - Virtual CPUs - CPU in VM world.
>> pCPU - Physical CPUs - CPU in baremetal world.
>>
>> A hypervisor manages these vCPUs from different VMs. When a vCPU
>> requests CPU time, the hypervisor schedules it on a pCPU.
>>
>> The issue occurs when there are more vCPUs (combined across all VMs)
>> than pCPUs. When *all* vCPUs request CPU time, the hypervisor
>> can run only a few of them, and the rest are preempted (waiting for a pCPU).
>>
>> If we take two VMs, when the hypervisor preempts a vCPU from VM1 to run a vCPU from
>> VM2, it has to save/restore VM context. Instead, if VMs can coordinate among
>> each other and request for *limited*  vCPUs, it avoids the above overhead and
>                                         ^
> Did this extra whitespace escaped from the previous line, or the following?
>

Thanks for noticing it.
                                           v
>> there is context switching within vCPU(less expensive). Even if hypervisor
>> is preempting one vCPU to run another within the same VM, it is still more
>> expensive than task preemption within the vCPU. So the *basic* aim is to avoid
>> vCPU preemption.
>>
>> To achieve this, use the "CPU Avoid" concept, where it is better
>> if the workload avoids these vCPUs at this moment.
>> (The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)
>>
>> Contention is dynamic in nature. Whether there is contention for pCPUs is to be
>> detected and determined by the architecture. Archs need to update the mask
>> accordingly.
>>
>> When there is contention, use limited vCPUs as indicated by arch.
>> When there is no contention, use all vCPUs.
>>
>> -------------------------
>> To be done and Questions:
>> -------------------------
>> 1. IRQs still don't understand this cpu_avoid_mask. Maybe the irqbalance
>> code could be modified to do the same. I ran stress-ng --hrtimers, and IRQs
>> moved out of the avoid CPUs though. So we need to see whether changes to
>> irqbalance are required or not.
>>
>> 2. If a task is spawned with affinity to only avoid CPUs, should that fail
>> or throw a warning to the user?
> 
> I think it's possible that existing codebase will do that. And because
> you don't want to break userspace, you should not restrict.

OK, got it. Currently it is allowed.
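
For illustration only, one way a permissive policy like this could be modelled: prefer CPUs that are allowed and not marked avoid, and fall back to the task's full affinity when nothing else is left. This is a userspace sketch with 64-bit masks and a made-up helper name, pick_cpu(); it is an assumption about one possible behaviour, not a description of what the patches do.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: one 64-bit word models the cpumask */

/*
 * Hypothetical helper: return the lowest CPU the task may run on,
 * preferring CPUs not marked avoid. Returns -1 if allowed is empty.
 */
static int pick_cpu(uint64_t allowed, uint64_t avoid)
{
	uint64_t preferred = allowed & ~avoid;
	/* Affinity only to avoid CPUs: do not fail, use them anyway. */
	uint64_t mask = preferred ? preferred : allowed;

	/* The chosen set never leaves the task's affinity. */
	assert((mask & ~allowed) == 0);

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

For example, with CPUs 0-3 allowed and CPUs 0-1 marked avoid, pick_cpu(0xF, 0x3) returns 2; with only CPUs 0-1 allowed, pick_cpu(0x3, 0x3) still returns 0 rather than failing.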

> 
>> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
>> yet.
>>
>> 4. Performance testing is yet to be done. The RFC only verified the functional
>> aspect of whether tasks move out of avoid CPUs or not. The move happens quite
>> fast (around 1-2 seconds even on large systems with very high utilization).
>>
>> 5. Haven't come up with an infra which could combine all push task related
>> changes. It is currently spread across rt, dl, fair. Maybe some
>> consolidation can be done, but which tasks to push/pull still remains in
>> the class.
>>
>> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
>> correct.
>>
>> [1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
>>
>> Shrikanth Hegde (9):
>>    sched/docs: Document avoid_cpu_mask and avoid CPU concept
>>    cpumask: Introduce cpu_avoid_mask
>>    sched/core: Don't allow to use CPU marked as avoid
>>    sched/fair: Don't use CPU marked as avoid for wakeup and load balance
>>    sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
>>    sched/core: Push current task out if CPU is marked as avoid
>>    sched: Add static key check for cpu_avoid
>>    sysfs: Add cpu_avoid file
>>    powerpc: add debug file for set/unset cpu avoid
>>
>>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
>>   arch/powerpc/include/asm/paravirt.h    |  2 ++
>>   arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
>>   drivers/base/cpu.c                     |  8 +++++
>>   include/linux/cpumask.h                | 17 +++++++++
>>   kernel/cpu.c                           |  3 ++
>>   kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
>>   kernel/sched/fair.c                    | 11 +++++-
>>   kernel/sched/rt.c                      |  9 +++--
>>   kernel/sched/sched.h                   | 10 ++++++
>>   10 files changed, 181 insertions(+), 4 deletions(-)
>>
>> -- 
>> 2.43.0