linux-kernel - Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption

Message-ID: <48d66446-40be-4a4e-a5af-c19e0b8d9182@linux.ibm.com>
Date: Tue, 21 Oct 2025 11:40:23 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
        maddy@...ux.ibm.com, linux-kernel@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org, gregkh@...uxfoundation.org,
        vschneid@...hat.com, iii@...ux.ibm.com, huschle@...ux.ibm.com,
        rostedt@...dmis.org, dietmar.eggemann@....com, vineeth@...byteword.org,
        jgross@...e.com, pbonzini@...hat.com
Subject: Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU
 preemption


Hi Sean,
Thanks for taking the time to go through the series.

On 10/20/25 8:02 PM, Sean Christopherson wrote:
> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>> tl;dr
>>
>> This is a follow-up to [1] with a few fixes, addressing review comments.
>> Upgraded it to RFC PATCH from RFC.
>> Please review.
>>
>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>
>> v2 -> v3:
>> - Renamed to paravirt CPUs
> 
> There are myriad uses of "paravirt" throughout Linux and related environments,
> and none of them mean "oversubscribed" or "contended".  I assume Hillf's comments
> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
> accurate; "paravirt" is wildly misleading.

Naming has been tricky. We want a positive-sounding name that still conveys
that these CPUs are not to be used for now due to contention, and that
they may be used again once the contention is gone.


> 
>> - Folded the changes under CONFIG_PARAVIRT.
>> - Fixed the crash due to work_buf corruption while using
>>    stop_one_cpu_nowait.
>> - Added sysfs documentation.
>> - Copy most of __balance_push_cpu_stop to new one, this helps it move
>>    the code out of CONFIG_HOTPLUG_CPU.
>> - Some of the code movement suggested.
>>
>> -----------------
>> ::Detailed info::
>> -----------------
>> Problem statement
>>
>> vCPU - Virtual CPUs - CPU in VM world.
>> pCPU - Physical CPUs - CPU in baremetal world.
>>
>> A hypervisor schedules vCPUs on pCPUs. It has to give each
>> vCPU some cycles and be fair. When there are more vCPU requests than
>> pCPUs, the hypervisor has to preempt some vCPUs in order to run others.
>> This is called vCPU preemption.
>>
>> If we take two VMs, when the hypervisor preempts a vCPU from VM1 to run a vCPU from
>> VM2, it has to save/restore the VM context. If instead the VMs can coordinate among
>> each other and request a limited number of vCPUs, that overhead is avoided and
>> the context switching happens within a vCPU (less expensive). Even if the hypervisor
>> is preempting one vCPU to run another within the same VM, it is still more
>> expensive than task preemption within the vCPU. So the basic aim is to avoid
>> vCPU preemption.
>>
>> To achieve this, introduce the "Paravirt CPU" concept: vCPUs that the
>> workload is better off avoiding at this moment. (The vCPUs stay online; we don't want
>> the overhead of a sched domain rebuild, and hotplug takes a lot of time too.)
>>
>> When there is contention, don't use paravirt CPUs.
>> When there is no contention, use all vCPUs.
> 
> ...
> 
>> ------------
>> Open issues:
>>
>> - Derivation of hint from steal time is still a challenge. Some work is
>>    underway to address it.
>>
>> - Consider KVM and other hypervisors and how they could derive the hint.
>>    Need inputs from community.
> 
> Bluntly, this series is never going to land, at least not in a form that's remotely
> close to what is proposed here.  This is an incredibly simplistic way of handling
> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
> 

Could you describe these complex scenarios?

The current use case has been on two archs: powerpc and s390.
IIUC, both have a non-Linux hypervisor running on the host, with Linux guests.

Currently the s390 hypervisor has a way of marking vCPUs as Vertical High,
Vertical Medium, or Vertical Low. So when there is steal time, the arch could easily
mark the Vertical Lows as "paravirt" CPUs.
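
As a rough illustration of the s390 case above, here is a minimal userspace
sketch in Python (not kernel code, and not taken from the series). The
polarization labels mirror the Vertical High/Medium/Low classes; the threshold
value, the function name pick_paravirt_cpus and the dict layout are all
hypothetical, chosen only to show the idea.

```python
from enum import Enum


class Polarization(Enum):
    """Vertical polarization classes, as named in the mail above."""
    HIGH = "vertical-high"
    MEDIUM = "vertical-medium"
    LOW = "vertical-low"


# Hypothetical threshold: fraction of time stolen by the hypervisor above
# which the guest treats itself as contended. Not taken from the series.
STEAL_THRESHOLD = 0.10


def pick_paravirt_cpus(polarization, steal_fraction, threshold=STEAL_THRESHOLD):
    """Return the set of CPU ids to mark as "paravirt".

    polarization: dict mapping CPU id -> Polarization
    steal_fraction: observed steal time as a fraction in [0, 1]
    """
    if steal_fraction <= threshold:
        # No contention: use all vCPUs.
        return set()
    # Contention: avoid only the Vertical Low vCPUs.
    return {cpu for cpu, pol in polarization.items() if pol is Polarization.LOW}


cpus = {0: Polarization.HIGH, 1: Polarization.HIGH,
        2: Polarization.MEDIUM, 3: Polarization.LOW, 4: Polarization.LOW}
print(sorted(pick_paravirt_cpus(cpus, steal_fraction=0.02)))
print(sorted(pick_paravirt_cpus(cpus, steal_fraction=0.25)))
```

How the steal fraction is measured, and how quickly the set should converge,
is exactly the open issue mentioned in the cover letter; the sketch sidesteps
it by taking the value as an input.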

> I.e. I don't see a path to resolving all these "todos" in the changelog from the
> last patch:
> 
>   : Ideal would be get the hint from hypervisor. It would be more accurate
>   : since it has knowledge of all SPLPARs deployed in the system.
>   :
>   : Till the hint from underlying hypervisor arrives, another idea is to
>   : approximate the hint from steal time. There are some works ongoing, but
>   : not there yet due to challenges revolving around limits and
>   : convergence.
>   :
>   : Till that happens, there is a need for a debugfs file which could be used to
>   : set/unset the hint. The interface currently is a number starting from which
>   : CPUs will be marked as paravirt. It could be changed to one that takes a
>   : cpumask (list of CPUs) in future.
> 
> I see Vineeth and Steven are on the Cc.  Argh, and you even commented on their
> first RFC[1], where it was made quite clear that sprinkling one-off "hints"
> throughout the kernel wasn't a viable approach.

IIRC, that was in the other direction: the guest was asking the host to mark a vCPU
as running an RT task so that it would be boosted in the host.

> 
> I don't know the current status of the ChromeOS work, but there was agreement in
> principle that the bulk of paravirt scheduling should not need to touch the kernel
> (host or guest)[2].
> 

If, based on some event, all the tasks on a CPU have to move out, then the scheduler
needs to be involved, no? It has to move the tasks out and not schedule anything new on that CPU.

The current mechanisms, such as CPU hotplug and isolated partitions, all break task affinity.
So we need a new mechanism.
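
To make the affinity point concrete, here is a toy Python model (userspace,
not the code from the patch series). It treats the paravirt mask as a
temporary filter applied on top of each task's affinity rather than a change
to the affinity itself. The names Task and allowed_cpus, and the fallback for
a task pinned only to paravirt CPUs, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    affinity: frozenset  # CPUs the user allowed; never modified below


def allowed_cpus(task, paravirt):
    """CPUs the task should use right now.

    The paravirt set only filters the choice; task.affinity is untouched,
    so the full mask is usable again once the paravirt set becomes empty.
    """
    usable = task.affinity - paravirt
    # Assumption: a task pinned only to paravirt CPUs keeps running there
    # instead of having its affinity broken.
    return usable if usable else task.affinity


t1 = Task("worker", frozenset({0, 1, 2, 3}))
t2 = Task("pinned", frozenset({3}))

contended = frozenset({2, 3})
print(sorted(allowed_cpus(t1, contended)))    # worker avoids CPUs 2 and 3
print(sorted(allowed_cpus(t2, contended)))    # pinned task stays on CPU 3
print(sorted(allowed_cpus(t1, frozenset())))  # no contention: full affinity
```

With hotplug or isolated partitions, the equivalent of task.affinity would
itself be rewritten, and the original mask would not come back on its own
when the contention goes away.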

Note: The host is not running the Linux kernel. We are requesting the host to provide this info
through an HCALL or the VPA area.

> [1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
> [2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com

Vineeth,
what's the latest on the vcpu_boosted framework? AFAIR both guest and host were running Linux there.