Message-ID: <aPZIGCFk-Rnlc1yT@google.com>
Date: Mon, 20 Oct 2025 07:32:56 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, tglx@...utronix.de, yury.norov@...il.com,
maddy@...ux.ibm.com, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, gregkh@...uxfoundation.org,
vschneid@...hat.com, iii@...ux.ibm.com, huschle@...ux.ibm.com,
rostedt@...dmis.org, dietmar.eggemann@....com, vineeth@...byteword.org,
jgross@...e.com, pbonzini@...hat.com
Subject: Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
> tl;dr
>
> This is a follow-up to [1] with a few fixes and changes addressing review comments.
> Upgraded it to RFC PATCH from RFC.
> Please review.
>
> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>
> v2 -> v3:
> - Renamed to paravirt CPUs
There are myriad uses of "paravirt" throughout Linux and related environments,
and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
accurate; "paravirt" is wildly misleading.
> - Folded the changes under CONFIG_PARAVIRT.
> - Fixed the crash due to work_buf corruption while using
> stop_one_cpu_nowait.
> - Added sysfs documentation.
> - Copy most of __balance_push_cpu_stop to a new one; this helps move
> the code out of CONFIG_HOTPLUG_CPU.
> - Some of the code movement suggested.
>
> -----------------
> ::Detailed info::
> -----------------
> Problem statement
>
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor schedules vCPUs on pCPUs. It has to give each
> vCPU some cycles and be fair. When there are more vCPU requests than
> pCPUs, the hypervisor has to preempt some vCPUs in order to run others.
> This is called vCPU preemption.
>
> If we take two VMs: when the hypervisor preempts a vCPU from VM1 to run a vCPU from
> VM2, it has to save/restore the VM context. Instead, if VMs can coordinate among
> each other and request limited vCPUs, it avoids the above overhead and
> there is context switching within the vCPU (less expensive). Even if the hypervisor
> is preempting one vCPU to run another within the same VM, it is still more
> expensive than task preemption within the vCPU. So the basic aim is to avoid
> vCPU preemption.
>
> So to achieve this, introduce the "Paravirt CPU" concept, where it is better if
> the workload avoids these vCPUs at this moment. (vCPUs stay online; we don't want
> the overhead of sched domain rebuild, and hotplug takes a lot of time too).
>
> When there is contention, don't use paravirt CPUs.
> When there is no contention, use all vCPUs.
...
> ------------
> Open issues:
>
> - Derivation of hint from steal time is still a challenge. Some work is
> underway to address it.
>
> - Consider kvm and other hypervisors and how they could derive the hint.
> Need inputs from community.
Bluntly, this series is never going to land, at least not in a form that's remotely
close to what is proposed here. This is an incredibly simplistic way of handling
overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
I.e. I don't see a path to resolving all these "todos" in the changelog from the
last patch:
: Ideal would be to get the hint from the hypervisor. It would be more accurate
: since it has knowledge of all SPLPARs deployed in the system.
:
: Till the hint from underlying hypervisor arrives, another idea is to
: approximate the hint from steal time. There are some works ongoing, but
: not there yet due to challenges revolving around limits and
: convergence.
:
: Till that happens, there is a need for a debugfs file which could be used to
: set/unset the hint. The interface currently is a number starting from which
: CPUs will be marked as paravirt. It could be changed to one that takes a
: cpumask (list of CPUs) in future.
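To make the two interface shapes in the quoted changelog concrete, here is a minimal userspace sketch. It is illustrative only and is not code from the series: the function names are hypothetical, and the "2-3,6" list syntax is assumed to follow the usual kernel cpulist convention. It shows that a single threshold can only mark a contiguous tail of CPUs, whereas a cpulist can mark an arbitrary set.

```python
from __future__ import annotations


def mask_from_threshold(nr_cpus: int, first_marked: int) -> set[int]:
    """Threshold form: every CPU numbered first_marked or higher is marked.

    Hypothetical helper; models "a number starting from which CPUs will be
    marked" as described in the quoted changelog.
    """
    if not 0 <= first_marked <= nr_cpus:
        raise ValueError("threshold out of range")
    return set(range(first_marked, nr_cpus))


def mask_from_cpulist(cpulist: str) -> set[int]:
    """Cpulist form, e.g. "2-3,6": an arbitrary set of CPUs can be marked.

    Assumes comma-separated entries that are either "N" or "N-M".
    """
    marked: set[int] = set()
    for part in cpulist.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        if last < first:
            raise ValueError(f"bad range: {part!r}")
        marked.update(range(first, last + 1))
    return marked


def usable_cpus(nr_cpus: int, marked: set[int]) -> set[int]:
    """CPUs the workload would still use while the marked ones are avoided."""
    return set(range(nr_cpus)) - marked


if __name__ == "__main__":
    print(sorted(usable_cpus(8, mask_from_threshold(8, 6))))
    print(sorted(usable_cpus(8, mask_from_cpulist("2-3,6"))))
```

With 8 CPUs, a threshold of 6 leaves CPUs 0-5 usable, while the list "2-3,6" leaves 0, 1, 4, 5 and 7. The threshold form cannot express that second layout.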
I see Vineeth and Steven are on the Cc. Argh, and you even commented on their
first RFC[1], where it was made quite clear that sprinkling one-off "hints"
throughout the kernel wasn't a viable approach.
I don't know the current status of the ChromeOS work, but there was agreement in
principle that the bulk of paravirt scheduling should not need to touch the kernel
(host or guest)[2].
[1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
[2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com