Message-ID: <20251120225603.9460-1-hdanton@sina.com>
Date: Fri, 21 Nov 2025 06:56:00 +0800
From: Hillf Danton <hdanton@...a.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org,
peterz@...radead.org,
seanjc@...gle.com,
kprateek.nayak@....com
Subject: Re: [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
On Thu, 20 Nov 2025 20:24:13 +0530 Shrikanth Hegde wrote:
> On 11/20/25 3:18 AM, Hillf Danton wrote:
> > On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
> >> Add documentation for a new cpumask called cpu_paravirt_mask. This could
> >> help users understand what this mask is and the concept behind it.
> >>
> >> Signed-off-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
> >> ---
> >> Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
> >> 1 file changed, 37 insertions(+)
> >>
> >> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >> index ed07efea7d02..6972c295013d 100644
> >> --- a/Documentation/scheduler/sched-arch.rst
> >> +++ b/Documentation/scheduler/sched-arch.rst
> >> @@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
> >> arch/x86/kernel/process.c has examples of both polling and
> >> sleeping idle functions.
> >>
> >> +Paravirt CPUs
> >> +=============
> >> +
> >> +Under virtualised environments it is possible to overcommit CPU resources,
> >> +i.e. the sum of virtual CPUs (vCPUs) across all VMs is greater than the
> >> +number of physical CPUs (pCPUs). Under such conditions, when all or many VMs
> >> +have high utilization, the hypervisor cannot satisfy the CPU requirement and
> >> +has to context switch within or across VMs, i.e. the hypervisor needs to
> >> +preempt one vCPU to run another. This is called vCPU preemption. It is more
> >> +expensive than a task context switch within a vCPU.
> >> +
> > What is missing is
> > 1) vCPU preemption is X% more expensive compared to a task context switch within a vCPU.
> >
>
> This would vary from arch to arch IMO. I will try to get numbers from the PowerVM hypervisor.
>
> >> +In such cases it is better that VMs co-ordinate among themselves and ask for
> >> +less CPU time by not using some of their vCPUs. vCPUs on which work can be
> >> +avoided for the moment, to reduce vCPU preemption, are called "Paravirt CPUs".
> >> +Note that when the pCPU contention goes away, these vCPUs can be used again
> >> +by the workload.
> >> +
> > 2) given X, how to work out Y, the number of Paravirt CPUs for the simple
> > scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?
> >
>
> Y need not depend on X. Note that CPUs are marked as paravirt only when both VMs
> end up consuming all of the CPU resources.
>
To check that dependence, the frequency of vCPU preemption can be set to
100 Hz and the frequency of task context switches within a vCPU to 250 Hz,
on top of __zero__ Y (actually what we can do before this work), to compare
with the result of whatever Y this work selects.
BTW the workload on the vCPUs can be compiling the Linux kernel with -j 8.
> Different cases:
> 1. VM1 is idle and VM2 is idle - No vCPUs are marked as paravirt.
> 2. VM1 is 100% busy and VM2 is idle - No steal time is seen - No vCPUs are marked as paravirt.
> 3. VM1 is idle and VM2 is 100% busy - No steal time is seen - No vCPUs are marked as paravirt.
> 4. VM1 is 100% busy and VM2 is 100% busy - 50% steal time would be seen in each -
> Since there are only 8 pCPUs (assuming each VM is allocated equally), 4 vCPUs in
> each VM will be marked as paravirt. The workload consolidates onto the remaining
> 4 vCPUs and hence no steal time will be seen. A benefit would be seen since the
> host doesn't need to do expensive VM context switches.