Message-ID: <f6f122cf-daf4-4e31-af42-4f12761aa1da@linux.ibm.com>
Date: Tue, 27 May 2025 23:00:21 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Yury Norov <yury.norov@...il.com>, Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
        tglx@...utronix.de, maddy@...ux.ibm.com, vschneid@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, jstultz@...gle.com,
        kprateek.nayak@....com, huschle@...ux.ibm.com, srikar@...ux.ibm.com,
        linux-kernel@...r.kernel.org, linux@...musvillemoes.dk
Subject: Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism


Hi Peter, Yury.

Thanks for taking a look at this series.


On 5/27/25 21:17, Yury Norov wrote:
> On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
>> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
>>> In a para-virtualised environment, there could be multiple
>>> overcommitted VMs, i.e. sum of virtual CPUs (vCPUs) > physical CPUs (pCPUs).
>>> When all such VMs request cpu cycles at the same time, it is not possible
>>> to serve all of them. This leads to VM level preemptions and hence
>>> steal time.
>>>
>>> Bring the notion of a CPU parked state, which implies the underlying
>>> pCPU may not be available for use at this time. This means it is better
>>> to avoid this vCPU. So when a CPU is marked as parked, one should vacate
>>> it as soon as possible. It is going to be dynamic at runtime and can
>>> change often.
>>
>> You've lost me here already. Why would pCPU not be available? Simply
>> because it is running another vCPU? I would say this means the pCPU is
>> available, it's just doing something else.
>>
>> Not available to me means it is going offline or something like that.
>>
>>> In general, task level preemption (driven by the VM) is less expensive
>>> than VM level preemption (driven by the hypervisor). So packing onto
>>> fewer CPUs helps to improve the overall workload throughput/latency.
>>
>> This seems to suggest you're 'parking' vCPUs, while above you seemed to
>> suggest pCPU. More confusion.

Yes. I meant parking of vCPUs only. A pCPU is running one of those vCPUs at any point in time.

>>
>>> CPU parking and the need for it have been explained here as well [1].
>>> Much of the context explained in that cover letter applies to this
>>> problem context as well.
>>> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
>>
>> Yeah, totally not following any of that either :/
>>
>>
>> Mostly I have only confusion and no idea what you're actually wanting to
>> do.
> 
> My wild guess is that the idea is to not preempt the pCPU while running
> a particular vCPU workload. But I agree, this should all be reworded and
> explained better. I didn't understand this, either.
> 
> Thanks,
> Yury

Apologies for not explaining it clearly. Let me take another shot at it:

----------------------------

vCPU - virtual CPU - a CPU as seen inside the VM.
pCPU - physical CPU - a CPU on the bare-metal host.

The hypervisor manages the vCPUs of all these VMs. When a vCPU requests CPU time, the hypervisor does the job
of scheduling it on a pCPU.

The issue occurs when there are more vCPUs (combined across all VMs) than pCPUs. When *all* vCPUs request
CPU time at once, the hypervisor can only run some of them; the rest are preempted (left waiting for a pCPU).
For example, two VMs with 8 vCPUs each sharing 8 pCPUs: if both VMs are fully busy, only half of the 16 vCPUs
can run at any instant.


Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU of VM2, it has to save/restore the
VM context. If instead the VMs coordinate among themselves and each requests only a *limited* number of vCPUs,
that overhead is avoided and the context switching happens within a vCPU (less expensive). Even when the
hypervisor preempts one vCPU to run another within the *same* VM, it is still more expensive than task
preemption inside the vCPU. So the *basic* aim is to avoid vCPU preemption.


To achieve this, we use the parking concept (we need a better name, for sure): when a vCPU is parked, it is
better if workloads avoid it for the moment. (Parked vCPUs stay online; we don't want the overhead of a
sched domain rebuild.)
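
To make the idea concrete, here is a minimal sketch of what the scheduler
side could look like. The names cpu_parked_mask, cpu_parked() and
select_non_parked_cpu() are purely illustrative assumptions, not
necessarily what this series implements:

#include <linux/cpumask.h>

/* Hypothetical illustration only, not the actual series code. */
static struct cpumask cpu_parked_mask;  /* vCPUs to vacate, set by arch */

static inline bool cpu_parked(int cpu)
{
        return cpumask_test_cpu(cpu, &cpu_parked_mask);
}

/* e.g. during wakeup placement, prefer a non-parked CPU */
static int select_non_parked_cpu(int prev_cpu, const struct cpumask *allowed)
{
        int cpu;

        if (!cpu_parked(prev_cpu))
                return prev_cpu;

        for_each_cpu(cpu, allowed) {
                if (!cpu_parked(cpu))
                        return cpu;
        }

        /* every allowed CPU is parked: running somewhere beats nowhere */
        return prev_cpu;
}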


Contention is dynamic in nature. Whether there is contention for pCPUs is to be detected and determined
by the architecture; the arch needs to update the parked mask regularly.
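
For example (again only a sketch, reusing the hypothetical cpu_parked_mask
from above), an arch could sample a hypervisor-specific contention signal
such as steal time and refresh the mask periodically;
arch_vcpu_is_contended() is an assumption, not the series' actual interface:

static void arch_update_parked_cpus(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                /*
                 * arch_vcpu_is_contended() stands in for whatever
                 * signal (e.g. steal time above a threshold) the
                 * arch uses to detect pCPU contention.
                 */
                if (arch_vcpu_is_contended(cpu))
                        cpumask_set_cpu(cpu, &cpu_parked_mask);
                else
                        cpumask_clear_cpu(cpu, &cpu_parked_mask);
        }
}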

When there is contention, use only the limited set of vCPUs indicated by the arch.
When there is no contention, use all vCPUs.
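
Putting the two modes together, placement could pick its CPU set with
something like this (hypothetical names again, building on the sketches
above):

/* Which CPUs should task placement consider right now? */
static void cpus_for_placement(struct cpumask *dst)
{
        /* No contention: the parked mask is empty, use all online vCPUs. */
        if (cpumask_empty(&cpu_parked_mask)) {
                cpumask_copy(dst, cpu_online_mask);
                return;
        }

        /* Contention: pack onto the non-parked vCPUs only. */
        cpumask_andnot(dst, cpu_online_mask, &cpu_parked_mask);
}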

