Message-ID: <2e97c804-c67a-4c92-94c9-d47a6648439c@amd.com>
Date: Fri, 12 Sep 2025 14:18:15 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
CC: <vschneid@...hat.com>, <iii@...ux.ibm.com>, <huschle@...ux.ibm.com>,
<rostedt@...dmis.org>, <dietmar.eggemann@....com>, <vineeth@...byteword.org>,
<jgross@...e.com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
<mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>, <tglx@...utronix.de>, <yury.norov@...il.com>,
<maddy@...ux.ibm.com>, <linux-kernel@...r.kernel.org>,
<linuxppc-dev@...ts.ozlabs.org>, <gregkh@...uxfoundation.org>
Subject: Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt
CPU
Hello Shrikanth,
On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
>
>
> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>> Hello Shrikanth,
>>
>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>> + if (is_cpu_paravirt(cpu))
>>>>> + push_current_from_paravirt_cpu(rq);
>>>>
>>>> Does this mean a paravirt CPU is capable of handling an interrupt but may
>>>> not be continuously available to run a task?
>>>
>>> When I run hackbench, which involves a fair bit of IRQ activity, it moves out.
>>>
>>> For example,
>>>
>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>
>>> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
>>> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
>>> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
>>> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>>
>>>
>>> There could be some workloads where the IRQs don't move out; those would need an irqbalance change.
>>> Looking into it.
>>>
>>>> Or is the VMM expected to set
>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>> task before yanking it away from the pCPU?
>>>>
>>>
>>> If the vCPU is running something, it is going to run on a pCPU at some point.
>>> The hypervisor will give cycles to this vCPU by preempting some other vCPU.
>>>
>>> The idea is that, using this infra, there should be nothing running on that paravirt vCPU.
>>> That way, collectively, the VMM gets only a limited number of requests for pCPUs, which it
>>> can satisfy without vCPU preemption.
>>
>> Ack! Just wanted to understand the usage.
>>
>> P.S. I remember discussions during last LPC where we could communicate
>> this unavailability via CPU capacity. Was that problematic for some
>> reason? Sorry if I didn't follow this discussion earlier.
>>
>
> Thanks for that question. It gives an opportunity to retrospect.
>
> Yes, that's where we started, but it has a lot of implementation challenges.
> Still an option though.
>
> History up to the current state:
>
> 1. At LPC24 we presented the problem statement and why existing approaches such as hotplug,
> cpuset cgroups or taskset are not viable solutions. Hotplug would have come in handy if its cost were low.
> The overhead of the sched domain rebuild and the serial nature of hotplug make it a non-viable option.
> One of the possible approaches was CPU capacity.
Ack. Is creating an isolated partition on the fly also too expensive?
I don't think creation of that partition is serialized, and it should
achieve a similar result with a single sched-domain rebuild. I'm
hoping the VMM doesn't change the paravirt mask at an alarming rate.
P.S. Some stupid benchmarking on a 256-CPU machine:
mkdir /sys/fs/cgroup/isol/
echo isolated > /sys/fs/cgroup/isol/cpuset.cpus.partition
time for i in {1..1000}; do \
echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
done
real 2m50.016s
user 0m0.198s
sys 1m47.708s
So that is about (170 sec / 2000) ~ 85 ms per cpuset operation.
Definitely more expensive than setting the paravirt mask, but compare that to:
time bash -c '\
for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done;'
real 0m5.046s
user 0m0.014s
sys 0m0.110s
Definitely less expensive than a full hotplug.
>
> 2. Issues with the CPU capacity approach:
> a. Need to make group_misfit_task the highest priority. That alone would break big.LITTLE,
> since it relies on group_misfit while group_overloaded should have higher priority there.
> b. At high concurrency, tasks still moved to those CPUs with capacity = 1.
> c. A lot of scheduler stats would need to be aware of the change in capacity, especially in load balance/wakeup.
Ack. Thinking out loud: Can capacity go to 0 via the HW pressure interface?
Maybe we can toggle the "sched_asym_cpucapacity" static branch without
actually having SD_ASYM_CPUCAPACITY in these special cases to let
asym_fits_cpu() steer away from these zero-capacity CPUs.
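To illustrate, an untested sketch of the check I have in mind
(fits_paravirt_cpu() is a made-up helper; arch_scale_cpu_capacity()
is the existing interface):

	static inline bool fits_paravirt_cpu(unsigned long util, int cpu)
	{
		/*
		 * A CPU whose capacity was dropped to 0 can never fit
		 * any task utilization, so wakeup placement would skip
		 * it without needing SD_ASYM_CPUCAPACITY in the domains.
		 */
		return util < arch_scale_cpu_capacity(cpu);
	}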
> d. In update_group_misfit(), the misfit load would need to be set based on capacity. The current code sets it to 0
> because of the task_fits_cpu() handling.
> e. More challenges in RT.
>
> That's when Tobias introduced a new group type called group_parked.
> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
> It has a relatively cleaner implementation compared to the CPU capacity approach.
>
> It had a few disadvantages too:
> 1. It used to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
> concern.
> 2. It needs a few stats-based changes in update_sg_lb_stats(), which might be tricky in all scenarios.
>
> That's when we were exploring how tasks move out when a CPU goes offline. That happens quite fast too.
> So we tried a similar mechanism, and this is where we are right now.
I agree push is great from that perspective.
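For anyone following along, here is roughly the shape I understand the
series to take (a sketch only, not the actual patch; the
push_paravirt_cpu_stop() callback name is made up, while
stop_one_cpu_nowait() and rq->push_work are the existing stopper pieces):

	static void push_current_from_paravirt_cpu(struct rq *rq)
	{
		/*
		 * Queue stopper work on this CPU: the stopper thread
		 * preempts the current task almost immediately and the
		 * callback can then migrate it off the paravirt CPU,
		 * much like the hotplug path does.
		 */
		stop_one_cpu_nowait(cpu_of(rq), push_paravirt_cpu_stop,
				    rq->curr, &rq->push_work);
	}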
>
>> [..snip..]
>>>>> + local_irq_save(flags);
>>>>> + preempt_disable();
>>>>
>>>> Disabling IRQs implies preemption is disabled.
>>>>
>>>
>>> In most places, stop_one_cpu_nowait() is called with preemption & IRQs disabled.
>>> The stopper runs at the next possible opportunity.
>>
>> But is there any reason to do both local_irq_save() and
>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>
>> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>>
>> so disabling IRQs should be sufficient, right? Or am I missing something?
>>
>
> Commit f0498d2a54e79 ("sched: Fix stop_one_cpu_nowait() vs hotplug") from
> Peter Zijlstra could be the answer you are looking for.
I think in all the cases covered by that commit, task_rq_unlock() would
have enabled interrupts, which is what required that specific pattern, but here
we have preempt_disable() within a local_irq_save() section, which might not
be necessary.
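To make the comparison concrete, roughly (names abbreviated):

	/*
	 * Pattern that commit relies on: preemption stays disabled
	 * across the unlock, which re-enables IRQs, so the stopper
	 * cannot run until the work is queued:
	 */
	preempt_disable();
	task_rq_unlock(rq, p, &rf);	/* re-enables IRQs */
	stop_one_cpu_nowait(cpu, fn, arg, &work);
	preempt_enable();

	/*
	 * versus the hunk here, where IRQs stay disabled throughout,
	 * so preemptible() is already false and the preempt_disable()
	 * pair looks redundant:
	 */
	local_irq_save(flags);
	stop_one_cpu_nowait(cpu, fn, arg, &work);
	local_irq_restore(flags);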
>
>>>
>>> stop_one_cpu_nowait
>>> ->queues the task into stopper list
>>> -> wake_up_process(stopper)
>>> -> set need_resched
>>> -> stopper runs as early as possible.
>>>
>
--
Thanks and Regards,
Prateek