Message-ID: <2e97c804-c67a-4c92-94c9-d47a6648439c@amd.com>
Date: Fri, 12 Sep 2025 14:18:15 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
CC: <vschneid@...hat.com>, <iii@...ux.ibm.com>, <huschle@...ux.ibm.com>,
<rostedt@...dmis.org>, <dietmar.eggemann@....com>, <vineeth@...byteword.org>,
<jgross@...e.com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
<mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>, <tglx@...utronix.de>, <yury.norov@...il.com>,
<maddy@...ux.ibm.com>, <linux-kernel@...r.kernel.org>,
<linuxppc-dev@...ts.ozlabs.org>, <gregkh@...uxfoundation.org>
Subject: Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt
CPU
Hello Shrikanth,
On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
>
>
> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>> Hello Shrikanth,
>>
>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>> + if (is_cpu_paravirt(cpu))
>>>>> + push_current_from_paravirt_cpu(rq);
>>>>
>>>> Does this mean a paravirt CPU is capable of handling an interrupt but may
>>>> not be continuously available to run a task?
>>>
>>> When I run hackbench, which involves a fair bit of IRQ activity, it moves out.
>>>
>>> For example,
>>>
>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>
>>> 11:31:54 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 11:31:57 AM 598 2.04 0.00 77.55 0.00 18.37 0.00 1.02 0.00 0.00 1.02
>>> 11:31:57 AM 599 1.01 0.00 79.80 0.00 17.17 0.00 1.01 0.00 0.00 1.01
>>> 11:31:57 AM 600 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 99.01
>>> 11:31:57 AM 601 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>> 11:31:57 AM 602 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>>
>>>
>>> There could be some workloads where the IRQs don't move out; those would need an irqbalance change.
>>> Looking into it.
>>>
>>>> Or is the VMM expected to set
>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>> task before yanking it away from the pCPU?
>>>>
>>>
>>> If the vCPU is running something, it is going to run on a pCPU at some point.
>>> The hypervisor will give cycles to this vCPU by preempting some other vCPU.
>>>
>>> The idea is that, using this infra, there should be nothing running on that paravirt vCPU.
>>> That way, collectively, the VMM gets only a limited number of requests for pCPUs, which it
>>> can satisfy without vCPU preemption.
>>
>> Ack! Just wanted to understand the usage.
>>
>> P.S. I remember discussions during last LPC where we could communicate
>> this unavailability via CPU capacity. Was that problematic for some
>> reason? Sorry if I didn't follow this discussion earlier.
>>
>
> Thanks for that question. It gives an opportunity to retrospect.
>
> Yes, that's where we started, but it has a lot of implementation challenges.
> Still an option though.
>
> History up to the current state:
>
> 1. At LPC24 we presented the problem statement and why existing approaches such as hotplug,
> cpuset cgroups or taskset are not viable solutions. Hotplug would have come in handy if its cost were low.
> The overhead of the sched domain rebuild and the serial nature of hotplug make it a non-viable option.
> One of the possible approaches was CPU capacity.
Ack. Is creating an isolated partition on the fly also too expensive?
I don't think creation of that partition is serialized, and it should
achieve a similar result with a single sched-domain rebuild. I'm
hoping the VMM doesn't change the paravirt mask at an alarming rate.
P.S. Some stupid benchmarking on a 256-CPU machine:
mkdir /sys/fs/cgroup/isol/
echo isolated > /sys/fs/cgroup/isol/cpuset.cpus.partition
time for i in {1..1000}; do \
echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
done
real 2m50.016s
user 0m0.198s
sys 1m47.708s
So that is about (170 sec / 2000) ~ 85 ms per cpuset operation.
Definitely more expensive than setting the paravirt mask, but compare that to:
time bash -c '\
for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done;'
real 0m5.046s
user 0m0.014s
sys 0m0.110s
Definitely less expensive than a full hotplug.
>
> 2. Issues with the CPU capacity approach:
> a. Need to make group_misfit_task the highest priority. That alone would break big.LITTLE,
> since it relies on group_misfit while group_overloaded should have higher priority there.
> b. At high concurrency, tasks still moved to those CPUs with capacity = 1.
> c. A lot of scheduler stats would need to be aware of the change in capacity, especially in load balance/wakeup.
Ack. Thinking out loud: Can capacity go to 0 via the HW pressure interface?
Maybe we can toggle the "sched_asym_cpucapacity" static branch without
actually having SD_ASYM_CPUCAPACITY in these special cases to let
asym_fits_cpu() steer away from these zero-capacity CPUs.
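To illustrate, an untested sketch of the check I have in mind
(fits_paravirt_cpu() is a made-up helper; arch_scale_cpu_capacity()
is the existing interface):

	static inline bool fits_paravirt_cpu(unsigned long util, int cpu)
	{
		/*
		 * A CPU whose capacity was dropped to 0 can never fit
		 * any task utilization, so wakeup placement would skip
		 * it without needing SD_ASYM_CPUCAPACITY in the domains.
		 */
		return util < arch_scale_cpu_capacity(cpu);
	}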
> d. In update_group_misfit(), the misfit load would need to be set based on capacity. The current code sets it to 0
> because of the task_fits_cpu() handling.
> e. More challenges in RT.
>
> That's when Tobias introduced a new group type called group_parked.
> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
> It has a relatively cleaner implementation compared to the CPU capacity approach.
>
> It had a few disadvantages too:
> 1. It used to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
> concern.
> 2. It needs a few stats-based changes in update_sg_lb_stats(), which might be tricky in all scenarios.
>
> That's when we were exploring how tasks move out when a CPU goes offline. That happens quite fast too.
> So we tried a similar mechanism, and this is where we are right now.
I agree push is great from that perspective.
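For anyone following along, here is roughly the shape I understand the
series to take (a sketch only, not the actual patch; the
push_paravirt_cpu_stop() callback name is made up, while
stop_one_cpu_nowait() and rq->push_work are the existing stopper pieces):

	static void push_current_from_paravirt_cpu(struct rq *rq)
	{
		/*
		 * Queue stopper work on this CPU: the stopper thread
		 * preempts the current task almost immediately and the
		 * callback can then migrate it off the paravirt CPU,
		 * much like the hotplug path does.
		 */
		stop_one_cpu_nowait(cpu_of(rq), push_paravirt_cpu_stop,
				    rq->curr, &rq->push_work);
	}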
>
>> [..snip..]
>>>>> + local_irq_save(flags);
>>>>> + preempt_disable();
>>>>
>>>> Disabling IRQs implies preemption is disabled.
>>>>
>>>
>>> In most places, stop_one_cpu_nowait() is called with preemption & IRQs disabled.
>>> The stopper runs at the next possible opportunity.
>>
>> But is there any reason to do both local_irq_save() and
>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>
>> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>>
>> so disabling IRQs should be sufficient, right? Or am I missing something?
>>
>
> Commit f0498d2a54e79 ("sched: Fix stop_one_cpu_nowait() vs hotplug") from
> Peter Zijlstra could be the answer you are looking for.
I think in all the cases covered by that commit, task_rq_unlock() would
have enabled interrupts, which is what required that specific pattern, but here
we have preempt_disable() within a local_irq_save() section, which might not
be necessary.
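To make the comparison concrete, roughly (names abbreviated):

	/*
	 * Pattern that commit relies on: preemption stays disabled
	 * across the unlock, which re-enables IRQs, so the stopper
	 * cannot run until the work is queued:
	 */
	preempt_disable();
	task_rq_unlock(rq, p, &rf);	/* re-enables IRQs */
	stop_one_cpu_nowait(cpu, fn, arg, &work);
	preempt_enable();

	/*
	 * versus the hunk here, where IRQs stay disabled throughout,
	 * so preemptible() is already false and the preempt_disable()
	 * pair looks redundant:
	 */
	local_irq_save(flags);
	stop_one_cpu_nowait(cpu, fn, arg, &work);
	local_irq_restore(flags);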
>
>>>
>>> stop_one_cpu_nowait
>>> ->queues the task into stopper list
>>> -> wake_up_process(stopper)
>>> -> set need_resched
>>> -> stopper runs as early as possible.
>>>
>
--
Thanks and Regards,
Prateek