linux-kernel - Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus

Message-ID: <454c61fc-0771-4800-b075-02591bab79b1@arm.com>
Date: Thu, 16 Oct 2025 13:37:37 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Juri Lelli <juri.lelli@...hat.com>, Pingfan Liu <piliu@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
 Ingo Molnar <mingo@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/deadline: Derive root domain from active cpu in
 task's cpus_ptr


On 10/15/25 11:35, Juri Lelli wrote:
> On 14/10/25 21:09, Pingfan Liu wrote:
>> Hi Pierre,
>>
>> Thanks for sharing your perspective.
>>
>> On Sat, Oct 11, 2025 at 12:26 AM Pierre Gondois <pierre.gondois@....com> wrote:
>>>
>>> On 10/6/25 14:12, Juri Lelli wrote:
>>>> On 06/10/25 12:13, Pierre Gondois wrote:
>>>>> On 9/30/25 11:04, Peter Zijlstra wrote:
>>>>>> On Tue, Sep 30, 2025 at 08:20:06AM +0100, Juri Lelli wrote:
>>>>>>
>>>>>>> I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
>>>>>>> tasks (like schedutil [1]). IIUC that is how it is thought to behave
>>>>>>> already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
>>>>>>> it is not "transparent" from a bandwidth tracking point of view.
>>>>>>>
>>>>>>> 1 - https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
>>>>>>> 2 - https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198
>>>>>> Right, I remember that hack. Bit sad its spreading, but this CPPC thing
>>>>>> is very much like the schedutil one, so might as well do that I suppose.
>>>>> IIUC, the sugov thread was switched to deadline to allow frequency updates
>>>>> when deadline tasks start to run. I.e. there should be no point updating the
>>>>> freq. after the deadline task finished running, cf [1] and [2]
>>>>>
>>>>> The CPPC FIE worker should not require to run that quickly as it seems to be
>>>>> more like a freq. maintenance work (the call comes from the sched tick)
>>>>>
>>>>> sched_tick()
>>>>> \-arch_scale_freq_tick() / topology_scale_freq_tick()
>>>>>     \-set_freq_scale() / cppc_scale_freq_tick()
>>>>>       \-irq_work_queue()
>>>> OK, but how much bandwidth is enough for it (on different platforms)?
>>>> Also, I am not sure the worker follows cpusets/root domain changes.
>>>>
>>>>
>>> To share some additional information, I was able to reproduce the issue by
>>> creating as many deadline tasks with a huge bandwidth as the platform
>>> allows:
>>> chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &
>>>
>>> Then kexec to another kernel. The available bandwidth of the root domain
>>> gradually decreases with the number of CPUs unplugged.
>>> At some point, there is not enough bandwidth and an overflow is detected.
>>> (Same call stack as in the original message).
> I seem to agree with Pingfan below, kexec (kernel crash?) is a case
> where all guarantees are out of the window anyway, so really no point in
> keeping track of bandwidth and failing hotplug. Guess we should be
> adding an ad-hoc check/bail for this case.

Yes right

>>> So I'm not sure this is really related to the cppc_fie thread.
>>> I think it's more related to checking the available bandwidth in a context
>>> which is not appropriate. The deadline bandwidth might be lacking when the
>>> platform is reset, but this should not be that important.
>>>
>> I think there are two independent issues.
>>
>> In your experiment, as CPUs are hot-removed one by one, at some point
>> the hot-removal will fail due to insufficient DL bandwidth. There
>> should be a warning message to inform users about what's happening,
>> and users can then remove some DL tasks to continue the CPU
>> hot-removal.
>>
>> Meanwhile, in the kexec case, this checking can be skipped since the
>> system cannot roll back to a working state anyway

Yes right, I meant that:
- when using kexec, the kernel crashes
- when manually unplugging CPUs with:
  `echo 0 > /sys/devices/system/cpu/cpuX/online`
  the kernel returns `write error: Device or resource busy` at some point
  to prevent the DL bandwidth from being reduced too much.
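
To illustrate the second case, here is a minimal user-space sketch (not
kernel code) of why unplugging CPUs one by one is eventually refused.
It assumes that all CPUs have the same capacity and that each CPU
contributes the same maximum DL bandwidth to the root domain. The names
rd_model, offline_would_overflow() and offline_until_refused() are
hypothetical; to_ratio() and BW_SHIFT only mirror the fixed-point
convention used in kernel/sched, and the real check also scales by CPU
capacity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)	/* 100% of one CPU */

/* Fixed-point runtime/period ratio, in the style of the kernel's to_ratio(). */
static uint64_t to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	if (period_ns == 0)
		return 0;
	return (runtime_ns << BW_SHIFT) / period_ns;
}

/* Hypothetical, simplified root domain: equal-capacity CPUs, one shared pool. */
struct rd_model {
	unsigned int nr_cpus;	/* CPUs currently online in the domain */
	uint64_t max_bw;	/* per-CPU DL bandwidth limit (e.g. 95%) */
	uint64_t total_bw;	/* sum of the admitted DL task bandwidths */
};

/* True if removing one CPU leaves less bandwidth than is already admitted. */
static bool offline_would_overflow(const struct rd_model *rd)
{
	if (rd->nr_cpus == 0)
		return true;
	return rd->max_bw * (rd->nr_cpus - 1) < rd->total_bw;
}

/* Unplug CPUs one at a time; return how many were removed before a refusal. */
static unsigned int offline_until_refused(struct rd_model *rd)
{
	unsigned int removed = 0;

	assert(rd);
	while (rd->nr_cpus > 0 && !offline_would_overflow(rd)) {
		rd->nr_cpus--;
		removed++;
	}
	return removed;
}
```

With 8 CPUs, a 95% per-CPU limit and three tasks each using 100% of a
CPU, this model lets 4 CPUs go offline and refuses the 5th, which is
the point where the real kernel would return the busy error above.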

------

I could not reproduce the issue you reported initially. I am using a
Radxa Orion O6, which has a cppc_fie worker.

AFAIU it should not be possible to add/remove bandwidth to/from the
def_root_domain.
During kexec, the following happens for all CPUs:
\-dl_bw_manage(dl_bw_req_deactivate, cpu)
   \- // Check if there is enough bandwidth
\-dl_clear_root_domain_cpu(cpu)
   \- // Recompute the available bandwidth based on the remaining CPUs

So I'm not sure I understand why accounting some bandwidth to the
def_root_domain is problematic in practice, as the def_root_domain seems
to have some DL bandwidth.

IIUC the problem seems to be that, for some reason, there is not enough
bandwidth in the def_root_domain as well, which triggers the bandwidth
overflow detection.

>>
>> Thanks,
>>
>> Pingfan
>>> ---
>>>
>>> Question:
>>> Since the cppc_fie worker doesn't have the SCHED_FLAG_SUGOV flag,
>>> is this comment actually correct ?
>>> /*
>>>    * Fake (unused) bandwidth; workaround to "fix"
>>>    * priority inheritance.
>>>    */
>>>
>>> ---
>>>
>>> On a non-deadline related topic, the CPPC driver creates a cppc_fie worker
>>> in case the CPPC counters used to estimate the current frequency are in PCC
>>> channels. Accessing these channels requires going through sleeping sections,
>>> that's why a worker is used.
>>>
>>> However, CPPC counters might be accessed through FFH, which doesn't go
>>> through sleeping sections. In such a case, the cppc_fie worker is never used
>>> and never removed, so it would be nice to remove it.
>>>
>