Message-ID: <a0f65eab-0091-4590-9037-c00bd5d0060c@arm.com>
Date: Thu, 16 Oct 2025 16:26:39 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Pingfan Liu <piliu@...hat.com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Peter Zijlstra
<peterz@...radead.org>, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/deadline: Derive root domain from active cpu in
task's cpus_ptr
On 10/16/25 14:17, Pingfan Liu wrote:
> On Thu, Oct 16, 2025 at 7:38 PM Pierre Gondois <pierre.gondois@....com> wrote:
>>
>> On 10/15/25 11:35, Juri Lelli wrote:
>>> On 14/10/25 21:09, Pingfan Liu wrote:
>>>> Hi Pierre,
>>>>
>>>> Thanks for sharing your perspective.
>>>>
>>>> On Sat, Oct 11, 2025 at 12:26 AM Pierre Gondois <pierre.gondois@....com> wrote:
>>>>> On 10/6/25 14:12, Juri Lelli wrote:
>>>>>> On 06/10/25 12:13, Pierre Gondois wrote:
>>>>>>> On 9/30/25 11:04, Peter Zijlstra wrote:
>>>>>>>> On Tue, Sep 30, 2025 at 08:20:06AM +0100, Juri Lelli wrote:
>>>>>>>>
>>>>>>>>> I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
>>>>>>>>> tasks (like schedutil [1]). IIUC that is how it is thought to behave
>>>>>>>>> already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
>>>>>>>>> it is not "transparent" from a bandwidth tracking point of view.
>>>>>>>>>
>>>>>>>>> 1 - https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
>>>>>>>>> 2 - https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198
>>>>>>>> Right, I remember that hack. Bit sad it's spreading, but this CPPC thing
>>>>>>>> is very much like the schedutil one, so might as well do that I suppose.
>>>>>>> IIUC, the sugov thread was switched to deadline to allow frequency updates
>>>>>>> when deadline tasks start to run. I.e. there should be no point updating the
>>>>>>> freq. after the deadline task finished running, cf [1] and [2]
>>>>>>>
>>>>>>> The CPPC FIE worker should not need to run that quickly, as it seems to be
>>>>>>> more like freq. maintenance work (the call comes from the sched tick)
>>>>>>>
>>>>>>> sched_tick()
>>>>>>> \-arch_scale_freq_tick() / topology_scale_freq_tick()
>>>>>>> \-set_freq_scale() / cppc_scale_freq_tick()
>>>>>>> \-irq_work_queue()
>>>>>> OK, but how much bandwidth is enough for it (on different platforms)?
>>>>>> Also, I am not sure the worker follows cpusets/root domain changes.
>>>>>>
>>>>>>
>>>>> To share some additional information, I could reproduce the issue by
>>>>> creating as many deadline tasks with a huge bandwidth as the platform
>>>>> allows:
>>>>> chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &
>>>>>
>>>>> Then kexec to another kernel. The available bandwidth of the root domain
>>>>> gradually decreases with the number of CPUs unplugged.
>>>>> At some point, there is not enough bandwidth and an overflow is detected.
>>>>> (Same call stack as in the original message).
>>> I seem to agree with Pingfan below, kexec (kernel crash?) is a case
>>> where all guarantees are out of the window anyway, so really no point in
>>> keeping track of bandwidth and failing hotplug. Guess we should be
>>> adding an ad-hoc check/bail for this case.
>> Yes right
>>
>>>>> So I'm not sure this is really related to the cppc_fie thread.
>>>>> I think it's more related to checking the available bandwidth in a
>>>>> context which is not appropriate. The deadline bandwidth might be
>>>>> insufficient when the platform is reset, but this should not be that
>>>>> important.
>>>>>
>>>> I think there are two independent issues.
>>>>
>>>> In your experiment, as CPUs are hot-removed one by one, at some point
>>>> the hot-removal will fail due to insufficient DL bandwidth. There
>>>> should be a warning message to inform users about what's happening,
>>>> and users can then remove some DL tasks to continue the CPU
>>>> hot-removal.
>>>>
>>>> Meanwhile, in the kexec case, this checking can be skipped since the
>>>> system cannot roll back to a working state anyway
>> Yes right, I meant that:
>> - when using kexec, the kernel crashes
>> - when manually unplugging CPUs with
>> `echo 0 > /sys/devices/system/cpu/cpuX/online`, the kernel returns
>> `write error: Device or resource busy` at some point to prevent the
>> DL bandwidth from being reduced too much.
>>
>> ------
>>
>> I could not reproduce the issue you reported initially. I am using
>> a radxa orion o6
>> which has a cppc_fie worker.
>>
> I suspect that you are missing something like
> "isolcpus=managed_irq,domain,1-71,73-143" in the kernel command line.
> That is critical to reproduce the bug. In that case, cpus
> [1,71],[73,143] are in def_root_domain, while cpu0 and cpu72 are in
> the other, newly created root_domain. The bug is triggered if the
> cppc_fie worker is scheduled on cpu72.
I could finally reproduce the same issue.
The orion o6 has only 12 CPUs. I use:
isolcpus=managed_irq,domain,1-9,11
Just before kexec, I check that the cppc_fie worker is on CPU10.
During kexec, I can see what you describe:
Upon unplugging CPU10, in this order:
- dl_bw_manage() is called and no overflow is detected, as the
non-default root domain is used
- def_root_domain is attached to CPU10
- dl_add_task_root_domain(CPU10) is called for cppc_fie
- the resulting root domain for CPU10 is def_root_domain
- dl_clear_root_domain_cpu() is called and a new bandwidth
is computed for the CPUs that are part of def_root_domain, i.e. CPU11
(I don't know how correct it is to do that).
Upon unplugging CPU11:
- dl_bw_manage() is called, but now that there is only one CPU left in
def_root_domain, an overflow is detected
- dl_bw_deactivate() returns an error code and the platform crashes
I could not see the issue previously as the cppc_fie worker was not on
the penultimate CPU (say CPU7). When unplugging CPU7:
- there is enough bandwidth on the CPUs that are part of def_root_domain,
i.e. CPU8, 9, 10, 11, so there is no overflow
- I assume the cppc_fie worker had time to wake up or be migrated
to CPU0 before reaching the last isolated CPU ...
Even when reducing the frequency at which the worker is used, the issue
seems to only be triggered when CPUs 1-9,11 are isolated...
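
To make sure we are looking at the same path, here is roughly how I read
dl_add_task_root_domain() (a simplified sketch from memory of
kernel/sched/deadline.c, not the verbatim code): the blocked task's
bandwidth is re-added through task_rq(p)->rd, which for an offline CPU is
def_root_domain:

  /* Simplified sketch, not the verbatim kernel code */
  void dl_add_task_root_domain(struct task_struct *p)
  {
          struct rq_flags rf;
          struct rq *rq;
          struct dl_bw *dl_b;

          /* p is blocked; task_rq(p) may be the rq of an offline CPU */
          rq = __task_rq_lock(p, &rf);

          /* For an offline CPU, rq->rd points to def_root_domain */
          dl_b = &rq->rd->dl_bw;

          /* So the bandwidth ends up accounted in def_root_domain */
          __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));

          task_rq_unlock(rq, p, &rf);
  }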
>
>> AFAIU it should not be possible to add/remove bandwidth to the
>> def_root_domain.
>> During kexec, the following happens for all CPUs:
>> \-dl_bw_manage(dl_bw_req_deactivate, cpu)
>> \- // Check if there is enough bandwidth
>> \-dl_clear_root_domain_cpu(cpu)
>> \- // Recompute the available bandwidth based on the remaining CPUs
>>
>> So I'm not sure I understand why accounting some bandwidth to the
>> def_root_domain is problematic in practice, as the def_root_domain
>> seems to have some DL bandwidth.
>>
>> IIUC the problem seems to be that for some reason there is not enough
>> bandwidth in the def_root_domain as well, which triggers the bandwidth
>> overflow detection.
>>
> The problem is caused by accounting the blocked-state DL task's
> bandwidth to the wrong root_domain. Let me refer to the previous
> example. CPUs [1,71],[73,143] belong to def_root_domain, cpu0 and
> cpu72 belong to root_domainA. In the kernel, the root_domain is
> tracked in cpu_rq(cpu)->rd. But for an offline rq, rq->rd points to
> def_root_domain. Hence the reserved bandwidth of cppc_fie is wrongly
> accounted to def_root_domain instead of root_domainA. So finally,
> cpu143 refuses to be offlined since def_root_domain indicates there
> is still reserved DL bandwidth.
Yes right, I think we agree on what is happening
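
To put the direction of $SUBJECT in code, I understand it as something
like the sketch below (illustrative only, not the actual patch; reusing
p/rq/dl_b from the dl_add_task_root_domain() context above): pick the
root domain from a still-active CPU in the task's cpus_ptr instead of
blindly using task_rq(p)->rd:

  /* Illustrative sketch only, not the actual patch */
  struct root_domain *rd = rq->rd;
  int cpu = cpumask_any_and(p->cpus_ptr, cpu_active_mask);

  /*
   * If an active CPU remains in the task's affinity mask, use its root
   * domain; otherwise fall back to the task's rq, whose rd may be
   * def_root_domain if the CPU is offline.
   */
  if (cpu < nr_cpu_ids)
          rd = cpu_rq(cpu)->rd;

  dl_b = &rd->dl_bw;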
>
>
> Thanks,
>
> Pingfan
>>>> Thanks,
>>>>
>>>> Pingfan
>>>>> ---
>>>>>
>>>>> Question:
>>>>> Since the cppc_fie worker doesn't have the SCHED_FLAG_SUGOV flag,
>>>>> is this comment actually correct ?
>>>>> /*
>>>>> * Fake (unused) bandwidth; workaround to "fix"
>>>>> * priority inheritance.
>>>>> */
>>>>>
>>>>> ---
>>>>>
>>>>> On a non-deadline related topic, the CPPC driver creates a cppc_fie
>>>>> worker in case the CPPC counters used to estimate the current frequency
>>>>> are in PCC channels. Accessing these channels requires going through
>>>>> sleeping sections; that's why a worker is used.
>>>>>
>>>>> However, CPPC counters might be accessed through FFH, which doesn't go
>>>>> through sleeping sections. In such a case, the cppc_fie worker is never
>>>>> used and never removed, so it would be nice to remove it.
>>>>>
>