linux-kernel - Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus

Message-ID: <CAF+s44TT3f0Tp_bXttx-Uwf-QdDUi47Lb7PDqJ9iUg-AUSwPxQ@mail.gmail.com>
Date: Thu, 16 Oct 2025 20:17:28 +0800
From: Pingfan Liu <piliu@...hat.com>
To: Pierre Gondois <pierre.gondois@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/deadline: Derive root domain from active cpu in
 task's cpus_ptr

On Thu, Oct 16, 2025 at 7:38 PM Pierre Gondois <pierre.gondois@....com> wrote:
>
>
> On 10/15/25 11:35, Juri Lelli wrote:
> > On 14/10/25 21:09, Pingfan Liu wrote:
> >> Hi Pierre,
> >>
> >> Thanks for sharing your perspective.
> >>
> >> On Sat, Oct 11, 2025 at 12:26 AM Pierre Gondois <pierre.gondois@....com> wrote:
> >>>
> >>> On 10/6/25 14:12, Juri Lelli wrote:
> >>>> On 06/10/25 12:13, Pierre Gondois wrote:
> >>>>> On 9/30/25 11:04, Peter Zijlstra wrote:
> >>>>>> On Tue, Sep 30, 2025 at 08:20:06AM +0100, Juri Lelli wrote:
> >>>>>>
> >>>>>>> I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
> >>>>>>> task (like schedutil [1]). IIUC that is how it is thought to behave
> >>>>>>> already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
> >>>>>>> it is not "transparent" from a bandwidth tracking point of view.
> >>>>>>>
> >>>>>>> 1 -https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
> >>>>>>> 2 -https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198
> >>>>>> Right, I remember that hack. Bit sad it's spreading, but this CPPC thing
> >>>>>> is very much like the schedutil one, so might as well do that I suppose.
> >>>>> IIUC, the sugov thread was switched to deadline to allow frequency updates
> >>>>> when deadline tasks start to run. I.e. there should be no point updating the
> >>>>> freq. after the deadline task finished running, cf [1] and [2]
> >>>>>
> >>>>> The CPPC FIE worker should not need to run that quickly, as it seems to be
> >>>>> more like freq. maintenance work (the call comes from the sched tick)
> >>>>>
> >>>>> sched_tick()
> >>>>> \-arch_scale_freq_tick() / topology_scale_freq_tick()
> >>>>>     \-set_freq_scale() / cppc_scale_freq_tick()
> >>>>>       \-irq_work_queue()
> >>>> OK, but how much bandwidth is enough for it (on different platforms)?
> >>>> Also, I am not sure the worker follows cpusets/root domain changes.
> >>>>
> >>>>
> >>> To share some additional information, I could reproduce the issue by
> >>> creating as many deadline tasks with a huge bandwidth as the platform
> >>> allows:
> >>> chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &
> >>>
> >>> Then kexec to another kernel. The available bandwidth of the root domain
> >>> gradually decreases with the number of CPUs unplugged.
> >>> At some point, there is not enough bandwidth and an overflow is detected.
> >>> (Same call stack as in the original message).
> > I seem to agree with Pingfan below, kexec (kernel crash?) is a case
> > where all guarantees are out of the window anyway, so really no point in
> > keeping track of bandwidth and failing hotplug. Guess we should be
> > adding an ad-hoc check/bail for this case.
>
> Yes right
>
> >>> So I'm not sure this is really related to the cppc_fie thread.
> >>> I think it's more related to checking the available bandwidth in a context
> >>> which is not appropriate. Deadline bandwidth might be insufficient when
> >>> the platform is reset, but this should not matter much.
> >>>
> >> I think there are two independent issues.
> >>
> >> In your experiment, as CPUs are hot-removed one by one, at some point
> >> the hot-removal will fail due to insufficient DL bandwidth. There
> >> should be a warning message to inform users about what's happening,
> >> and users can then remove some DL tasks to continue the CPU
> >> hot-removal.
> >>
> >> Meanwhile, in the kexec case, this checking can be skipped since the
> >> system cannot roll back to a working state anyway
>
> Yes right, I meant that:
> - when using kexec, the kernel crashes
> - when manually unplugging CPUs with:
>   `echo 0 > /sys/devices/system/cpu/cpuX/online`
>   the kernel returns `write error: Device or resource busy` at some
>   point, to prevent the DL bandwidth from being reduced too much.
>
> ------
>
> I could not reproduce the issue you reported initially. I am using
> a Radxa Orion O6, which has a cppc_fie worker.
>

I suspect you are missing something like
"isolcpus=managed_irq,domain,1-71,73-143" on the kernel command line.
That is critical to reproduce the bug. With it, CPUs 1-71 and 73-143
are in def_root_domain, while cpu0 and cpu72 are in another, newly
created root_domain. The bug is triggered if the cppc_fie worker is
scheduled on cpu72.

> AFAIU it should not be possible to add/remove bandwidth to the
> def_root_domain.
> During kexec, the following is happening to all CPUs:
> \-dl_bw_manage(dl_bw_req_deactivate, cpu)
>    \- // Check if there is enough bandwidth
> \-dl_clear_root_domain_cpu(cpu)
>    \- // Recompute the available bandwidth based on the remaining CPUs
>
> So I'm not sure I understand why accounting some bandwidth to the
> def_root_domain is problematic in practice, as the def_root_domain
> seems to have some DL bandwidth.
>
> IIUC the problem seems to be that, for some reason, there is not enough
> bandwidth in the def_root_domain as well, which triggers the bandwidth
> overflow detection.
>

The problem is caused by accounting a blocked DL task's bandwidth to
the wrong root_domain. Take the previous example: CPUs 1-71 and 73-143
belong to def_root_domain, while cpu0 and cpu72 belong to root_domainA.
In the kernel, the root_domain is tracked in cpu_rq(cpu)->rd, but for
an offline rq, rq->rd points to def_root_domain. Hence the reserved
bandwidth of cppc_fie is wrongly accounted to def_root_domain instead
of root_domainA. In the end, cpu143 refuses to go offline because
def_root_domain appears to still hold reserved DL bandwidth.
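
To make the mis-accounting concrete, here is a rough userspace toy, not
kernel code. It assumes a 4-CPU machine where cpu0 and cpu2 stand in for
cpu0 and cpu72 (root_domainA) and cpu1 and cpu3 stand in for the isolated
CPUs (def_root_domain). All toy_* and rd_from_* names, and the plain
bitmask used in place of cpus_ptr, are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Toy stand-ins for the kernel structures; only the fields needed here. */
struct root_domain {
	const char *name;
};

struct rq {
	struct root_domain *rd;
	bool online;
};

static struct root_domain def_root_domain = { "def_root_domain" };
static struct root_domain root_domain_a = { "root_domainA" };
static struct rq runqueues[NR_CPUS];

/* Hypothetical setup: even CPUs in root_domainA, odd CPUs isolated. */
static void toy_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		runqueues[cpu].online = true;
		runqueues[cpu].rd = (cpu % 2 == 0) ? &root_domain_a
						   : &def_root_domain;
	}
}

/* Model of the behaviour described above: an offline rq points to
 * def_root_domain, whatever domain it belonged to before. */
static void toy_cpu_offline(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	runqueues[cpu].online = false;
	runqueues[cpu].rd = &def_root_domain;
}

/* Lookup that trusts the rq of the CPU the blocked task last ran on. */
static struct root_domain *rd_from_task_cpu(int task_cpu)
{
	assert(task_cpu >= 0 && task_cpu < NR_CPUS);
	return runqueues[task_cpu].rd;
}

/* Lookup that picks the first still-online CPU in the task's affinity
 * mask (bit N set means cpuN is allowed). Returns NULL if none is left. */
static struct root_domain *rd_from_active_cpu(unsigned int cpus_mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if ((cpus_mask & (1u << cpu)) && runqueues[cpu].online)
			return runqueues[cpu].rd;
	}
	return NULL;
}
```

In this toy, after toy_init() and toy_cpu_offline(2), a blocked task
that last ran on cpu2 with affinity mask 0x5 (cpu0 and cpu2) resolves to
def_root_domain through rd_from_task_cpu(2), but to root_domainA through
rd_from_active_cpu(0x5). The second lookup is the idea in the patch
subject, in a much simplified form.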


Thanks,

Pingfan
> >>
> >> Thanks,
> >>
> >> Pingfan
> >>> ---
> >>>
> >>> Question:
> >>> Since the cppc_fie worker doesn't have the SCHED_FLAG_SUGOV flag,
> >>> is this comment actually correct ?
> >>> /*
> >>>    * Fake (unused) bandwidth; workaround to "fix"
> >>>    * priority inheritance.
> >>>    */
> >>>
> >>> ---
> >>>
> >>> On a non-deadline related topic, the CPPC driver creates a cppc_fie
> >>> worker in case the CPPC counters used to estimate the current
> >>> frequency are in PCC channels. Accessing these channels requires
> >>> going through sleeping sections, which is why a worker is used.
> >>>
> >>> However, CPPC counters might be accessed through FFH, which doesn't go
> >>> through sleeping sections. In that case, the cppc_fie worker is never
> >>> used and never removed, so it would be nice to remove it.
> >>>
> >
>