Message-ID: <CAF+s44QOQHwjaaCiFzi_vJRXvDusJCKpEyr85=ZHNjnD7JSO4w@mail.gmail.com>
Date: Fri, 17 Oct 2025 15:08:34 +0800
From: Pingfan Liu <piliu@...hat.com>
To: Pierre Gondois <pierre.gondois@....com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root
domain when hot-unplug
Hi Pierre,
Thanks for your careful testing and analysis. Please see my comments below.
On Thu, Oct 16, 2025 at 11:30 PM Pierre Gondois <pierre.gondois@....com> wrote:
>
> Hello Pingfan,
>
> On 10/16/25 14:00, Pingfan Liu wrote:
> > When testing kexec-reboot on a 144-CPU machine with
> > isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
> > encountered the following bug:
> >
> > [ 97.114759] psci: CPU142 killed (polled 0 ms)
> > [ 97.333236] Failed to offline CPU143 - error=-16
> > [ 97.333246] ------------[ cut here ]------------
> > [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> > [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > [ 97.438028] sp : ffff800097c6b9a0
> > [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > [ 97.514379] Call trace:
> > [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> > [ 97.521769] machine_shutdown+0x20/0x38
> > [ 97.525693] kernel_kexec+0xc4/0xf0
> > [ 97.529260] __do_sys_reboot+0x24c/0x278
> > [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> > [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> > [ 97.542179] do_el0_svc+0xb0/0xe8
> > [ 97.545562] el0_svc+0x44/0x1d0
> > [ 97.548772] el0t_64_sync_handler+0x120/0x130
> > [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> > [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > [ 97.563191] ---[ end trace 0000000000000000 ]---
> > [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > [ 97.608502] PHYS_OFFSET: 0x80000000
> > [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > [ 97.617580] Memory Limit: none
> > [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> >
> > Tracking down this issue, I found that dl_bw_deactivate() returned
> > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > When a CPU is inactive, its rd is set to def_root_domain. For a
> > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > migrated to CPU0, and its task_rq() information is stale. As a result,
> > its bandwidth is wrongly accounted into def_root_domain during domain
> > rebuild.
> >
> > The following rules hold for the deadline sub-system:
> > 1. Any CPU belongs to a unique root domain at a given time.
> > 2. The DL bandwidth checker ensures that the root domain has active CPUs,
> > and for an active CPU, cpu_rq(cpu)->rd always tracks a valid root domain.
> >
> > Now, let's examine the blocked-state task P.
> > If P is attached to a cpuset that is a partition root, it is
> > straightforward to find an active CPU.
> > If P is attached to a cpuset that has since changed from 'root' to 'member',
> > the active CPUs are grouped into the parent root domain. Naturally, the
> > CPUs' capacity and reserved DL bandwidth are taken into account in the
> > parent root domain. (In practice, it may be unsafe to attach P to an
> > arbitrary root domain, since that domain may lack sufficient DL
> > bandwidth for P.) Again, it is straightforward to find an active CPU in
> > the parent root domain. (Here, "parent root domain" means the root domain
> > of the first ancestor cpuset that is a partition root.)
> >
> > This patch walks up the cpuset hierarchy to find the active CPUs in P's
> > root domain and retrieves a valid rd from cpu_rq(cpu)->rd.
> >
> > Signed-off-by: Pingfan Liu <piliu@...hat.com>
> > Cc: Ingo Molnar <mingo@...hat.com>
> > Cc: Peter Zijlstra <peterz@...radead.org>
> > Cc: Juri Lelli <juri.lelli@...hat.com>
> > Cc: Pierre Gondois <pierre.gondois@....com>
> > Cc: Vincent Guittot <vincent.guittot@...aro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> > Cc: Steven Rostedt <rostedt@...dmis.org>
> > Cc: Ben Segall <bsegall@...gle.com>
> > Cc: Mel Gorman <mgorman@...e.de>
> > Cc: Valentin Schneider <vschneid@...hat.com>
> > To: linux-kernel@...r.kernel.org
> > ---
> > include/linux/cpuset.h | 6 ++++++
> > kernel/cgroup/cpuset.c | 15 +++++++++++++++
> > kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
> > 3 files changed, 45 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index 2ddb256187b51..478ae68bdfc8f 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
> >
> > extern void cpuset_print_current_mems_allowed(void);
> > extern void cpuset_reset_sched_domains(void);
> > +extern struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);
> >
> > /*
> > * read_mems_allowed_begin is required when making decisions involving
> > @@ -276,6 +277,11 @@ static inline void cpuset_reset_sched_domains(void)
> > partition_sched_domains(1, NULL, NULL);
> > }
> >
> > +static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> > +{
> > + return cpu_active_mask;
> > +}
> > +
> > static inline void cpuset_print_current_mems_allowed(void)
> > {
> > }
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 27adb04df675d..25356d3f9d635 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1102,6 +1102,21 @@ void cpuset_reset_sched_domains(void)
> > mutex_unlock(&cpuset_mutex);
> > }
> >
> > +/* Caller must hold the RCU read lock */
> > +struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> > +{
> > + struct cpuset *cs;
> > +
> > + cs = task_cs(p);
> > + while (cs != &top_cpuset) {
> > + if (is_sched_load_balance(cs))
> > + break;
> > + cs = parent_cs(cs);
> > + }
> > +
> > + return cs->effective_cpus;
> > +}
> > +
> > /**
> > * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
> > * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 72c1f72463c75..fe0aec279c19a 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
> > struct rq_flags rf;
> > struct rq *rq;
> > struct dl_bw *dl_b;
> > + unsigned int cpu;
> > + struct cpumask *msk;
> >
> > raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> > if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> > @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
> > return;
> > }
> >
> > - rq = __task_rq_lock(p, &rf);
> > -
> > + /* Prevent races between CPU hotplug and partition_root_state changes */
> > + lockdep_assert_cpus_held();
> > + /*
> > + * If @p is in blocked state, task_cpu() may not be active. In that
> > + * case, rq->rd does not track a correct root_domain. On the other hand,
> > + * @p must belong to a root_domain at any given time, which must have
> > + * an active rq whose rq->rd tracks the valid root domain.
> > + */
> > + msk = cpuset_task_rd_effective_cpus(p);
>
> For the cppc_fie worker, msk doesn't seem to exclude the isolated CPUs.
You are right. I was only looking at cpuset.partition_root_state.
But a root_domain can be created from two sources:
cpuset.partition_root_state and the isolcpus="domain" cmdline. My patch
overlooks the latter, so in the end it returns
top_cpuset.effective_cpus.
I think that before diving into cpuset,
housekeeping_cpumask(HK_TYPE_DOMAIN) should be used as the first
filter. With it, the isolated CPUs are set aside and the remaining
CPUs obey the control of cpuset.
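Roughly, I have something like the following untested sketch in mind
(dl_task_active_rd_cpu() is only a placeholder name here, and it assumes
<linux/sched/isolation.h> is included for housekeeping_cpumask()):

/* Caller holds the RCU read lock and cpu_hotplug_lock, as in v2 */
static unsigned int dl_task_active_rd_cpu(struct task_struct *p)
{
	const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_DOMAIN);
	struct cpumask *msk = cpuset_task_rd_effective_cpus(p);
	unsigned int cpu;

	/* Set the isolated CPUs aside first; the rest follow cpuset. */
	for_each_cpu_and(cpu, msk, hk) {
		if (cpu_active(cpu))
			return cpu;
	}
	/* No active housekeeping CPU found; the caller can still BUG_ON this. */
	return nr_cpu_ids;
}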
> The patch seems to work on my setup, but only because the first active
> CPU is selected. CPU0 is likely the primary CPU which is offlined last.
>
> IIUC, this patch should work even if we select the last CPU of the
> resulting mask,
> but it fails on my setup:
>
> cpumask_and(msk, cpu_active_mask, msk0);
> cpu = cpumask_last(msk);
>
Good catch; again, I really appreciate your careful testing and analysis.
> ------
>
> Also, just to note (as this might be another topic): the patch doesn't
> solve the case where many deadline tasks are created first:
> chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &
>
> and we then kexec to another Image
>
Yes, that case is not fixed by this patch; I have another draft patch
for that issue.
Thanks,
Pingfan
> > + cpu = cpumask_first_and(cpu_active_mask, msk);
> > + /*
> > + * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> > + * check prevents CPU hot removal from deactivating all CPUs in that
> > + * domain.
> > + */
> > + BUG_ON(cpu >= nr_cpu_ids);
> > + rq = cpu_rq(cpu);
> > + /*
> > + * This point is under the protection of cpu_hotplug_lock. Hence
> > + * rq->rd is stable.
> > + */
> > dl_b = &rq->rd->dl_bw;
> > raw_spin_lock(&dl_b->lock);
> > -
> > __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> > -
> > raw_spin_unlock(&dl_b->lock);
> > -
> > - task_rq_unlock(rq, p, &rf);
> > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> > }
> >
> > void dl_clear_root_domain(struct root_domain *rd)
>