linux-kernel - Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <91eb613a-2e29-4e1d-b78f-741c5fb88e90@arm.com>
Date: Thu, 16 Oct 2025 17:29:19 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Pingfan Liu <piliu@...hat.com>, linux-kernel@...r.kernel.org
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root
 domain when hot-unplug

Hello Pingfan,

On 10/16/25 14:00, Pingfan Liu wrote:
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
>
> [   97.114759] psci: CPU142 killed (polled 0 ms)
> [   97.333236] Failed to offline CPU143 - error=-16
> [   97.333246] ------------[ cut here ]------------
> [   97.342682] kernel BUG at kernel/cpu.c:1569!
> [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> [   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> [   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> [   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> [   97.438028] sp : ffff800097c6b9a0
> [   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> [   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> [   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> [   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> [   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> [   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> [   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> [   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> [   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> [   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> [   97.514379] Call trace:
> [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> [   97.521769]  machine_shutdown+0x20/0x38
> [   97.525693]  kernel_kexec+0xc4/0xf0
> [   97.529260]  __do_sys_reboot+0x24c/0x278
> [   97.533272]  __arm64_sys_reboot+0x2c/0x40
> [   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
> [   97.542179]  do_el0_svc+0xb0/0xe8
> [   97.545562]  el0_svc+0x44/0x1d0
> [   97.548772]  el0t_64_sync_handler+0x120/0x130
> [   97.553222]  el0t_64_sync+0x1a4/0x1a8
> [   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> [   97.563191] ---[ end trace 0000000000000000 ]---
> [   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> [   97.608502] PHYS_OFFSET: 0x80000000
> [   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> [   97.617580] Memory Limit: none
> [   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
>
> Tracking down this issue, I found that dl_bw_deactivate() returned
> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> When a CPU is inactive, its rd is set to def_root_domain. For an
> blocked-state deadline task (in this case, "cppc_fie"), it was not
> migrated to CPU0, and its task_rq() information is stale. As a result,
> its bandwidth is wrongly accounted into def_root_domain during domain
> rebuild.
>
> The following rules stand for deadline sub-system:
>    -1.any cpu belongs to a unique root domain at a given time
>    -2.DL bandwidth checker ensures that the root domain has active cpus.
> And for active cpu, cpu_rq(cpu)->rd always tracks a valid root domain.
>
> Now, let's examine the blocked-state task P.
> If P is attached to a cpuset that is a partition root, it is
> straightforward to find an active CPU.
> If P is attached to a cpuset which later has changed from 'root' to 'member',
> the active CPUs are grouped into the parent root domain. Naturally, the
> CPUs' capacity and reserved DL bandwidth are taken into account in the
> parent root domain. (In practice, it may be unsafe to attach P to an
> arbitrary root domain, since that domain may lack sufficient DL
> bandwidth for P.) Again, it is straightforward to find an active CPU in
> the parent root domain. (parent root domain means the first ancestor
> cpuset which is partition root)
>
> This patch walks up the cpuset hierarchy to find the active CPUs in P's
> root domain and retrieves valid rd from cpu_rq(cpu)->rd.
>
> Signed-off-by: Pingfan Liu <piliu@...hat.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Pierre Gondois <pierre.gondois@....com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Valentin Schneider <vschneid@...hat.com>
> To: linux-kernel@...r.kernel.org
> ---
>   include/linux/cpuset.h  |  6 ++++++
>   kernel/cgroup/cpuset.c  | 15 +++++++++++++++
>   kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
>   3 files changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b51..478ae68bdfc8f 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
>   
>   extern void cpuset_print_current_mems_allowed(void);
>   extern void cpuset_reset_sched_domains(void);
> +extern struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);
>   
>   /*
>    * read_mems_allowed_begin is required when making decisions involving
> @@ -276,6 +277,11 @@ static inline void cpuset_reset_sched_domains(void)
>   	partition_sched_domains(1, NULL, NULL);
>   }
>   
> +static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> +{
> +	return cpu_active_mask;
> +}
> +
>   static inline void cpuset_print_current_mems_allowed(void)
>   {
>   }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 27adb04df675d..25356d3f9d635 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1102,6 +1102,21 @@ void cpuset_reset_sched_domains(void)
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> +/* caller hold RCU read lock */
> +struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> +{
> +	struct cpuset *cs;
> +
> +	cs = task_cs(p);
> +	while (cs != &top_cpuset) {
> +		if (is_sched_load_balance(cs))
> +			break;
> +		cs = parent_cs(cs);
> +	}
> +
> +	return cs->effective_cpus;
> +}
> +
>   /**
>    * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
>    * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 72c1f72463c75..fe0aec279c19a 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
>   	struct rq_flags rf;
>   	struct rq *rq;
>   	struct dl_bw *dl_b;
> +	unsigned int cpu;
> +	struct cpumask *msk;
>   
>   	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>   	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
>   		return;
>   	}
>   
> -	rq = __task_rq_lock(p, &rf);
> -
> +	/* prevent race among cpu hotplug, changing of partition_root_state */
> +	lockdep_assert_cpus_held();
> +	/*
> +	 * If @p is in blocked state, task_cpu() may be not active. In that
> +	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> +	 * @p must belong to an root_domain at any given time, which must have
> +	 * active rq, whose rq->rd traces the valid root domain.
> +	 */
> +	msk = cpuset_task_rd_effective_cpus(p);

For the cppc_fie worker, msk doesn't seem to exclude the isolated CPUs.
The patch seems to work on my setup, but only because the first active
CPU is selected. CPU0 is likely the primary CPU which is offlined last.

IIUC, this patch should work even if we select the last CPU of resulting 
mask,
but it fails on my setup:

cpumask_and(msk, cpu_active_mask, msk0);
cpu = cpumask_last(msk);

------

Also, just to note (as this might be another topic), but the patch 
doesn't solve
the case where many deadline tasks are created first:
   chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &

and we then kexec to another Image

> +	cpu = cpumask_first_and(cpu_active_mask, msk);
> +	/*
> +	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> +	 * check prevents CPU hot removal from deactivating all CPUs in that
> +	 * domain.
> +	 */
> +	BUG_ON(cpu >= nr_cpu_ids);
> +	rq = cpu_rq(cpu);
> +	/*
> +	 * This point is under the protection of cpu_hotplug_lock. Hence
> +	 * rq->rd is stable.
> +	 */
>   	dl_b = &rq->rd->dl_bw;
>   	raw_spin_lock(&dl_b->lock);
> -
>   	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> -
>   	raw_spin_unlock(&dl_b->lock);
> -
> -	task_rq_unlock(rq, p, &rf);
> +	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>   }
>   
>   void dl_clear_root_domain(struct root_domain *rd)