linux-kernel - Re: [PATCHv3] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug

Message-ID: <aPY6VeMfcu_iddY4@fedora>
Date: Mon, 20 Oct 2025 21:34:13 +0800
From: Pingfan Liu <piliu@...hat.com>
To: Juri Lelli <juri.lelli@...hat.com>
Cc: Waiman Long <llong@...hat.com>, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Koutný <mkoutny@...e.com>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Pierre Gondois <pierre.gondois@....com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCHv3] sched/deadline: Walk up cpuset hierarchy to decide
 root domain when hot-unplug

Hi Juri,

Thanks for following up on this topic. Please see my comments below.

On Mon, Oct 20, 2025 at 08:03:25AM +0200, Juri Lelli wrote:
> Hi!
> 
> On 20/10/25 11:21, Pingfan Liu wrote:
> > Hi Waiman,
> > 
> > I appreciate your time in reviewing my patch. Please see my comments
> > below.
> > 
> > On Fri, Oct 17, 2025 at 01:52:45PM -0400, Waiman Long wrote:
> > > On 10/17/25 8:26 AM, Pingfan Liu wrote:
> > > > When testing kexec-reboot on a 144 cpus machine with
> > > > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > > > encounter the following bug:
> > > > 
> > > > [   97.114759] psci: CPU142 killed (polled 0 ms)
> > > > [   97.333236] Failed to offline CPU143 - error=-16
> > > > [   97.333246] ------------[ cut here ]------------
> > > > [   97.342682] kernel BUG at kernel/cpu.c:1569!
> > > > [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > > > [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > > > [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > > > [   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > > > [   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > > > [   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > > > [   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > > > [   97.438028] sp : ffff800097c6b9a0
> > > > [   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > > > [   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > > > [   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > > > [   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > > > [   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > > > [   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > > > [   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > > > [   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > > > [   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > > > [   97.514379] Call trace:
> > > > [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> > > > [   97.521769]  machine_shutdown+0x20/0x38
> > > > [   97.525693]  kernel_kexec+0xc4/0xf0
> > > > [   97.529260]  __do_sys_reboot+0x24c/0x278
> > > > [   97.533272]  __arm64_sys_reboot+0x2c/0x40
> > > > [   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
> > > > [   97.542179]  do_el0_svc+0xb0/0xe8
> > > > [   97.545562]  el0_svc+0x44/0x1d0
> > > > [   97.548772]  el0t_64_sync_handler+0x120/0x130
> > > > [   97.553222]  el0t_64_sync+0x1a4/0x1a8
> > > > [   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > > > [   97.563191] ---[ end trace 0000000000000000 ]---
> > > > [   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > > [   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > > > [   97.608502] PHYS_OFFSET: 0x80000000
> > > > [   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > > > [   97.617580] Memory Limit: none
> > > > [   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> > > > 
> > > > Tracking down this issue, I found that dl_bw_deactivate() returned
> > > > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > > > When a CPU is inactive, its rd is set to def_root_domain. For a
> > > > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > > > migrated to CPU0, and its task_rq() information is stale. As a result,
> > > > its bandwidth is wrongly accounted into def_root_domain during domain
> > > > rebuild.
> > > 
> > > First of all, in an emergency situation when we need to shut down the kernel,
> > > does it really matter if dl_bw_deactivate() returns -EBUSY? Should we just go
> > > ahead and ignore this dl_bw generated error?
> > > 
> > 
> > Ah, sorry - the previous test example was misleading. Let me restate it
> > as an equivalent operation on a system with 144 CPUs:
> >   sudo bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
> > 
> > That isolates the hot-removal step, which is where the bug lies, from
> > the kexec reboot process. Only cpu0 should remain online, but in
> > practice cpu143 refuses to go offline because of this bug.
> 
> I confess I am still perplexed by this, considering the "particular"
> nature of cppc worker that seems to be the only task that is able to
> trigger this problem. First of all, is that indeed the case or are you
> able to reproduce this problem with standard (non-kthread) DEADLINE
> tasks as well?
> 

Yes, I can. I wrote a SCHED_DEADLINE task that waits indefinitely on a
semaphore (or, more precisely, for a very long period that may span the
entire CPU hot-removal process) to emulate waiting for an undetermined
driver input.  Then I spawned multiple instances of this program to
ensure that some of them run on CPU 72. When I attempted to offline CPUs
1–143 one by one, CPU 143 failed to go offline.
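
For reference, a minimal sketch of what such a test program could look
like. This is not the actual program used above: the helper names
(make_dl_attr, block_as_deadline_task), the local struct dl_sched_attr
(which mirrors the first 48 bytes of the kernel's struct sched_attr),
and the 10ms/100ms/100ms parameters are all assumptions chosen for
illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6	/* policy number from the Linux UAPI headers */
#endif

/*
 * Local mirror of the original 48-byte layout of struct sched_attr.
 * Declared under a different name so it cannot clash with libc
 * versions that already provide struct sched_attr.
 */
struct dl_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

/* Fill in SCHED_DEADLINE parameters; runtime <= deadline <= period. */
struct dl_sched_attr make_dl_attr(uint64_t runtime_ns, uint64_t deadline_ns,
				  uint64_t period_ns)
{
	struct dl_sched_attr attr;

	assert(runtime_ns > 0 && runtime_ns <= deadline_ns &&
	       deadline_ns <= period_ns);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = runtime_ns;
	attr.sched_deadline = deadline_ns;
	attr.sched_period = period_ns;
	return attr;
}

/*
 * Switch the calling thread to SCHED_DEADLINE, then block on a
 * semaphore that nobody posts, so the task stays in blocked state
 * while CPUs are being offlined. Returns -1 on setup failure.
 */
int block_as_deadline_task(void)
{
	struct dl_sched_attr attr =
		make_dl_attr(10ULL * 1000 * 1000,	/* 10ms runtime */
			     100ULL * 1000 * 1000,	/* 100ms deadline */
			     100ULL * 1000 * 1000);	/* 100ms period */
	sem_t never_posted;

	/* pid 0 means the calling thread; flags must be 0. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
		perror("sched_setattr");
		return -1;
	}

	if (sem_init(&never_posted, 0, 0) != 0) {
		perror("sem_init");
		return -1;
	}

	/* Retry only if a signal interrupted the wait. */
	while (sem_wait(&never_posted) != 0 && errno == EINTR)
		;
	return 0;
}
```

Calling block_as_deadline_task() from main() typically requires root
(or CAP_SYS_NICE). Which CPU each instance last ran on is up to the
scheduler, hence the need to spawn several instances.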

> I essentially wonder how cppc worker affinity/migration on hotplug is
> handled. With your isolcpus configuration you have one isolated root

Affinity handling and migration on hotplug work fine. The key point is
that they only handle tasks that are on a runqueue. Blocked-state tasks
(here, the cppc worker) are simply ignored.

Thanks,

Pingfan

> domain per isolated cpu, so if cppc worker is not migrated away from (in
> the case above) cpu 143, then BW control might be right in saying we
> can't offline that cpu, as the worker still has BW running there. This
> is also why I first wondered (and suggested) we remove cppc worker BW
> from the picture (make it DEADLINE special) as we don't really seem to
> have a reliable way to associate meaningful BW to it anyway.
> 
> Thanks,
> Juri
>