Message-ID: <aPZUWF_bhli1CEcn@jlelli-thinkpadt14gen4.remote.csb>
Date: Mon, 20 Oct 2025 17:25:12 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Pingfan Liu <piliu@...hat.com>
Cc: Waiman Long <llong@...hat.com>, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Koutný <mkoutny@...e.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Pierre Gondois <pierre.gondois@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCHv3] sched/deadline: Walk up cpuset hierarchy to decide
root domain when hot-unplug
On 20/10/25 21:34, Pingfan Liu wrote:
> Hi Juri,
>
> Thanks for following up on this topic. Please check my comment below.
>
> On Mon, Oct 20, 2025 at 08:03:25AM +0200, Juri Lelli wrote:
> > Hi!
> >
> > On 20/10/25 11:21, Pingfan Liu wrote:
> > > Hi Waiman,
> > >
> > > I appreciate your time in reviewing my patch. Please see my comments
> > > below.
> > >
> > > On Fri, Oct 17, 2025 at 01:52:45PM -0400, Waiman Long wrote:
> > > > On 10/17/25 8:26 AM, Pingfan Liu wrote:
> > > > > When testing kexec-reboot on a 144-CPU machine with
> > > > > isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
> > > > > encountered the following bug:
> > > > >
> > > > > [ 97.114759] psci: CPU142 killed (polled 0 ms)
> > > > > [ 97.333236] Failed to offline CPU143 - error=-16
> > > > > [ 97.333246] ------------[ cut here ]------------
> > > > > [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> > > > > [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > > > > [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > > > > [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > > > > [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > > > > [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > > > > [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > > > > [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > > > > [ 97.438028] sp : ffff800097c6b9a0
> > > > > [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > > > > [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > > > > [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > > > > [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > > > > [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > > > > [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > > > > [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > > > > [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > > > > [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > > > > [ 97.514379] Call trace:
> > > > > [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> > > > > [ 97.521769] machine_shutdown+0x20/0x38
> > > > > [ 97.525693] kernel_kexec+0xc4/0xf0
> > > > > [ 97.529260] __do_sys_reboot+0x24c/0x278
> > > > > [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> > > > > [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> > > > > [ 97.542179] do_el0_svc+0xb0/0xe8
> > > > > [ 97.545562] el0_svc+0x44/0x1d0
> > > > > [ 97.548772] el0t_64_sync_handler+0x120/0x130
> > > > > [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> > > > > [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > > > > [ 97.563191] ---[ end trace 0000000000000000 ]---
> > > > > [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > > > [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > > > > [ 97.608502] PHYS_OFFSET: 0x80000000
> > > > > [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > > > > [ 97.617580] Memory Limit: none
> > > > > [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> > > > >
> > > > > Tracking down this issue, I found that dl_bw_deactivate() returned
> > > > > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > > > > When a CPU is inactive, its rd is set to def_root_domain. For an
> > > > > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > > > > migrated to CPU0, and its task_rq() information is stale. As a result,
> > > > > its bandwidth is wrongly accounted into def_root_domain during domain
> > > > > rebuild.
> > > >
> > > > First of all, in an emergency situation when we need to shut down the
> > > > kernel, does it really matter if dl_bw_deactivate() returns -EBUSY?
> > > > Should we just go ahead and ignore this dl_bw-generated error?
> > > >
> > >
> > > Ah, sorry - the previous test example was misleading. Let me restate it
> > > as an equivalent operation on a system with 144 CPUs:
> > > sudo bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
> > >
> > > This extracts the hot-removal part, which is affected by the bug, from
> > > the kexec reboot process. It should leave only CPU0 online, but in
> > > practice CPU143 refused to go offline due to this bug.
> >
> > I confess I am still perplexed by this, considering the "particular"
> > nature of the cppc worker, which seems to be the only task able to
> > trigger this problem. First of all, is that indeed the case or are you
> > able to reproduce this problem with standard (non-kthread) DEADLINE
> > tasks as well?
> >
>
> Yes, I can. I wrote a SCHED_DEADLINE task that waits indefinitely on a
> semaphore (or, more precisely, for a very long period that may span the
> entire CPU hot-removal process) to emulate waiting for an undetermined
> driver input. Then I spawned multiple instances of this program to
> ensure that some of them run on CPU 72. When I attempted to offline CPUs
> 1–143 one by one, CPU 143 failed to go offline.
>
> > I essentially wonder how the cppc worker's affinity/migration on hotplug
> > is handled. With your isolcpus configuration you have one isolated root
>
> Affinity/migration on hotplug works fine. The key point is that it only
> handles tasks that are on a runqueue. Blocked-state tasks (here, the cppc
> worker) are simply ignored.
OK. Thanks for confirming/clarifying.