Message-ID: <aPXQra4TWR0NVwDQ@jlelli-thinkpadt14gen4.remote.csb>
Date: Mon, 20 Oct 2025 08:03:25 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Pingfan Liu <piliu@...hat.com>
Cc: Waiman Long <llong@...hat.com>, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Koutný <mkoutny@...e.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Pierre Gondois <pierre.gondois@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCHv3] sched/deadline: Walk up cpuset hierarchy to decide
root domain when hot-unplug

Hi!

On 20/10/25 11:21, Pingfan Liu wrote:
> Hi Waiman,
>
> I appreciate your time in reviewing my patch. Please see the comments
> below.
>
> On Fri, Oct 17, 2025 at 01:52:45PM -0400, Waiman Long wrote:
> > On 10/17/25 8:26 AM, Pingfan Liu wrote:
> > > When testing kexec-reboot on a 144 cpus machine with
> > > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > > encounter the following bug:
> > >
> > > [ 97.114759] psci: CPU142 killed (polled 0 ms)
> > > [ 97.333236] Failed to offline CPU143 - error=-16
> > > [ 97.333246] ------------[ cut here ]------------
> > > [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> > > [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > > [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > > [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > > [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > > [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > > [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > > [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > > [ 97.438028] sp : ffff800097c6b9a0
> > > [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > > [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > > [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > > [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > > [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > > [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > > [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > > [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > > [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > > [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > > [ 97.514379] Call trace:
> > > [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> > > [ 97.521769] machine_shutdown+0x20/0x38
> > > [ 97.525693] kernel_kexec+0xc4/0xf0
> > > [ 97.529260] __do_sys_reboot+0x24c/0x278
> > > [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> > > [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> > > [ 97.542179] do_el0_svc+0xb0/0xe8
> > > [ 97.545562] el0_svc+0x44/0x1d0
> > > [ 97.548772] el0t_64_sync_handler+0x120/0x130
> > > [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> > > [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > > [ 97.563191] ---[ end trace 0000000000000000 ]---
> > > [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > > [ 97.608502] PHYS_OFFSET: 0x80000000
> > > [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > > [ 97.617580] Memory Limit: none
> > > [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> > >
> > > Tracking down this issue, I found that dl_bw_deactivate() returned
> > > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > > When a CPU is inactive, its rd is set to def_root_domain. For a
> > > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > > migrated to CPU0, and its task_rq() information is stale. As a result,
> > > its bandwidth is wrongly accounted into def_root_domain during domain
> > > rebuild.
> >
> > First of all, in an emergency situation when we need to shutdown the kernel,
> > does it really matter if dl_bw_activate() returns -EBUSY? Should we just go
> > ahead and ignore this dl_bw generated error?
> >
>
> Ah, sorry - the previous test example was misleading. Let me restate it
> as an equivalent operation on a system with 144 CPUs:
> sudo bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
>
> That extracts the hot-removal part, which is affected by the bug, from
> the kexec reboot process. It expects that only cpu0 is online, but in
> practice, the cpu143 refused to be offline due to this bug.

I confess I am still perplexed by this, given the "particular" nature
of the cppc worker, which seems to be the only task able to trigger
this problem. First of all, is that indeed the case, or are you able to
reproduce this problem with standard (non-kthread) DEADLINE tasks as
well?
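
For anyone who wants to try the non-kthread case, a minimal sketch of
turning a regular user-space task into a DEADLINE task follows. It uses
the raw sched_setattr(2) syscall. The struct name dl_attr, the helper
names and the 1 ms / 10 ms parameters used in the note below are
illustrative assumptions, not something taken from the patch. The
struct is a local copy of the first-version sched_attr layout, named
differently so it does not clash with libc headers that may define
struct sched_attr themselves.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6	/* policy number used by Linux */
#endif

/* Local copy of the first-version (48 byte) sched_attr layout. */
struct dl_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

/* Build an implicit-deadline reservation (deadline == period). */
struct dl_attr make_dl_attr(uint64_t runtime_ns, uint64_t period_ns)
{
	struct dl_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = runtime_ns;
	attr.sched_deadline = period_ns;
	attr.sched_period = period_ns;
	return attr;
}

/* Switch the calling thread to SCHED_DEADLINE; 0 or -errno. */
int become_deadline_task(uint64_t runtime_ns, uint64_t period_ns)
{
	struct dl_attr attr = make_dl_attr(runtime_ns, period_ns);

	/* pid 0 means the calling thread, flags must be 0. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0)
		return -errno;
	return 0;
}
```

A test program would call something like
become_deadline_task(1000000, 10000000), which typically needs root,
then block (for example in pause()) while the hot-unplug loop quoted
above runs. How the task is placed on a particular CPU beforehand is
deliberately left out of this sketch.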

I essentially wonder how cppc worker affinity/migration on hotplug is
handled. With your isolcpus configuration you have one isolated root
domain per isolated cpu, so if the cppc worker is not migrated away
from (in the case above) cpu 143, then BW control might be right in
saying we can't offline that cpu, as the worker still has BW running
there. This is also why I first wondered (and suggested) we remove the
cppc worker's BW from the picture (make it DEADLINE special), as we
don't really seem to have a reliable way to associate meaningful BW to
it anyway.
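
To make the single-cpu root domain argument concrete, here is a toy
user-space model of the check. It is a deliberately simplified sketch
and not the kernel's dl_bw_deactivate() logic. All names
(toy_root_domain, toy_to_ratio, toy_bw_deactivate) are hypothetical,
and the 2^20 fixed-point scale is an assumption borrowed from how the
scheduler usually represents bandwidth.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_BW_SHIFT 20	/* fixed-point scale: 1 << 20 == 100% of a cpu */

/* runtime/period as a fixed-point fraction of one cpu. */
uint64_t toy_to_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	assert(period_ns != 0);
	return (runtime_ns << TOY_BW_SHIFT) / period_ns;
}

/* Hypothetical, stripped-down view of a root domain's DEADLINE state. */
struct toy_root_domain {
	unsigned int nr_cpus;	/* active cpus spanned by the domain */
	uint64_t limit_bw;	/* admissible bandwidth per cpu */
	uint64_t total_bw;	/* bandwidth of admitted DEADLINE tasks */
};

/*
 * Would taking one cpu out of the domain leave enough capacity for the
 * bandwidth already admitted? Returns 0 if so, -EBUSY otherwise.
 */
int toy_bw_deactivate(const struct toy_root_domain *rd)
{
	unsigned int remaining = rd->nr_cpus ? rd->nr_cpus - 1 : 0;

	if (rd->total_bw > (uint64_t)remaining * rd->limit_bw)
		return -EBUSY;
	return 0;
}
```

In this model, a domain with nr_cpus == 1 and any non-zero total_bw
(say a worker with 1 ms of runtime every 10 ms) always yields -EBUSY,
because no capacity remains once its only cpu is gone. The same
bandwidth in a two-cpu domain passes the check.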

Thanks,
Juri