Message-ID: <aCyRhAeGwLSVf2LZ@localhost.localdomain>
Date: Tue, 20 May 2025 16:28:20 +0200
From: Frederic Weisbecker <frederic@...nel.org>
To: Gabriele Monaco <gmonaco@...hat.com>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
Waiman Long <longman@...hat.com>,
Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [PATCH v5 5/6] cgroup/cpuset: Fail if isolated and nohz_full
don't leave any housekeeping
(Please keep Anna-Maria Cc'ed)
On Thu, May 08, 2025 at 04:53:25PM +0200, Gabriele Monaco wrote:
> Currently the user can set up isolated cpus via cpuset and nohz_full in
> such a way that leaves no housekeeping CPU (i.e. no CPU that is neither
> domain isolated nor nohz full). This can be a problem for other
> subsystems (e.g. the timer wheel migration).
>
> Prevent this configuration by blocking any assignment that would cause
> the union of domain isolated cpus and nohz_full to cover all CPUs.
>
> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
> ---
> kernel/cgroup/cpuset.c | 67 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 65 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 95316d39c282..2f1df6f5b988 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -80,6 +80,12 @@ static cpumask_var_t subpartitions_cpus;
> */
> static cpumask_var_t isolated_cpus;
>
> +/*
> + * Housekeeping CPUs for both HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE
> + */
> +static cpumask_var_t full_hk_cpus;
> +static bool have_boot_nohz_full;
Do you really need to maintain those copies?
> +
> /*
> * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
> */
> @@ -1253,10 +1259,26 @@ static void reset_partition_data(struct cpuset *cs)
> static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
> {
> WARN_ON_ONCE(old_prs == new_prs);
> - if (new_prs == PRS_ISOLATED)
> + if (new_prs == PRS_ISOLATED) {
> cpumask_or(isolated_cpus, isolated_cpus, xcpus);
> - else
> + cpumask_andnot(full_hk_cpus, full_hk_cpus, xcpus);
> + } else {
> cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> + cpumask_or(full_hk_cpus, full_hk_cpus, xcpus);
> + }
> +}
> +
> +/*
> + * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
> + * @prs: new or old partition_root_state
> + * @parent: parent cpuset
> + * Return: true if isolated_cpus needs modification, false otherwise
> + */
> +static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
> +{
> + if (!parent)
> + parent = &top_cpuset;
> + return prs != parent->partition_root_state;
> }
>
> /*
> @@ -1323,6 +1345,25 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
> return isolcpus_updated;
> }
>
> +/*
> + * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts
> + * @new_cpus: cpu mask
The description lacks an explanation of the role of this cpumask.
> + * Return: true if there is conflict, false otherwise
> + *
> + * If nohz_full is enabled and we have isolated CPUs, their combination must
> + * still leave housekeeping CPUs.
> + */
> +static bool isolcpus_nohz_conflict(struct cpumask *new_cpus)
> +{
> + if (!have_boot_nohz_full)
> + return false;
> +
> + if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
> + return true;
Do we also need to make sure that this weight includes at least one online CPU?
Can you allocate a temporary mask here and do:
	cpumask_var_t full_hk_cpus;
	bool ret;

	if (!zalloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
		return true;

	cpumask_copy(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
	cpumask_and(full_hk_cpus, full_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
	cpumask_and(full_hk_cpus, full_hk_cpus, cpu_online_mask);

	if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
		ret = true;
	else
		ret = false;

	free_cpumask_var(full_hk_cpus);
	return ret;
I also realize something: what makes sure that we don't offline the last
non-isolated CPU?
I just did a small test:
# cd /sys/fs/cgroup/
# echo +cpuset > cgroup.subtree_control
# cat cpuset.cpus.effective
0-7
# mkdir test
# cd test
# echo +cpuset > cgroup.subtree_control
# echo 0-6 > cpuset.cpus
# echo isolated > cpuset.cpus.partition
# cat ../cpuset.cpus.effective
7
# echo 0 > /sys/devices/system/cpu/cpu7/online
[ 4590.864066] ------------[ cut here ]------------
[ 4590.866469] WARNING: CPU: 7 PID: 50 at kernel/cgroup/cpuset.c:1906 update_parent_effective_cpumask+0x770/0x8c0
[ 4590.870023] Modules linked in:
[ 4590.871058] CPU: 7 UID: 0 PID: 50 Comm: cpuhp/7 Not tainted 6.15.0-rc2-g996d9d202383 #10 PREEMPT(voluntary)
[ 4590.873588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[ 4590.875689] RIP: 0010:update_parent_effective_cpumask+0x770/0x8c0
[ 4590.876858] Code: 06 48 8b 0c 24 ba 05 00 00 00 48 23 85 f8 00 00 00 41 0f 95 c6 48 89 01 41 8b 84 24 34 01 00 00 45 0f b6 f6 e9 90 fe ff ff 90 <0f> 0b 90e
[ 4590.880010] RSP: 0018:ffffa4ce001ebd40 EFLAGS: 00010086
[ 4590.880963] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000001
[ 4590.882342] RDX: 000000000000007f RSI: 0000000000000000 RDI: 0000000000000002
[ 4590.883683] RBP: ffffffffbdf52f00 R08: 0000000000000000 R09: 0000000000000000
[ 4590.885071] R10: ffffa223062d2388 R11: 0000000000000000 R12: ffffa223062d2200
[ 4590.886604] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000004
[ 4590.888309] FS: 0000000000000000(0000) GS:ffffa223bc4d6000(0000) knlGS:0000000000000000
[ 4590.890183] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4590.891385] CR2: 000055ab80ada170 CR3: 00000001084ac000 CR4: 00000000000006f0
[ 4590.892901] DR0: ffffffffbc8c8420 DR1: 0000000000000000 DR2: 0000000000000000
[ 4590.894341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 4590.895765] Call Trace:
[ 4590.896400] <TASK>
[ 4590.896938] cpuset_update_active_cpus+0x680/0x730
[ 4590.897979] ? kvm_sched_clock_read+0x11/0x20
[ 4590.898916] ? sched_clock+0x10/0x30
[ 4590.899785] sched_cpu_deactivate+0x148/0x170
[ 4590.900812] ? __pfx_sched_cpu_deactivate+0x10/0x10
[ 4590.901925] cpuhp_invoke_callback+0x10e/0x480
[ 4590.902920] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 4590.903928] cpuhp_thread_fun+0xd7/0x160
[ 4590.904818] smpboot_thread_fn+0xee/0x220
[ 4590.905716] kthread+0xf6/0x1f0
[ 4590.906471] ? __pfx_kthread+0x10/0x10
[ 4590.907297] ret_from_fork+0x2f/0x50
[ 4590.908110] ? __pfx_kthread+0x10/0x10
[ 4590.908917] ret_from_fork_asm+0x1a/0x30
[ 4590.909833] </TASK>
[ 4590.910465] ---[ end trace 0000000000000000 ]---
[ 4590.916786] smpboot: CPU 7 is now offline
Apparently you can't trigger the same with isolcpus=0-6, for some reason.
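For illustration, such a guard on the hotplug path could look roughly like
the sketch below. This is only a sketch: the hook cpuset_cpu_down_check()
and wherever it would be called from are hypothetical names, while the
housekeeping and cpumask helpers are the existing APIs:

	/*
	 * Hypothetical sketch: refuse to offline @cpu if it is the last
	 * online CPU that is housekeeping for both HK_TYPE_DOMAIN and
	 * HK_TYPE_KERNEL_NOISE and not isolated by a cpuset partition.
	 */
	static int cpuset_cpu_down_check(unsigned int cpu)
	{
		cpumask_var_t hk;
		int ret = 0;

		if (!zalloc_cpumask_var(&hk, GFP_KERNEL))
			return -ENOMEM;

		cpumask_and(hk, housekeeping_cpumask(HK_TYPE_DOMAIN),
			    housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
		cpumask_and(hk, hk, cpu_online_mask);
		cpumask_andnot(hk, hk, isolated_cpus);

		/* Refuse if @cpu is the only remaining full housekeeper */
		if (cpumask_weight(hk) == 1 && cpumask_test_cpu(cpu, hk))
			ret = -EBUSY;

		free_cpumask_var(hk);
		return ret;
	}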
One last thing: nohz_full makes sure that we never offline the timekeeper
(see tick_nohz_cpu_down()). The timekeeper also never shuts down its tick
and therefore never goes idle from the tmigr perspective. This way, when a
nohz_full CPU shuts down its tick, it knows that its global timers are
handled by the timekeeper as a last resort, because the timekeeper is the
last global migrator, always alive.
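For reference, the check in kernel/time/tick-sched.c is roughly this
(paraphrased from memory rather than quoted verbatim):

	static int tick_nohz_cpu_down(unsigned int cpu)
	{
		/*
		 * The tick_do_timer_cpu CPU handles housekeeping duty
		 * (unbound timers, workqueues, timekeeping, ...) on behalf
		 * of full dynticks CPUs. It must remain online when nohz
		 * full is enabled.
		 */
		if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
			return -EBUSY;
		return 0;
	}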
But if the timekeeper is HK_TYPE_DOMAIN, or isolated by cpuset, it will go out
of the tmigr hierarchy, breaking the guarantee to have a live global migrator
for nohz_full.
That one is a bit more tricky to solve. The easiest is to forbid the timekeeper
from ever being made unavailable. It is also possible to migrate the timekeeping duty
to another common housekeeper.
We probably need to do the latter...
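Very roughly, the handover could look like the sketch below. Everything
beyond the housekeeping/cpumask helpers is hypothetical (the real tick
code keeps the duty in tick_do_timer_cpu under its own serialization, so
this only illustrates the shape of the change):

	/*
	 * Hypothetical sketch: before @cpu stops being a housekeeper,
	 * hand the timekeeping duty to another CPU that remains
	 * housekeeping for both HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE.
	 * Not existing kernel API.
	 */
	static int timekeeping_duty_migrate(unsigned int cpu)
	{
		cpumask_var_t hk;
		int target, ret = 0;

		if (!zalloc_cpumask_var(&hk, GFP_KERNEL))
			return -ENOMEM;

		cpumask_and(hk, housekeeping_cpumask(HK_TYPE_DOMAIN),
			    housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
		cpumask_and(hk, hk, cpu_online_mask);
		cpumask_clear_cpu(cpu, hk);

		target = cpumask_first(hk);
		if (target >= nr_cpu_ids)
			ret = -EBUSY;	/* no housekeeper left to take the duty */
		else
			tick_take_do_timer_from(cpu, target);	/* hypothetical helper */

		free_cpumask_var(hk);
		return ret;
	}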
Thanks.
--
Frederic Weisbecker
SUSE Labs