Message-ID: <aCyRhAeGwLSVf2LZ@localhost.localdomain>
Date: Tue, 20 May 2025 16:28:20 +0200
From: Frederic Weisbecker <frederic@...nel.org>
To: Gabriele Monaco <gmonaco@...hat.com>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
	Waiman Long <longman@...hat.com>,
	Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [PATCH v5 5/6] cgroup/cpuset: Fail if isolated and nohz_full
 don't leave any housekeeping

(Please keep Anna-Maria Cc'ed)

On Thu, May 08, 2025 at 04:53:25PM +0200, Gabriele Monaco wrote:
> Currently the user can set up isolated CPUs via cpuset and nohz_full in
> such a way that leaves no housekeeping CPU (i.e. no CPU that is neither
> domain isolated nor nohz_full). This can be a problem for other
> subsystems (e.g. the timer wheel migration).
> 
> Prevent this configuration by blocking any assignment that would cause
> the union of domain isolated CPUs and nohz_full to cover all CPUs.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
> ---
>  kernel/cgroup/cpuset.c | 67 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 65 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 95316d39c282..2f1df6f5b988 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -80,6 +80,12 @@ static cpumask_var_t	subpartitions_cpus;
>   */
>  static cpumask_var_t	isolated_cpus;
>  
> +/*
> + * Housekeeping CPUs for both HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE
> + */
> +static cpumask_var_t	full_hk_cpus;
> +static bool		have_boot_nohz_full;

Do you really need to maintain those copies?

> +
>  /*
>   * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
>   */
> @@ -1253,10 +1259,26 @@ static void reset_partition_data(struct cpuset *cs)
>  static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
>  {
>  	WARN_ON_ONCE(old_prs == new_prs);
> -	if (new_prs == PRS_ISOLATED)
> +	if (new_prs == PRS_ISOLATED) {
>  		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
> -	else
> +		cpumask_andnot(full_hk_cpus, full_hk_cpus, xcpus);
> +	} else {
>  		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> +		cpumask_or(full_hk_cpus, full_hk_cpus, xcpus);
> +	}
> +}
> +
> +/*
> + * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
> + * @prs: new or old partition_root_state
> + * @parent: parent cpuset
> + * Return: true if isolated_cpus needs modification, false otherwise
> + */
> +static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
> +{
> +	if (!parent)
> +		parent = &top_cpuset;
> +	return prs != parent->partition_root_state;
>  }
>  
>  /*
> @@ -1323,6 +1345,25 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
>  	return isolcpus_updated;
>  }
>  
> +/*
> + * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts
> + * @new_cpus: cpu mask

The description lacks an explanation of the role of this cpu mask.

> + * Return: true if there is conflict, false otherwise
> + *
> + * If nohz_full is enabled and we have isolated CPUs, their combination must
> + * still leave housekeeping CPUs.
> + */
> +static bool isolcpus_nohz_conflict(struct cpumask *new_cpus)
> +{
> +	if (!have_boot_nohz_full)
> +		return false;
> +
> +	if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
> +		return true;

Do we also need to make sure that in this weight there is an online CPU?

Can you allocate a temporary mask here and do:

cpumask_var_t full_hk_cpus;
int ret;

if (!zalloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
    return true;

cpumask_copy(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
cpumask_and(full_hk_cpus, full_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
cpumask_and(full_hk_cpus, full_hk_cpus, cpu_online_mask);
if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
    ret = true;
else
    ret = false;

free_cpumask_var(full_hk_cpus);
return ret;

I also realize something: what makes sure that we don't offline the last
non-isolated CPU?

I just did a small test:

# cd /sys/fs/cgroup/
# echo +cpuset > cgroup.subtree_control
# cat cpuset.cpus.effective 
0-7
# mkdir test
# cd test
# echo +cpuset > cgroup.subtree_control
# echo 0-6 > cpuset.cpus
# echo isolated > cpuset.cpus.partition
# cat ../cpuset.cpus.effective 
7
# echo 0 > /sys/devices/system/cpu/cpu7/online
[ 4590.864066] ------------[ cut here ]------------
[ 4590.866469] WARNING: CPU: 7 PID: 50 at kernel/cgroup/cpuset.c:1906 update_parent_effective_cpumask+0x770/0x8c0
[ 4590.870023] Modules linked in:
[ 4590.871058] CPU: 7 UID: 0 PID: 50 Comm: cpuhp/7 Not tainted 6.15.0-rc2-g996d9d202383 #10 PREEMPT(voluntary) 
[ 4590.873588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[ 4590.875689] RIP: 0010:update_parent_effective_cpumask+0x770/0x8c0
[ 4590.876858] Code: 06 48 8b 0c 24 ba 05 00 00 00 48 23 85 f8 00 00 00 41 0f 95 c6 48 89 01 41 8b 84 24 34 01 00 00 45 0f b6 f6 e9 90 fe ff ff 90 <0f> 0b 90e
[ 4590.880010] RSP: 0018:ffffa4ce001ebd40 EFLAGS: 00010086
[ 4590.880963] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000001
[ 4590.882342] RDX: 000000000000007f RSI: 0000000000000000 RDI: 0000000000000002
[ 4590.883683] RBP: ffffffffbdf52f00 R08: 0000000000000000 R09: 0000000000000000
[ 4590.885071] R10: ffffa223062d2388 R11: 0000000000000000 R12: ffffa223062d2200
[ 4590.886604] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000004
[ 4590.888309] FS:  0000000000000000(0000) GS:ffffa223bc4d6000(0000) knlGS:0000000000000000
[ 4590.890183] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4590.891385] CR2: 000055ab80ada170 CR3: 00000001084ac000 CR4: 00000000000006f0
[ 4590.892901] DR0: ffffffffbc8c8420 DR1: 0000000000000000 DR2: 0000000000000000
[ 4590.894341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 4590.895765] Call Trace:
[ 4590.896400]  <TASK>
[ 4590.896938]  cpuset_update_active_cpus+0x680/0x730
[ 4590.897979]  ? kvm_sched_clock_read+0x11/0x20
[ 4590.898916]  ? sched_clock+0x10/0x30
[ 4590.899785]  sched_cpu_deactivate+0x148/0x170
[ 4590.900812]  ? __pfx_sched_cpu_deactivate+0x10/0x10
[ 4590.901925]  cpuhp_invoke_callback+0x10e/0x480
[ 4590.902920]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 4590.903928]  cpuhp_thread_fun+0xd7/0x160
[ 4590.904818]  smpboot_thread_fn+0xee/0x220
[ 4590.905716]  kthread+0xf6/0x1f0
[ 4590.906471]  ? __pfx_kthread+0x10/0x10
[ 4590.907297]  ret_from_fork+0x2f/0x50
[ 4590.908110]  ? __pfx_kthread+0x10/0x10
[ 4590.908917]  ret_from_fork_asm+0x1a/0x30
[ 4590.909833]  </TASK>
[ 4590.910465] ---[ end trace 0000000000000000 ]---
[ 4590.916786] smpboot: CPU 7 is now offline

Apparently you can't trigger the same with isolcpus=0-6, for some reason.

One last thing: nohz_full makes sure that we never offline the timekeeper
(see tick_nohz_cpu_down()). The timekeeper also never shuts down its tick
and therefore never goes idle from tmigr's perspective. This way, when a
nohz_full CPU shuts down its tick, its global timers are guaranteed to be
handled, as a last resort, by the timekeeper, which is always alive as the
last global migrator.

But if the timekeeper is HK_TYPE_DOMAIN, or isolated by cpuset, it will go out
of the tmigr hierarchy, breaking the guarantee to have a live global migrator
for nohz_full.

That one is a bit trickier to solve. The easiest option is to forbid the
timekeeper from ever being made unavailable. It is also possible to migrate
the timekeeping duty to another common housekeeper.

We probably need to do the latter...

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
