Message-ID: <da1865ac15e46469e9b4bcf1933edb3959755885.camel@redhat.com>
Date: Tue, 20 May 2025 17:24:22 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
Waiman Long <longman@...hat.com>, Anna-Maria Behnsen
<anna-maria@...utronix.de>
Subject: Re: [PATCH v5 5/6] cgroup/cpuset: Fail if isolated and nohz_full
don't leave any housekeeping
On Tue, 2025-05-20 at 16:28 +0200, Frederic Weisbecker wrote:
> (Please keep Anna-Maria Cc'ed)
>
> Le Thu, May 08, 2025 at 04:53:25PM +0200, Gabriele Monaco a écrit :
> > Currently the user can set up isolated cpus via cpuset and nohz_full
> > in such a way that leaves no housekeeping CPU (i.e. no CPU that is
> > neither domain isolated nor nohz full). This can be a problem for
> > other subsystems (e.g. the timer wheel migration).
> >
> > Prevent this configuration by rejecting any assignment that would
> > cause the union of domain isolated cpus and nohz_full to cover all
> > CPUs.
> >
> > Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
> > ---
> >  kernel/cgroup/cpuset.c | 67 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 65 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 95316d39c282..2f1df6f5b988 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -80,6 +80,12 @@ static cpumask_var_t subpartitions_cpus;
> > */
> > static cpumask_var_t isolated_cpus;
> >
> > +/*
> > + * Housekeeping CPUs for both HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE
> > + */
> > +static cpumask_var_t full_hk_cpus;
> > +static bool have_boot_nohz_full;
>
> Do you really need to maintain those copies?
>
Yeah, good point. I wanted to avoid allocating temporary masks, but that's
probably better than maintaining those...
> > +
> > /*
> > * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
> > */
> > @@ -1253,10 +1259,26 @@ static void reset_partition_data(struct cpuset *cs)
> >  static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
> > {
> > WARN_ON_ONCE(old_prs == new_prs);
> > - if (new_prs == PRS_ISOLATED)
> > + if (new_prs == PRS_ISOLATED) {
> > cpumask_or(isolated_cpus, isolated_cpus, xcpus);
> > - else
> > + cpumask_andnot(full_hk_cpus, full_hk_cpus, xcpus);
> > + } else {
> > cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> > + cpumask_or(full_hk_cpus, full_hk_cpus, xcpus);
> > + }
> > +}
> > +
> > +/*
> > + * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
> > + * @prs: new or old partition_root_state
> > + * @parent: parent cpuset
> > + * Return: true if isolated_cpus needs modification, false otherwise
> > + */
> > +static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
> > +{
> > + if (!parent)
> > + parent = &top_cpuset;
> > + return prs != parent->partition_root_state;
> > }
> >
> > /*
> > @@ -1323,6 +1345,25 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
> > return isolcpus_updated;
> > }
> >
> > +/*
> > + * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts
> > + * @new_cpus: cpu mask
>
> The description lacks explanation about the role of this cpu mask.
>
Mmh yeah, that was a copy-paste from prstate_housekeeping_conflict, but I
agree, I should describe it better, at least here...
> > + * Return: true if there is conflict, false otherwise
> > + *
> > + * If nohz_full is enabled and we have isolated CPUs, their combination
> > + * must still leave housekeeping CPUs.
> > + */
> > +static bool isolcpus_nohz_conflict(struct cpumask *new_cpus)
> > +{
> > + if (!have_boot_nohz_full)
> > + return false;
> > +
> > + if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
> > + return true;
>
> Do we also need to make sure that in this weight there is an online CPU?
>
> Can you allocate a temporary mask here and do:
>
> cpumask_var_t full_hk_cpus;
> int ret;
>
> if (!zalloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
> return true;
>
> cpumask_copy(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
> cpumask_and(full_hk_cpus, full_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
> cpumask_and(full_hk_cpus, full_hk_cpus, cpu_online_mask);
> if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
> ret = true;
> else
> ret = false;
>
> free_cpumask_var(full_hk_cpus);
>
Yeah, that looks safer. I'll do it with a temporary mask.
> I also realize something: what makes sure that we don't offline the
> last non-isolated CPU?
>
Mmh, I guess we need to enforce that too then. I remember the system
preventing me from doing that under some conditions, but I need to play
a bit more to understand what's going on...
> I just did a small test:
>
> # cd /sys/fs/cgroup/
> # echo +cpuset > cgroup.subtree_control
> # cat cpuset.cpus.effective
> 0-7
> # mkdir test
> # cd test
> # echo +cpuset > cgroup.subtree_control
> # echo 0-6 > cpuset.cpus
> # echo isolated > cpuset.cpus.partition
> # cat ../cpuset.cpus.effective
> 7
> # echo 0 > /sys/devices/system/cpu/cpu7/online
> [ 4590.864066] ------------[ cut here ]------------
> [ 4590.866469] WARNING: CPU: 7 PID: 50 at kernel/cgroup/cpuset.c:1906
> update_parent_effective_cpumask+0x770/0x8c0
> [ 4590.870023] Modules linked in:
> [ 4590.871058] CPU: 7 UID: 0 PID: 50 Comm: cpuhp/7 Not tainted
> 6.15.0-rc2-g996d9d202383 #10 PREEMPT(voluntary)
> [ 4590.873588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
> [ 4590.875689] RIP: 0010:update_parent_effective_cpumask+0x770/0x8c0
> [ 4590.876858] Code: 06 48 8b 0c 24 ba 05 00 00 00 48 23 85 f8 00 00
> 00 41 0f 95 c6 48 89 01 41 8b 84 24 34 01 00 00 45 0f b6 f6 e9 90 fe
> ff ff 90 <0f> 0b 90e
> [ 4590.880010] RSP: 0018:ffffa4ce001ebd40 EFLAGS: 00010086
> [ 4590.880963] RAX: 00000000ffffffff RBX: 0000000000000000 RCX:
> 0000000000000001
> [ 4590.882342] RDX: 000000000000007f RSI: 0000000000000000 RDI:
> 0000000000000002
> [ 4590.883683] RBP: ffffffffbdf52f00 R08: 0000000000000000 R09:
> 0000000000000000
> [ 4590.885071] R10: ffffa223062d2388 R11: 0000000000000000 R12:
> ffffa223062d2200
> [ 4590.886604] R13: 0000000000000002 R14: 0000000000000001 R15:
> 0000000000000004
> [ 4590.888309] FS: 0000000000000000(0000) GS:ffffa223bc4d6000(0000)
> knlGS:0000000000000000
> [ 4590.890183] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4590.891385] CR2: 000055ab80ada170 CR3: 00000001084ac000 CR4:
> 00000000000006f0
> [ 4590.892901] DR0: ffffffffbc8c8420 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 4590.894341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000600
> [ 4590.895765] Call Trace:
> [ 4590.896400] <TASK>
> [ 4590.896938] cpuset_update_active_cpus+0x680/0x730
> [ 4590.897979] ? kvm_sched_clock_read+0x11/0x20
> [ 4590.898916] ? sched_clock+0x10/0x30
> [ 4590.899785] sched_cpu_deactivate+0x148/0x170
> [ 4590.900812] ? __pfx_sched_cpu_deactivate+0x10/0x10
> [ 4590.901925] cpuhp_invoke_callback+0x10e/0x480
> [ 4590.902920] ? __pfx_smpboot_thread_fn+0x10/0x10
> [ 4590.903928] cpuhp_thread_fun+0xd7/0x160
> [ 4590.904818] smpboot_thread_fn+0xee/0x220
> [ 4590.905716] kthread+0xf6/0x1f0
> [ 4590.906471] ? __pfx_kthread+0x10/0x10
> [ 4590.907297] ret_from_fork+0x2f/0x50
> [ 4590.908110] ? __pfx_kthread+0x10/0x10
> [ 4590.908917] ret_from_fork_asm+0x1a/0x30
> [ 4590.909833] </TASK>
> [ 4590.910465] ---[ end trace 0000000000000000 ]---
> [ 4590.916786] smpboot: CPU 7 is now offline
>
> Apparently you can't trigger the same with isolcpus=0-6, for some
> reason.
>
> One last thing: nohz_full makes sure that we never offline the timekeeper
> (see tick_nohz_cpu_down()). The timekeeper also never shuts down its tick
> and therefore never goes idle, from the tmigr perspective. This way, when
> a nohz_full CPU shuts down its tick, its global timers are handled by the
> timekeeper as a last resort, because it's the last global migrator, always
> alive.
>
> But if the timekeeper is HK_TYPE_DOMAIN, or isolated by cpuset, it will
> go out of the tmigr hierarchy, breaking the guarantee to have a live
> global migrator for nohz_full.
>
> That one is a bit more tricky to solve. The easiest is to forbid the
> timekeeper from ever being made unavailable. It is also possible to
> migrate the timekeeping duty to another common housekeeper.
>
Mmh, if I understand correctly, this would mean we need to:
1. make sure the timekeeper is not in isolcpus at boot
2. change the timekeeper in case it is included in an isolated cpuset
3. avoid picking domain isolated CPUs when the timekeeper goes offline
All of those would be meaningful only if nohz_full is active, right? Am I
missing any corner case? Are any of those already happening?
Thanks,
Gabriele