Message-ID: <20220427010654.GC84190@shbuild999.sh.intel.com>
Date: Wed, 27 Apr 2022 09:06:54 +0800
From: Feng Tang <feng.tang@...el.com>
To: Waiman Long <longman@...hat.com>
Cc: Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...nel.org>,
"Hansen, Dave" <dave.hansen@...el.com>,
"Huang, Ying" <ying.huang@...el.com>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup
in cpuset_init_smp()
On Tue, Apr 26, 2022 at 10:58:21PM +0800, Waiman Long wrote:
> On 4/25/22 23:23, Feng Tang wrote:
> > Hi Waiman,
> >
> > On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
> >> There are 3 places where the cpu and node masks of the top cpuset can
> >> be initialized in the order they are executed:
> >> 1) start_kernel -> cpuset_init()
> >> 2) start_kernel -> cgroup_init() -> cpuset_bind()
> >> 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
> >>
> >> The first cpuset_init() function just sets all the bits in the masks.
> >> The last one executed is cpuset_init_smp() which sets up cpu and node
> >> masks suitable for v1, but not v2. cpuset_bind() does the right setup
> >> for both v1 and v2.
> >>
> >> For systems with cgroup v2 setup, cpuset_bind() is called once. For
> >> systems with cgroup v1 setup, cpuset_bind() is called twice. It is
> >> first called before cpuset_init_smp() in cgroup v2 mode. Then it is
> >> called again when cgroup v1 filesystem is mounted in v1 mode after
> >> cpuset_init_smp().
> >>
> >> [ 2.609781] cpuset_bind() called - v2 = 1
> >> [ 3.079473] cpuset_init_smp() called
> >> [ 7.103710] cpuset_bind() called - v2 = 0
> > I ran some tests. On a server running CentOS, cpuset_bind() was
> > indeed called twice, first as v2 during kernel boot, and then as
> > v1 post-boot.
> >
> > However on a QEMU running with a basic debian rootfs image,
> > the second call of cpuset_bind() didn't happen.
>
> The first time cpuset_bind() is called in cgroup_init(), the kernel
> doesn't know if userspace is going to mount v1 or v2 cgroup. By default,
> it is assumed to be v2. However, if userspace mounts the cgroup v1
> filesystem for cpuset, cpuset_bind() will be run at this point by
> rebind_subsystem() to set up cgroup v1 environment and
> cpus_allowed/mems_allowed will be correctly set at this point. Mounting
> the cgroup v2 filesystem, however, does not cause rebind_subsystem() to
> run and hence cpuset_bind() is not called again.
>
> Is the QEMU setup not mounting any cgroup filesystem at all? If so, does
> it matter whether v1 or v2 setup is used?
When I got the cpuset binding error report, I first tried to reproduce it
on QEMU and failed (because there was no memory hotplug), then I
reproduced it on a real server. On both systems, I used the "cgroup_no_v1=all"
cmdline parameter to test cgroup-v2; could this be the reason? (TBH,
this is the first time I have used cgroup-v2.)
Here is the info dump:
# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
# cat /proc/filesystems | grep cgroup
nodev cgroup
nodev cgroup2
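
A dump like the one above can be read mechanically. The following is only a
minimal sketch, not part of the patch: classify_cgroup_mounts is a
hypothetical helper name, and it assumes the `mount` output format shown
above (options inside the trailing parentheses). It reports whether a cgroup2
filesystem or a v1 hierarchy carrying the cpuset controller is mounted.
Going by Waiman's explanation, only the latter would trigger a second
cpuset_bind() call.

```shell
#!/bin/sh
# Hypothetical helper: reads `mount`-style lines on stdin and reports
# whether cgroup v2 and/or a v1 cpuset hierarchy is mounted.
classify_cgroup_mounts() {
    v2=no
    v1_cpuset=no
    while IFS= read -r line; do
        case "$line" in
            *" type cgroup2 "*)
                v2=yes
                ;;
            *" type cgroup "*)
                # Keep only the option list inside the trailing parentheses.
                opts=${line##*\(}
                opts=${opts%\)}
                # Wrap in commas so "cpuset" matches only as a whole option.
                case ",$opts," in
                    *,cpuset,*) v1_cpuset=yes ;;
                esac
                ;;
        esac
    done
    echo "cgroup2_mounted=$v2 v1_cpuset_mounted=$v1_cpuset"
}
```

Usage: `mount | classify_cgroup_mounts`. For the two mount lines in the dump
above, neither condition holds, since the only cgroup mount is the named
systemd v1 hierarchy.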
Thanks,
Feng
> >> As a result, cpu and memory node hot add may fail to update the cpu and
> >> node masks of the top cpuset to include the newly added cpu or node in
> >> a cgroup v2 environment.
> >>
> >> smp_init() is called after the first two init functions. So we don't
> >> have a complete list of active cpus and memory nodes until later in
> >> cpuset_init_smp() which is the right time to set up effective_cpus
> >> and effective_mems.
> >>
> >> To fix this problem, the potentially incorrect cpus_allowed &
> >> mems_allowed setup in cpuset_init_smp() are removed. For cgroup v2
> >> systems, the initial cpuset_bind() call will set them up correctly.
> >> For cgroup v1 systems, the second call to cpuset_bind() will do the
> >> right setup.
> >>
> >> cc: stable@...r.kernel.org
> >> Signed-off-by: Waiman Long <longman@...hat.com>
> >> ---
> >> kernel/cgroup/cpuset.c | 5 +++--
> >> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> >> index 9390bfd9f1cd..6bd8f5ef40fe 100644
> >> --- a/kernel/cgroup/cpuset.c
> >> +++ b/kernel/cgroup/cpuset.c
> >> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
> >> */
> >> void __init cpuset_init_smp(void)
> >> {
> >> - cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> >> - top_cpuset.mems_allowed = node_states[N_MEMORY];
> > So can we keep line
> > cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> >
> > and only remove line
> > top_cpuset.mems_allowed = node_states[N_MEMORY];
> > ?
>
> That may cause cpuset.cpus to be set incorrectly for systems using
> cgroup v2. What is really important is that effective_cpus and
> effective_mems are set correctly.
>
> Cheers,
> Longman
>