Message-ID: <20220712125702.yg4eqbaakvj56k6m@wubuntu>
Date:   Tue, 12 Jul 2022 13:57:02 +0100
From:   Qais Yousef <qais.yousef@....com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Xuewen Yan <xuewen.yan@...soc.com>, rafael@...nel.org,
        viresh.kumar@...aro.org, mingo@...hat.com, peterz@...radead.org,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, linux-kernel@...r.kernel.org,
        ke.wang@...soc.com, xuewyan@...mail.com, linux-pm@...r.kernel.org,
        Waiman Long <longman@...hat.com>
Subject: Re: [PATCH] sched/schedutil: Fix deadlock between cpuset and cpu
 hotplug when using schedutil

On 07/11/22 10:58, Tejun Heo wrote:
> (cc'ing Waiman)
> 
> On Mon, Jul 11, 2022 at 06:46:29PM +0100, Qais Yousef wrote:
> > Have you tried running with PROVE_LOCKING enabled? It'll help print a useful
> > output about the DEADLOCK. But your explanation was good and clear to me.
> 
> I don't think lockdep would be able to track CPU1 -> CPU2 dependency here
> unfortunately.

I see. It seems I got it wrong: CPU1 and CPU2 below run from the same
context, if I understand Xuewen's response correctly this time.

> 
> > AFAIU:
> > 
> > 
> > CPU0                                     CPU1                                   CPU2
> > 
> > // attach task to a different
> > // cpuset cgroup via sysfs
> > __acquire(cgroup_threadgroup_rwsem)
> > 
> >                                          // bring CPU2 online
> >                                          __acquire(cpu_hotplug_lock)
> >                                          // wait for CPU2 to come online
> >                                                                                 // bringup cpu online
> >                                                                                 // call cpufreq_online() which tries to create sugov kthread
> > __acquire(cpu_hotplug_lock)                                                     copy_process()
> >                                                                                    cgroup_can_fork()
> >                                                                                       cgroup_css_set_fork()
> >                                                                                       __acquire(cgroup_threadgroup_rwsem)
> > // blocks forever                        // blocks forever                            // blocks forever
> > 
> > 
> > Is this a correct summary of the problem?
> > 
> > The locks are held in reverse order and we end up with a DEADLOCK.
> > 
> > I believe the same happens on offline; it's just that the path to holding
> > the cgroup_threadgroup_rwsem on CPU2 is different.
> > 
> > This will be a tricky one. Your proposed patch might fix it for this case, but
> > if there's anything else that creates a kthread when a cpu goes online/offline
> > then we'll hit the same problem again.
> > 
> > I haven't reviewed your patch to be honest, but I think it's worth seeing
> > first if there's something that can be done at the 'right level'.
> > 
> > Needs head scratching from my side at least. This is not the first type of
> > locking issue between hotplug and cpuset :-/
> 
> Well, the only thing I can think of is always grabbing cpus_read_lock()
> before grabbing threadgroup_rwsem. Waiman, what do you think?

Are there a lot of subsystems besides cpuset that need the cpus_read_lock()?
A quick grep tells me it's the only one.

Can't we instead use cpus_read_trylock() in cpuset_can_attach(), so that we
either successfully hold the lock before we go ahead and call cpuset_attach(),
or bail out and cancel the whole attach operation, which should unlock the
threadgroup_rwsem lock?

Sounds simple in theory, but as always the devil is in the details, which
I don't all have in my head. AFAICS cgroup_migrate_execute() will call
ss->can_attach() for each subsystem, and it's the only caller of can_attach().
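
From memory, the shape of that can_attach pass in cgroup_migrate_execute()
is roughly the below (simplified, not verbatim kernel code):

	do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
		if (ss->can_attach) {
			tset->ssid = ssid;
			ret = ss->can_attach(tset);
			if (ret) {
				failed_ssid = ssid;
				/* unwinds by calling ->cancel_attach() on
				 * the subsystems that already succeeded */
				goto out_cancel_attach;
			}
		}
	} while_each_subsys_mask();

So an -EBUSY from cpuset_can_attach() should just propagate and cancel the
operation. The catch is the opposite case: if our trylock succeeds but a
*later* subsystem's can_attach() fails, cpuset_cancel_attach() would have to
drop the lock.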

Not sure if it'd be safe to do cpus_read_unlock() at the end of
cpuset_attach(), or whether we'd need to introduce an additional callback,
'cpuset_attach_finish()', where we'd release the lock; a rough sketch of
that follows the patch below.

i.e. something like the below untested-and-for-demo-purposes-only patch:


diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71a418858a5e..86c7c6aa0f12 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2226,6 +2226,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
        struct task_struct *task;
        int ret;

+       ret = cpus_read_trylock();
+       if (!ret) {
+               ret = -EBUSY;
+               return ret;     /* nothing locked yet; out_unlock would drop cpuset_rwsem */
+       }
+
        /* used later by cpuset_attach() */
        cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
        cs = css_cs(css);
@@ -2289,7 +2295,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
        cgroup_taskset_first(tset, &css);
        cs = css_cs(css);

-       cpus_read_lock();
+       /* Requires cpus_read_lock() to already be held ... */
+       /* ... it is released via cpus_read_unlock() when done. */
        percpu_down_write(&cpuset_rwsem);

        guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
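
If the extra callback turns out to be the safer route, it could look roughly
like the below. Purely hypothetical sketch: 'attach_finish' is not an
existing cgroup_subsys callback, and cpuset_cancel_attach() would need the
same unlock to cover the failure path:

	/* Hypothetical: cgroup_subsys has no 'attach_finish' callback today. */
	static void cpuset_attach_finish(struct cgroup_taskset *tset)
	{
		/* Pairs with cpus_read_trylock() in cpuset_can_attach(). */
		cpus_read_unlock();
	}

	struct cgroup_subsys cpuset_cgrp_subsys = {
		...
		.can_attach	= cpuset_can_attach,
		.cancel_attach	= cpuset_cancel_attach,	/* must also unlock */
		.attach		= cpuset_attach,
		.attach_finish	= cpuset_attach_finish,	/* hypothetical */
		...
	};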


Cheers

--
Qais Yousef
