Message-ID: <CAB8ipk_mB0uo8YUj6Ct5t9BWjXba1PznJ7QL7UVg+mSO401=bg@mail.gmail.com>
Date: Tue, 12 Jul 2022 13:57:59 +0800
From: Xuewen Yan <xuewen.yan94@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Tejun Heo <tj@...nel.org>, Qais Yousef <qais.yousef@....com>,
Xuewen Yan <xuewen.yan@...soc.com>, rafael@...nel.org,
viresh.kumar@...aro.org, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, linux-kernel@...r.kernel.org,
ke.wang@...soc.com, xuewyan@...mail.com, linux-pm@...r.kernel.org,
Waiman Long <longman@...hat.com>,
Lukasz Luba <Lukasz.Luba@....com>
Subject: Re: [PATCH] sched/schedutil: Fix deadlock between cpuset and cpu
hotplug when using schedutil
[RESEND]
On Tue, Jul 12, 2022 at 5:34 AM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Mon, 11 Jul 2022 10:58:28 -1000
> Tejun Heo <tj@...nel.org> wrote:
>
> > I don't think lockdep would be able to track CPU1 -> CPU2 dependency here
> > unfortunately.
> >
> > > AFAIU:
> > >
> > >
> > > CPU0                                    CPU1                            CPU2
> > >
> > > // attach task to a different
> > > // cpuset cgroup via sysfs
> > > __acquire(cgroup_threadgroup_rwsem)
> > >
> > >                                         // bring up CPU2 online
> > >                                         __acquire(cpu_hotplug_lock)
> > >                                         // wait for CPU2 to come online
>
> Should there be some annotation here that tells lockdep that CPU1 is now
> blocked on CPU2?
>
> Then this case would be caught by lockdep.
>
> -- Steve
>
>
> > >                                                                         // bringup cpu online
> > >                                                                         // call cpufreq_online() which tries to create sugov kthread
> > > __acquire(cpu_hotplug_lock)                                             copy_process()
> > >                                                                           cgroup_can_fork()
> > >                                                                             cgroup_css_set_fork()
> > >                                                                               __acquire(cgroup_threadgroup_rwsem)
> > > // blocks forever                       // blocks forever               // blocks forever
> > >
Indeed, it's caused by threads rather than CPUs.
Our SoC has two cpufreq policies: CPU0-5 belong to policy0 and CPU6-7
belong to policy1.
When cpu6/7 goes online:
Thread-A                                    Thread-B
cgroup_file_write                           device_online
  cgroup1_tasks_write                         ...
    __cgroup1_procs_write                       _cpu_up
      write(&cgroup_threadgroup_rwsem); <<        cpus_write_lock(); <<
      cgroup_attach_task                          ......
        cgroup_migrate_execute                    cpuhp_kick_ap
          cpuset_attach                             // wakeup cpuhp
            cpus_read_lock()                        // waiting for cpuhp

cpuhp/6                                     kthreadd
cpuhp_thread_fun
  cpuhp_invoke_callback
    cpuhp_cpufreq_online
      cpufreq_online
        sugov_init
          __kthread_create_on_node          copy_process
          // blocked, waiting for kthreadd    cgroup_can_fork
                                                cgroup_css_set_fork
                                                  __acquires(&cgroup_threadgroup_rwsem)
                                                    // blocked
So the dependency chain is:
Thread-A ---> Thread-B ---> cpuhp ---> kthreadd ---> Thread-A
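
If it helps to see the cycle outside the kernel, below is a rough userspace
analogue in plain C + pthreads (all names are made up; it is only an
illustration, not kernel code). One rwlock stands in for
cgroup_threadgroup_rwsem, another for cpu_hotplug_lock, and pthread_join()
stands in for "Thread-B waits for cpuhp" / "cpuhp waits for kthreadd".
Built with "gcc -pthread", it should hang with each thread stuck where the
chain above says:

/* Userspace analogue of Thread-A -> Thread-B -> cpuhp -> kthreadd -> Thread-A.
 * Hypothetical names, deadlocks by design. Build: gcc -pthread analogue.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t rwsem   = PTHREAD_RWLOCK_INITIALIZER; /* ~ cgroup_threadgroup_rwsem */
static pthread_rwlock_t hotplug = PTHREAD_RWLOCK_INITIALIZER; /* ~ cpu_hotplug_lock */

static void *kthreadd_fn(void *arg)
{
        (void)arg;
        sleep(1);                          /* let the others take their locks first */
        puts("kthreadd: copy_process() -> needs rwsem");
        pthread_rwlock_rdlock(&rwsem);     /* blocks: Thread-A holds it for write */
        return NULL;
}

static void *cpuhp_fn(void *arg)
{
        pthread_t kthreadd;
        (void)arg;
        puts("cpuhp: sugov_init() -> asks kthreadd for a new kthread");
        pthread_create(&kthreadd, NULL, kthreadd_fn, NULL);
        pthread_join(kthreadd, NULL);      /* blocks: waits for kthreadd */
        return NULL;
}

static void *thread_b_fn(void *arg)
{
        pthread_t cpuhp;
        (void)arg;
        puts("Thread-B: cpus_write_lock()");
        pthread_rwlock_wrlock(&hotplug);
        pthread_create(&cpuhp, NULL, cpuhp_fn, NULL);
        pthread_join(cpuhp, NULL);         /* blocks: waits for cpuhp */
        pthread_rwlock_unlock(&hotplug);
        return NULL;
}

static void *thread_a_fn(void *arg)
{
        (void)arg;
        puts("Thread-A: write-lock cgroup_threadgroup_rwsem");
        pthread_rwlock_wrlock(&rwsem);
        sleep(1);                          /* let Thread-B grab the hotplug lock */
        puts("Thread-A: cpuset_attach() -> cpus_read_lock()");
        pthread_rwlock_rdlock(&hotplug);   /* blocks: Thread-B holds it for write */
        pthread_rwlock_unlock(&hotplug);
        pthread_rwlock_unlock(&rwsem);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a_fn, NULL);
        pthread_create(&b, NULL, thread_b_fn, NULL);
        pthread_join(a, NULL);             /* never returns: circular wait */
        pthread_join(b, NULL);
        return 0;
}
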
When the CPU goes offline, the sugov thread is stopped, so it ends up
waiting on cgroup_threadgroup_rwsem in kthread_stop().
The dependency chain is:
Thread-A ---> Thread-B ---> cpuhp ---> sugov ---> Thread-A
As Qais said:
> if there's anything else that creates a kthread when a cpu goes online/offline
> then we'll hit the same problem again.
Indeed, only the cpuhp thread creating/destroying a kthread can trigger this case.
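
In other words, any driver that does something like the minimal sketch below
would open the same window: an AP online callback (which runs in the
cpuhp/<N> thread) creates a kthread, and the teardown callback stops it.
The demo_* names are hypothetical, only the pattern matters;
cpuhp_setup_state(), kthread_run() and kthread_stop() are the normal APIs:

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch (hypothetical demo_* names, not the schedutil code):
 * a dynamic AP-online state whose callbacks run in cpuhp/<cpu> and
 * create/stop a kthread, i.e. go through copy_process() / kthread_stop()
 * while the hotplug path above them holds cpus_write_lock().
 */
#include <linux/module.h>
#include <linux/cpuhotplug.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *demo_worker;
static enum cpuhp_state demo_state;

static int demo_worker_fn(void *unused)
{
        while (!kthread_should_stop())
                schedule_timeout_interruptible(HZ);
        return 0;
}

static int demo_cpu_online(unsigned int cpu)
{
        struct task_struct *t;

        if (demo_worker)
                return 0;
        /* Runs in cpuhp/<cpu>: kthread_run() -> kthreadd -> copy_process()
         * -> cgroup_can_fork() -> cgroup_threadgroup_rwsem. */
        t = kthread_run(demo_worker_fn, NULL, "demo_worker/%u", cpu);
        if (IS_ERR(t))
                return PTR_ERR(t);
        demo_worker = t;
        return 0;
}

static int demo_cpu_offline(unsigned int cpu)
{
        /* kthread_stop() waits for the worker to exit. */
        if (demo_worker) {
                kthread_stop(demo_worker);
                demo_worker = NULL;
        }
        return 0;
}

static int __init demo_init(void)
{
        int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "demo:online",
                                    demo_cpu_online, demo_cpu_offline);
        if (ret < 0)
                return ret;
        demo_state = ret;
        return 0;
}

static void __exit demo_exit(void)
{
        cpuhp_remove_state(demo_state);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
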
I have put the test script in the mail; I tested with it (without the
monkey test) and the deadlock still occurs.
Thanks!
xuewen.yan