Message-ID: <Zo9XAmjpP6y0ZDGH@google.com>
Date: Thu, 11 Jul 2024 03:52:34 +0000
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: chenridong <chenridong@...wei.com>, tj@...nel.org
Cc: martin.lau@...ux.dev, ast@...nel.org, daniel@...earbox.net,
andrii@...nel.org, eddyz87@...il.com, song@...nel.org,
yonghong.song@...ux.dev, john.fastabend@...il.com,
kpsingh@...nel.org, sdf@...gle.com, haoluo@...gle.com,
jolsa@...nel.org, tj@...nel.org, lizefan.x@...edance.com,
hannes@...xchg.org, bpf@...r.kernel.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH -next] cgroup: Fix AA deadlock caused by
cgroup_bpf_release
On Wed, Jul 10, 2024 at 11:02:57AM +0800, chenridong wrote:
>
>
> On 2024/7/9 21:42, chenridong wrote:
> >
> >
> > On 2024/6/10 10:47, Roman Gushchin wrote:
> > > Hi Chen!
> > >
> > > Was this problem found in real life? Do you have a LOCKDEP
> > > splat available?
> > >
> > Sorry for the late email response.
> > Yes, it was. The issue occurred after a long period of stress testing,
> > with a very low probability.
> > > > On Jun 7, 2024, at 4:09 AM, Chen Ridong <chenridong@...wei.com> wrote:
> > > >
> > > > We found an AA deadlock problem as shown below:
> > > >
> > > > cgroup_destroy_wq       TaskB               WatchDog                        system_wq
> > > >
> > > > ...
> > > > css_killed_work_fn:
> > > >   P(cgroup_mutex)
> > > > ...
> > > >                                             ...
> > > >                                             __lockup_detector_reconfigure:
> > > >                                               P(cpu_hotplug_lock.read)
> > > >                                             ...
> > > >                         ...
> > > >                         percpu_down_write:
> > > >                           P(cpu_hotplug_lock.write)
> > > >                                                                             ...
> > > >                                                                             cgroup_bpf_release:
> > > >                                                                               P(cgroup_mutex)
> > > >                                             smp_call_on_cpu:
> > > >                                               Wait system_wq
> > > >
> > > > cpuset_css_offline:
> > > >   P(cpu_hotplug_lock.read)
> > > >
> > > > WatchDog is waiting for system_wq, which is waiting for cgroup_mutex to
> > > > finish its jobs, but the owner of cgroup_mutex is waiting for
> > > > cpu_hotplug_lock. The problem is caused by commit 4bfc0bb2c60e ("bpf:
> > > > decouple the lifetime of cgroup_bpf from cgroup itself"), which puts the
> > > > cgroup_bpf release work into system_wq. As cgroup_bpf is a member of
> > > > cgroup, it is reasonable to put the cgroup_bpf release work into
> > > > cgroup_destroy_wq, which is used only for cgroup's release work. This
> > > > solves the problem.
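
For reference, my understanding of the proposed change (a rough sketch from
memory, not the actual diff, so the exact names may be off) is that
cgroup_bpf_release_fn() simply stops using system_wq:

static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
        struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

        INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
        /* queue on the dedicated cgroup destroy workqueue instead of system_wq */
        queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
}

which presumably also means making cgroup_destroy_wq visible outside of
kernel/cgroup/cgroup.c in some form.
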
> > >
> > > I need to think more on this, but at first glance the fix looks a
> > > bit confusing. cgroup_bpf_release() looks quite innocent, it only
> > > takes cgroup_mutex. It’s not obvious why that is not ok and requires
> > > a dedicated workqueue. What exactly is achieved by placing it back
> > > on the dedicated cgroup destroy workqueue?
> > >
> > > I’m not trying to say your fix won’t work, but it looks like it
> > > might be covering up a more serious problem.
> >
> > The issue lies in the fact that different tasks acquire cgroup_mutex and
> > cpu_hotplug_lock in ways that eventually form a deadlock. Placing the
> > cgroup_bpf release work on cgroup_destroy_wq breaks the loop.
> >
> The max_active of system_wq is WQ_DFL_ACTIVE (256). If all of the active
> works are cgroup_bpf release works, they will block the smp_call_on_cpu work
> that is enqueued after them. So the task calling smp_call_on_cpu, which holds
> cpu_hotplug_lock, will wait for a completion that can never arrive, because
> the cgroup_bpf release works cannot get cgroup_mutex and will never finish.
> However, placing the cgroup_bpf release works on cgroup_destroy_wq can never
> block the smp_call_on_cpu work, which means the loop is broken. Thus, it
> solves the problem.
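
To restate the congestion part in code: smp_call_on_cpu() queues an on-stack
work on system_wq and then blocks until a system_wq worker has run it,
roughly like this (trimmed and quoted from memory, so take the details with a
grain of salt):

int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
{
        struct smp_call_on_cpu_struct sscs = {
                .done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
                .func = func,
                .data = par,
                .cpu  = phys ? cpu : -1,
        };

        INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);

        if (cpu >= nr_cpu_ids || !cpu_online(cpu))
                return -ENXIO;

        /* the work has to compete for system_wq's active slots ... */
        queue_work_on(cpu, system_wq, &sscs.work);
        /* ... and the caller blocks until a system_wq worker has run it */
        wait_for_completion(&sscs.done);
        destroy_work_on_stack(&sscs.work);

        return sscs.ret;
}

In the watchdog path the caller already holds cpu_hotplug_lock.read at this
point, so once all of the active slots are occupied by cgroup_bpf release
works that are themselves stuck on cgroup_mutex, that completion never comes.
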
Tejun,
do you have an opinion on this?
If there are certain limitations from the cgroup side on what can be done
in a generic work context, it would be nice to document them (e.g. don't
grab cgroup_mutex), but I still struggle to understand what exactly is
wrong with the blamed commit.
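
By "document" I mean something as small as a comment next to the lock
definition, e.g. (just a sketch of the idea, assuming the DEFINE_MUTEX in
kernel/cgroup/cgroup.c is still the place):

/*
 * Note: cgroup_mutex may be held across operations that take
 * cpu_hotplug_lock (e.g. cpuset_css_offline()).  Don't acquire it from
 * a work item running on a shared workqueue such as system_wq: anything
 * that waits on that workqueue while holding cpu_hotplug_lock (e.g. via
 * smp_call_on_cpu()) can then deadlock against us.
 */
DEFINE_MUTEX(cgroup_mutex);
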
Thanks,
Roman