[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d079b7d4-c538-8a50-3375-fab0d3a0f0e6@linaro.org>
Date: Tue, 7 Jun 2022 11:47:00 -0700
From: Tadeusz Struk <tadeusz.struk@...aro.org>
To: Michal Koutný <mkoutny@...e.com>
Cc: Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>,
Christian Brauner <brauner@...nel.org>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>, cgroups@...r.kernel.org,
netdev@...r.kernel.org, bpf@...r.kernel.org,
stable@...r.kernel.org, linux-kernel@...r.kernel.org,
syzbot+e42ae441c3b10acf9e9d@...kaller.appspotmail.com
Subject: Re: [PATCH v2] cgroup: serialize css kill and release paths
On 6/6/22 05:39, Michal Koutný wrote:
> On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk<tadeusz.struk@...aro.org> wrote:
>> In such scenario the css_killed_work_fn will be en-queued via
>> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
>> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
>> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
>> css_release_work_fn for the same css instance, causing a list_add
>> corruption bug, as can be seen in the syzkaller report [1].
> This hypothesis doesn't add up to me (I am sorry).
>
> The kill_css(css) would be a css associated with a subsys (css.ss !=
> NULL) whereas css_put(&cgrp->self) is a different css just for the
> cgroup (css.ss == NULL).
Yes, you are right. I couldn't figure it out where the extra css_put()
is called from, and the only place that fitted into my theory was from
the cgroup_kn_unlock() in cgroup_apply_control_disable().
After some more debugging I can see that, as you said, the cgrp->self
is a different css. The offending _put() is actually called by the
percpu_ref_kill_and_confirm(), as it not only calls the passed confirm_kill
percpu_ref_func_t, but also it puts the refcnt iself.
Because the cgroup_apply_control_disable() will loop for_each_live_descendant,
and call css_kill() on all css'es, and css_killed_work_fn() will also loop
and call css_put() on all parents, the css_release() will be called on the
first parent prematurely, causing the BUG(). What I think should be done
to balance put/get is to call css_get() for all the parents in kill_css():
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
*/
static void kill_css(struct cgroup_subsys_state *css)
{
+ struct cgroup_subsys_state *_css = css;
+
lockdep_assert_held(&cgroup_mutex);
if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
css_clear_dir(css);
/*
- * Killing would put the base ref, but we need to keep it alive
- * until after ->css_offline().
+ * Killing would put the base ref, but we need to keep it alive,
+ * and all its parents, until after ->css_offline().
*/
- css_get(css);
+ do {
+ css_get(_css);
+ _css = _css->parent;
+ } while (_css && atomic_read(&_css->online_cnt));
/*
* cgroup core guarantees that, by the time ->css_offline() is
This will be then "reverted" in css_killed_work_fn()
Please let me know if it makes sense to you.
I'm still testing it, but syzbot is very slow today.
--
Thanks,
Tadeusz
Powered by blists - more mailing lists