[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1fb4d8d7-ccc0-b020-715e-38c2dfd94c23@linaro.org>
Date: Wed, 1 Jun 2022 16:37:17 -0700
From: Tadeusz Struk <tadeusz.struk@...aro.org>
To: Tejun Heo <tj@...nel.org>
Cc: Michal Koutný <mkoutny@...e.com>,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>,
Bui Quang Minh <minhquangbui99@...il.com>
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path
On 6/1/22 16:20, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
>>> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>>>> // ref=A: initial state
>>>> kill_css()
>>>> css_get // ref+=F == A+F: fuse
>>>> percpu_ref_kill_and_confirm
>>>> __percpu_ref_switch_to_atomic
>>>> percpu_ref_get
>>>> // ref += 1 == A+F+1: atomic mode, self-protection
>>>> percpu_ref_put
>>>> // ref -= 1 == A+F: kill the base reference
>>>> [via rcu]
>>>> percpu_ref_switch_to_atomic_rcu
>>>> percpu_ref_call_confirm_rcu
>>>> css_killed_ref_fn == refcnt.confirm_switch
>>>> queue_work(css->destroy_work) (1)
>>>> [via css->destroy_work]
>>>> css_killed_work_fn == wq.func
>>>> offline_css() // needs fuse
>>>> css_put // ref -= F == A: de-fuse
>>>> percpu_ref_put
>>>> // ref -= 1 == A-1: remove self-protection
>>>> css_release // A <= 1 -> 2nd queue_work explodes!
>>>
>>> I'm not sure I'm following it but it's perfectly fine to re-use the work
>>> item at this point. The work item actually can be re-cycled from the very
>>> beginning of the work function. The only thing we need to make sure is that
>>> we don't css_put() prematurely to avoid it being freed while we're using it.
>>
>> Yes, it is ok to reuse a work struct, but it's not ok to have the same
>> work struct enqueued twice on the same WQ when list debug is enabled.
>> That's why we are getting this "BUG: corrupted list.."
>
> The above scenario isn't that tho. Once the work item starts executing, wq
> doesn't care about what happens to it and as killed_work_fn is holding a
> reference, the release scheduling shouldn't happen before it starts
> executing unless somebody is screwing up the refcnting.
>
>> That's right. Michal was on the right track for the kill_css() part.
>> What I think is going on is that once css_create() fails then
>> cgroup_subtree_control_write() ends up calling first kill_css() and
>> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
>> The each_live_descendant_post() also iterates on the root.
>> Here is the call flow (sorry for long lines):
>>
>> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
>> |
>> |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
>> |
>> | css_killed_ref_fn() <- first css->destroy_work enqueue
>> | |
>> | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
>> |
>> |
>> |-> goto out_unlock;
>> | |
>> | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
>> |
>> |
>> css_release(percpu_ref) <- second css->destroy_work enqueue
>> |
>> |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>>
>>
>> What seems to work for me as the simplest fix is to prevent enqueuing a dying
>> css in css_release() as below. Please let me know if that makes sense to you.
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1779ccddb734..5618211487cc 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>> struct cgroup_subsys_state *css =
>> container_of(ref, struct cgroup_subsys_state, refcnt);
>> - INIT_WORK(&css->destroy_work, css_release_work_fn);
>> - queue_work(cgroup_destroy_wq, &css->destroy_work);
>> + if (!(css->flags & CSS_DYING)) {
>> + INIT_WORK(&css->destroy_work, css_release_work_fn);
>> + queue_work(cgroup_destroy_wq, &css->destroy_work);
>> + }
>
> When the problem is ref imbalance, how can above be the solution? Of course
> release path won't cause an issue if they don't run, but we still need to
> free the thing, right?
Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
on it. Only css_get() & css_put() modify the refcnt value.
And for the "free the thing" the css_killed_work_fn() does that.
It calls offline_css(css) and css_put(css) for the whole css hierarchy.
--
Thanks,
Tadeusz
Powered by blists - more mailing lists