linux-kernel - Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1fb4d8d7-ccc0-b020-715e-38c2dfd94c23@linaro.org>
Date:   Wed, 1 Jun 2022 16:37:17 -0700
From:   Tadeusz Struk <tadeusz.struk@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Michal Koutný <mkoutny@...e.com>,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
        Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Bui Quang Minh <minhquangbui99@...il.com>
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/1/22 16:20, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
>>> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>>>> // ref=A: initial state
>>>> kill_css()
>>>>     css_get // ref+=F == A+F: fuse
>>>>     percpu_ref_kill_and_confirm
>>>>       __percpu_ref_switch_to_atomic
>>>>         percpu_ref_get
>>>>           // ref += 1 == A+F+1: atomic mode, self-protection
>>>>       percpu_ref_put
>>>>         // ref -= 1 == A+F: kill the base reference
>>>>     [via rcu]
>>>>     percpu_ref_switch_to_atomic_rcu
>>>>       percpu_ref_call_confirm_rcu
>>>>         css_killed_ref_fn == refcnt.confirm_switch
>>>>           queue_work(css->destroy_work)        (1)
>>>>                                                        [via css->destroy_work]
>>>>                                                        css_killed_work_fn == wq.func
>>>>                                                          offline_css() // needs fuse
>>>>                                                          css_put // ref -= F == A: de-fuse
>>>>         percpu_ref_put
>>>>           // ref -= 1 == A-1: remove self-protection
>>>>           css_release                                   // A <= 1 -> 2nd queue_work explodes!
>>>
>>> I'm not sure I'm following it but it's perfectly fine to re-use the work
>>> item at this point. The work item actually can be re-cycled from the very
>>> beginning of the work function. The only thing we need to make sure is that
>>> we don't css_put() prematurely to avoid it being freed while we're using it.
>>
>> Yes, it is ok to reuse a work struct, but it's not ok to have the same
>> work struct enqueued twice on the same WQ when list debug is enabled.
>> That's why we are getting this "BUG: corrupted list.."
> 
> The above scenario isn't that tho. Once the work item starts executing, wq
> doesn't care about what happens to it and as killed_work_fn is holding a
> reference, the release scheduling shouldn't happen before it starts
> executing unless somebody is screwing up the refcnting.
> 
>> That's right. Michal was on the right track for the kill_css() part.
>> What I think is going on is that once css_create() fails then
>> cgroup_subtree_control_write() ends up calling first kill_css() and
>> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
>> The each_live_descendant_post() also iterates on the root.
>> Here is the call flow (sorry for long lines):
>>
>> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
>>    |
>>    |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
>>    |
>>    |  css_killed_ref_fn() <- first css->destroy_work enqueue
>>    |    |
>>    |    |->  INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
>>    |
>>    |
>>    |-> goto out_unlock;
>>    |     |
>>    |     |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
>>    |
>>    |
>>       css_release(percpu_ref) <- second css->destroy_work enqueue
>>         |
>>         |->  INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>>
>>
>> What seems to work for me as the simplest fix is to prevent enqueuing a dying
>> css in css_release() as below. Please let me know if that makes sense to you.
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1779ccddb734..5618211487cc 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>>   	struct cgroup_subsys_state *css =
>>   		container_of(ref, struct cgroup_subsys_state, refcnt);
>> -	INIT_WORK(&css->destroy_work, css_release_work_fn);
>> -	queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	if (!(css->flags & CSS_DYING)) {
>> +		INIT_WORK(&css->destroy_work, css_release_work_fn);
>> +		queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	}
> 
> When the problem is ref imbalance, how can above be the solution? Of course
> release path won't cause an issue if they don't run, but we still need to
> free the thing, right?

Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
on it. Only css_get() & css_put() modify the refcnt value.
And for the "free the thing" the css_killed_work_fn() does that.
It calls offline_css(css) and css_put(css) for the whole css hierarchy.

-- 
Thanks,
Tadeusz