linux-kernel - Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181113205507.GB15590@tower.DHCP.thefacebook.com>
Date:   Tue, 13 Nov 2018 20:55:11 +0000
From:   Roman Gushchin <guro@...com>
To:     Tejun Heo <tj@...nel.org>
CC:     Roman Gushchin <guroan@...il.com>, Oleg Nesterov <oleg@...hat.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        "Andrew Morton" <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

On Tue, Nov 13, 2018 at 11:15:41AM -0800, Tejun Heo wrote:
> Hello, Roman.
> 
> On Tue, Nov 13, 2018 at 06:47:55PM +0000, Roman Gushchin wrote:
> > > > +	/* Should the cgroup and its descendants be frozen. */
> > > > +	bool freeze;
> > > 
> > > Why not have this in freezer too?
> > 
> > I thought that this variable is just the state of the cgroup.freeze knob,
> > where the freezer field contains the internal state of the freezer, and
> > can in theory be allocated dynamically.
> > 
> > Not a strong preference, I can move it there too, if you prefer to.
> 
> Yeah, let's just put it together.
> 
> > > > +void cgroup_freezer_enter(void);
> > > > +void cgroup_freezer_leave(void);
> > > 
> > > So, if we use freeze, freezing, frozen instead, the aboves can be
> > > cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).
> > 
> > Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?
> 
> Sure.
> 
> > > > +	/* task is in the cgroup freezer loop */
> > > 
> > > The above comment isn't strictly true, right?
> > 
> > Why so?
> > 
> > It actually means that the task is looping somewhere in the signal delivery loop
> > after entering cgroup_freezer_enter() and before cgroup_freezer_leave().
> > 
> > Maybe simple "task is frozen by the cgroup freezer"?
> 
> Yeah, sounds good.
> 
> > > > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> > > >  			cset->nr_tasks++;
> > > >  			css_set_move_task(child, NULL, cset, false);
> > > >  		}
> > > > +
> > > > +		if (unlikely(cgroup_frozen(child) &&
> > > > +			     (child->flags & ~PF_KTHREAD))) {
> > > 
> > > I don't think we need explicit PF_KTHREAD test here.  We don't allow
> > > kthreads in non-root cgroups anyway and if we wanna change that there
> > > are a bunch of other things which need updating anyway.
> > 
> > Don't we? I think we do. I've proposed a patch to fix this some time ago
> > (https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.
> 
> Ah, right, I thought that went in.  Oh well, let's keep the test then.
> 
> > > > +	/*
> > > > +	 * Did we race with fork() or exit()? Np, everything is still frozen.
> > > > +	 */
> > > > +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > > > +		return;
> > > > +
> > > > +	if (frozen)
> > > > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > > > +	else
> > > > +		clear_bit(CGRP_FROZEN, &cgrp->flags);
> > > 
> > > I'm not sure this is wrong but it feels a bit weird to tie the actual
> > > state transition to notification.  Wouldn't it be more
> > > straight-forward if CGRP_FROZEN bit is purely determined by whether
> > > the tasks are frozen or not and the notification just checks that
> > > against the last notified state and generate a notification if they're
> > > different?
> > 
> > So, maybe cgroup_notify_frozen() is not the best name, maybe
> > cgroup_propagate_frozen() better reflects what's happening here.
> > We need to recalc the state of ancestor cgroups, and we have to do it
> > with cgroup_mutex held, this is why we do it from the delayed work
> > context (on hot paths).
> 
> Can't we protect that state with css_set_lock too?  That's what task
> states are protected by and the cgroup state is a mere aggregation of
> task states.

See below.

> 
> > The first pat of the function can be probably separated and called
> > immediately. Is this what you're suggesting?
> 
> Pretty much.  I think separating out state transitions and
> notifications would make it more straightforward.
> 
> > > So that all these state transitions are synchronous with the actual
> > > freezing events and we can just queue per-cgroup work items all the
> > > way to the top if the new state is different from the last one
> > > cgroup-by-cgroup?
> > 
> > Hm, Idk. Why it's better?
> 
> So, the pieces are - 1. task states, 2. cgroup states and
> 3. notifications.  The current code ties together #2 and #3 together
> which is weird because #2 is a mere aggregation of #1.  Also, that
> way, notifications become a lot more robust because whether to
> generate a notification or not can be solely determined from #2
> flipping.  ie. sth like the following
> 
> 	change_task_frozen_state()
> 	{
> 		update counters
> 		if (cgroup state needs to change) {
> 			change cgroup state;
> 			queue notification work;
> 			repeat for the parent;
> 		}
> 	}
> 
> where notification work always notifies should work and trivially
> satisfies the requirement (there should be at least one notification
> since the last state transition) without any further work.  Wouldn't
> this be easier and more robust?  The current code depends on
> annotating each possible transition event, which is kinda fragile.

Ok, it looks like I can additionally synchronize descendant counters
using css_set_lock, and then eliminate the whole thing with delayed
transitions/notifications.

Will do in v3.

> 
> > > > +	if (lock_task_sighand(task, &flags)) {
> > > > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > > > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > > +		else
> > > > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > > 
> > > How are these flags synchronized?
> > 
> > Using the css_set_lock.
> 
> But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?


But if we don't touch this bit anywhere else, should be fine, right?

Thanks!