linux-kernel - Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181113191541.GM2509588@devbig004.ftw2.facebook.com>
Date:   Tue, 13 Nov 2018 11:15:41 -0800
From:   Tejun Heo <tj@...nel.org>
To:     Roman Gushchin <guro@...com>
Cc:     Roman Gushchin <guroan@...il.com>, Oleg Nesterov <oleg@...hat.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

Hello, Roman.

On Tue, Nov 13, 2018 at 06:47:55PM +0000, Roman Gushchin wrote:
> > > +	/* Should the cgroup and its descendants be frozen. */
> > > +	bool freeze;
> > 
> > Why not have this in freezer too?
> 
> I thought that this variable is just the state of the cgroup.freeze knob,
> where the freezer field contains the internal state of the freezer, and
> can in theory be allocated dynamically.
> 
> Not a strong preference, I can move it there too, if you prefer to.

Yeah, let's just put it together.

> > > +void cgroup_freezer_enter(void);
> > > +void cgroup_freezer_leave(void);
> > 
> > So, if we use freeze, freezing, frozen instead, the aboves can be
> > cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).
> 
> Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?

Sure.

> > > +	/* task is in the cgroup freezer loop */
> > 
> > The above comment isn't strictly true, right?
> 
> Why so?
> 
> It actually means that the task is looping somewhere in the signal delivery loop
> after entering cgroup_freezer_enter() and before cgroup_freezer_leave().
> 
> Maybe simple "task is frozen by the cgroup freezer"?

Yeah, sounds good.

> > > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> > >  			cset->nr_tasks++;
> > >  			css_set_move_task(child, NULL, cset, false);
> > >  		}
> > > +
> > > +		if (unlikely(cgroup_frozen(child) &&
> > > +			     (child->flags & ~PF_KTHREAD))) {
> > 
> > I don't think we need explicit PF_KTHREAD test here.  We don't allow
> > kthreads in non-root cgroups anyway and if we wanna change that there
> > are a bunch of other things which need updating anyway.
> 
> Don't we? I think we do. I've proposed a patch to fix this some time ago
> (https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.

Ah, right, I thought that went in.  Oh well, let's keep the test then.

> > > +	/*
> > > +	 * Did we race with fork() or exit()? Np, everything is still frozen.
> > > +	 */
> > > +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > > +		return;
> > > +
> > > +	if (frozen)
> > > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > > +	else
> > > +		clear_bit(CGRP_FROZEN, &cgrp->flags);
> > 
> > I'm not sure this is wrong but it feels a bit weird to tie the actual
> > state transition to notification.  Wouldn't it be more
> > straight-forward if CGRP_FROZEN bit is purely determined by whether
> > the tasks are frozen or not and the notification just checks that
> > against the last notified state and generate a notification if they're
> > different?
> 
> So, maybe cgroup_notify_frozen() is not the best name, maybe
> cgroup_propagate_frozen() better reflects what's happening here.
> We need to recalc the state of ancestor cgroups, and we have to do it
> with cgroup_mutex held, this is why we do it from the delayed work
> context (on hot paths).

Can't we protect that state with css_set_lock too?  That's what task
states are protected by and the cgroup state is a mere aggregation of
task states.

> The first pat of the function can be probably separated and called
> immediately. Is this what you're suggesting?

Pretty much.  I think separating out state transitions and
notifications would make it more straightforward.

> > So that all these state transitions are synchronous with the actual
> > freezing events and we can just queue per-cgroup work items all the
> > way to the top if the new state is different from the last one
> > cgroup-by-cgroup?
> 
> Hm, Idk. Why it's better?

So, the pieces are - 1. task states, 2. cgroup states and
3. notifications.  The current code ties together #2 and #3 together
which is weird because #2 is a mere aggregation of #1.  Also, that
way, notifications become a lot more robust because whether to
generate a notification or not can be solely determined from #2
flipping.  ie. sth like the following

	change_task_frozen_state()
	{
		update counters
		if (cgroup state needs to change) {
			change cgroup state;
			queue notification work;
			repeat for the parent;
		}
	}

where notification work always notifies should work and trivially
satisfies the requirement (there should be at least one notification
since the last state transition) without any further work.  Wouldn't
this be easier and more robust?  The current code depends on
annotating each possible transition event, which is kinda fragile.

> > > +	if (lock_task_sighand(task, &flags)) {
> > > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > +		else
> > > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > 
> > How are these flags synchronized?
> 
> Using the css_set_lock.

But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?

Thanks.

-- 
tejun