linux-kernel - Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task

Message-ID: <ZRw8AvDt2wrDlKhG@P9FQF9L96D.corp.robot.car>
Date:   Tue, 3 Oct 2023 09:06:26 -0700
From:   Roman Gushchin <roman.gushchin@...ux.dev>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        cgroups@...r.kernel.org, Michal Hocko <mhocko@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Dennis Zhou <dennis@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH rfc 2/5] mm: kmem: add direct objcg pointer to task_struct

On Tue, Oct 03, 2023 at 10:22:55AM -0400, Johannes Weiner wrote:
> On Mon, Oct 02, 2023 at 03:03:48PM -0700, Roman Gushchin wrote:
> > On Mon, Oct 02, 2023 at 04:12:54PM -0400, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 08:08:29AM -0700, Roman Gushchin wrote:
> > > > @@ -3001,6 +3001,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > > >  	return objcg;
> > > >  }
> > > >  
> > > > +static DEFINE_SPINLOCK(current_objcg_lock);
> > > > +
> > > > +static struct obj_cgroup *current_objcg_update(struct obj_cgroup *old)
> > > > +{
> > > > +	struct mem_cgroup *memcg;
> > > > +	struct obj_cgroup *objcg;
> > > > +	unsigned long flags;
> > > > +
> > > > +	old = current_objcg_clear_update_flag(old);
> > > > +	if (old)
> > > > +		obj_cgroup_put(old);
> > > > +
> > > > +	spin_lock_irqsave(&current_objcg_lock, flags);
> > > > +	rcu_read_lock();
> > > > +	memcg = mem_cgroup_from_task(current);
> > > > +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > > +		objcg = rcu_dereference(memcg->objcg);
> > > > +		if (objcg && obj_cgroup_tryget(objcg))
> > > > +			break;
> > > > +		objcg = NULL;
> > > > +	}
> > > > +	rcu_read_unlock();
> > > 
> > > Can this tryget() actually fail when this is called on the current
> > > task during fork() and attach()? A cgroup cannot be offlined while
> > > there is a task in it.
> > 
> > Highly theoretically it can if it races against a migration of the current
> > task to another memcg and the previous memcg is getting offlined.
> 
> Ah right, if this runs between css_set_move_task() and ->attach(). The
> cache would be briefly updated to a parent in the old hierarchy, but
> then quickly reset from the ->attach().

Even simpler:
	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
---------
	Here the task can be moved to another memcg and the previous one
	can be offlined, making objcg fully detached.
---------
	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
		objcg = rcu_dereference(memcg->objcg);
		if (objcg && obj_cgroup_tryget(objcg))
---------
	Objcg can be NULL here, or it can be non-NULL but lose its last reference
	between the objcg check and obj_cgroup_tryget().
---------
			break;
		objcg = NULL;
	}
	rcu_read_unlock();
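
To illustrate why the loop has to tolerate a failing tryget, here is a minimal
userspace sketch. It assumes a plain C11 atomic counter in place of the percpu
reference that the kernel uses, and every name in it (ref_obj, level,
ref_obj_tryget, first_live_obj, tryget_demo) is hypothetical. Unlike the loop
above, it walks until the parent pointer is NULL rather than stopping at a root
object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as an objcg. */
struct ref_obj {
	atomic_long refcnt;	/* 0: the last reference has been dropped */
};

/* One level of a hierarchy; obj may be NULL, parent is NULL at the top. */
struct level {
	struct ref_obj *obj;
	struct level *parent;
};

/* Take a reference only if the count has not already reached zero. */
static bool ref_obj_tryget(struct ref_obj *obj)
{
	long old = atomic_load(&obj->refcnt);

	while (old > 0) {
		/* On failure 'old' is refreshed and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Walk towards the top until some level hands out a reference. */
static struct ref_obj *first_live_obj(struct level *lvl)
{
	for (; lvl; lvl = lvl->parent) {
		struct ref_obj *obj = lvl->obj;

		if (obj && ref_obj_tryget(obj))
			return obj;
	}
	return NULL;
}

/* Scenario: the child's object was released before we got to it. */
static long tryget_demo(void)
{
	struct ref_obj dead, live;
	struct level parent = { &live, NULL };
	struct level child = { &dead, &parent };

	atomic_init(&dead.refcnt, 0);
	atomic_init(&live.refcnt, 1);

	assert(!ref_obj_tryget(&dead));			/* too late for this one */
	assert(first_live_obj(&child) == &live);	/* falls back to the parent */
	return atomic_load(&live.refcnt);		/* original + new reference */
}
```

Calling tryget_demo() from a main() exercises the fallback: the object at the
child level can no longer be pinned, so the walk ends up holding a reference
on the parent's object instead.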

> 
> Can you please add a comment along these lines?

Sure, will do.

> 
> > It actually might make sense to apply the same approach for memcgs as well
> > (saving a lazily-updating memcg pointer on task_struct). Then it will be
> > possible to ditch this "for" loop. But I need some time to master the code
> > and run benchmarks. Idk if it will make enough difference to justify the change.
> 
> Yeah the memcg pointer is slightly less attractive from an
> optimization POV because it already is a pretty direct pointer from
> task through the cset array.
> 
> If you still want to look into it from a simplification POV that
> sounds reasonable, but IMO it would be fine with a comment.

I'll come back with some numbers; it's hard to speculate without them. In this
case the majority of savings came from not bumping and decreasing a percpu objcg
refcounter on the slab allocation path - that was quite surprising to me.

> 
> > > > @@ -6345,6 +6393,22 @@ static void mem_cgroup_move_task(void)
> > > >  		mem_cgroup_clear_mc();
> > > >  	}
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +static void mem_cgroup_fork(struct task_struct *task)
> > > > +{
> > > > +	task->objcg = (struct obj_cgroup *)0x1;
> > > 
> > > dup_task_struct() will copy this pointer from the old task. Would it
> > > be possible to bump the refcount here instead? That would save quite a
> > > bit of work during fork().
> > 
> > Yeah, it should be possible. It won't save a lot, but I agree it makes
> > sense. I'll take a look and will prepare a separate patch for this.
> 
> I guess the hairiest part would be synchronizing against a migration
> because all these cgroup core callbacks are unlocked.

Yep.

> 
> Would it make sense to add ->fork_locked() and ->attach_locked()
> callbacks that are dispatched under the css_set_lock? Then this could
> be a simple if (p && !(p & 0x1)) obj_cgroup_get(), which would
> certainly be nice to workloads where fork() is hot, with little
> downside otherwise.

Maybe, but then the question is whether it's really worth it. In the final
version the update path doesn't need a spinlock, so it's quite cheap and happens
only once, on the first allocation. I'll take a look, though.
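
For reference, the two fork-time options discussed here can be sketched in
userspace roughly as follows. This is a single-threaded illustration only: all
names (objcg_stub, task_stub, STUB_UPDATE_FLAG and the stub_* helpers) are
hypothetical, the counter is a plain long, the "resolved" argument stands in
for the hierarchy walk, and there is no synchronization against migration,
which is exactly the hard part in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STUB_UPDATE_FLAG ((uintptr_t)0x1)	/* hypothetical name for the 0x1 marker */

struct objcg_stub { long refcnt; };		/* plain counter, not a percpu ref */
struct task_stub { struct objcg_stub *objcg; };	/* cached pointer, low bit = stale */

static int stub_is_stale(const struct objcg_stub *p)
{
	return ((uintptr_t)p & STUB_UPDATE_FLAG) != 0;
}

static struct objcg_stub *stub_clear_flag(struct objcg_stub *p)
{
	return (struct objcg_stub *)((uintptr_t)p & ~STUB_UPDATE_FLAG);
}

/* Variant from the patch hunk: the child starts with a stale, empty cache. */
static void stub_fork_reset(struct task_stub *child)
{
	child->objcg = (struct objcg_stub *)STUB_UPDATE_FLAG;
}

/* Variant suggested in review: keep the copied pointer and pin it, if valid. */
static void stub_fork_inherit(struct task_stub *child)
{
	struct objcg_stub *p = child->objcg;	/* copied from the parent */

	if (p && !stub_is_stale(p))
		p->refcnt++;
}

/* Allocation path: refresh the cache only if it is marked stale. */
static struct objcg_stub *stub_current_objcg(struct task_stub *t,
					     struct objcg_stub *resolved)
{
	if (stub_is_stale(t->objcg)) {
		struct objcg_stub *old = stub_clear_flag(t->objcg);

		if (old)
			old->refcnt--;		/* drop the outdated reference */
		resolved->refcnt++;		/* pin the freshly resolved one */
		t->objcg = resolved;
	}
	return t->objcg;
}

static long lazy_update_demo(void)
{
	struct objcg_stub cg = { 1 };		/* one reference held by the parent */
	struct task_stub parent = { &cg };
	struct task_stub child_a = parent, child_b = parent;

	stub_fork_inherit(&child_a);		/* pins cg right at fork time */
	assert(child_a.objcg == &cg);

	stub_fork_reset(&child_b);		/* defers the work */
	assert(stub_current_objcg(&child_b, &cg) == &cg);	/* first allocation */
	assert(stub_current_objcg(&child_b, &cg) == &cg);	/* cache hit, no update */
	return cg.refcnt;
}
```

In this sketch the reset variant pays for the update once, on the first
stub_current_objcg() call, and later calls only test the low bit. The inherit
variant skips that update but touches the counter during fork.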

I think the bigger question I have here (and it's probably worth an LSF/MM/BPF
or Plumbers discussion) is: what if we introduce a cgroup mount (or even Kconfig)
option to prohibit moving tasks between cgroups and rely solely on fork to enter
the right cgroup (a la namespaces)? I'm starting to think that this is the right
path long-term: things will not only be more reliable, but we can also ditch a
lot of synchronization and get better performance. Obviously not a small project.

Thanks!