linux-kernel - Re: [PATCH v3 06/19] mm: memcg/slab: obj

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200512225645.GA488426@cmpxchg.org>
Date:   Tue, 12 May 2020 18:56:45 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Roman Gushchin <guro@...com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>, linux-mm@...ck.org,
        kernel-team@...com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API

On Thu, May 07, 2020 at 03:26:31PM -0700, Roman Gushchin wrote:
> On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > >  }
> > >  
> > >  #ifdef CONFIG_MEMCG_KMEM
> > > +extern spinlock_t css_set_lock;
> > > +
> > > +static void obj_cgroup_release(struct percpu_ref *ref)
> > > +{
> > > +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > > +	struct mem_cgroup *memcg;
> > > +	unsigned int nr_bytes;
> > > +	unsigned int nr_pages;
> > > +	unsigned long flags;
> > > +
> > > +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > > +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > > +	nr_pages = nr_bytes >> PAGE_SHIFT;
> > 
> > What guarantees that we don't have a partial page in there at this
> > point? I guess any outstanding allocations would pin the objcg, so
> > when it's released all objects have been freed.
> 
> Right, this is exactly the reason why there can't be a partial page
> at this point.
> 
> > 
> > But if that's true, how can we have full pages remaining in there now?
> 
> Imagine the following sequence:
> 1) CPU0: objcg == stock->cached_objcg
> 2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged
> 3) CPU1: a process from another memcg is allocating something, stock if flushed,
>    objcg->nr_charged_bytes = PAGE_SIZE - 92
> 5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes
> 6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes
> 
> In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged
> in obj_cgroup_release().
> 
> I've double checked this, it's actually pretty easy to trigger in the real life.

Ah, so no outstanding allocations, but a full page split between the
percpu cache and objcg->nr_charged_bytes.

Would it simplify things if refill_obj_stock() drained on >= PAGE_SIZE
stock instead of > PAGE_SIZE?

Otherwise, the scenario above would be good to have as a comment as
the drain on release is not self-explanatory.

> > > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > > +{
> > > +	struct mem_cgroup *memcg;
> > > +	unsigned int nr_pages, nr_bytes;
> > > +	int ret;
> > > +
> > > +	if (consume_obj_stock(objcg, size))
> > > +		return 0;
> > > +
> > > +	rcu_read_lock();
> > > +	memcg = obj_cgroup_memcg(objcg);
> > > +	css_get(&memcg->css);
> > > +	rcu_read_unlock();
> > > +
> > > +	nr_pages = size >> PAGE_SHIFT;
> > > +	nr_bytes = size & (PAGE_SIZE - 1);
> > > +
> > > +	if (nr_bytes)
> > > +		nr_pages += 1;
> > > +
> > > +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > 
> > If consume_obj_stock() fails because some other memcg is cached,
> > should this try to consume the partial page in objcg->nr_charged_bytes
> > before getting more pages?
> 
> We can definitely do it, but I'm not sure if it's good for the performance.
> 
> Dealing with nr_charged_bytes will require up to two atomic writes,
> so calling __memcg_kmem_charge() can be faster if memcg is cached
> on percpu stock.

Hm, but it's the slowpath. And sooner or later somebody has to deal
with the remaining memory in there.

> > > +	if (!ret && nr_bytes)
> > > +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> > 
> > This will put the cgroup into the cache if the allocation resulted in
> > a partially free page.
> > 
> > But if this was a page allocation, we may have objcg->nr_cache_bytes
> > from a previous subpage allocation that we should probably put back
> > into the stock.
> 
> Yeah, we can do this, but I don't know if there will be any benefits.

It's mostly about understanding the code.

> Actually we don't wanna to touch objcg->nr_cache_bytes too often, as
> it can become a contention point if there are many threads allocating
> in the memory cgroup.
> 
> So maybe we want to do the opposite: relax it a bit and stop flushing
> it on every stock refill and flush only if it exceeds a certain value.

That could be useful, yes.

> > It's not a common case, I'm just trying to understand what
> > objcg->nr_cache_bytes holds and when it does so.
> 
> So it's actually a centralized leftover from the rounding of the actual
> charge to the page size.

It would be good to add code comments explaining this.

Thanks!