linux-kernel - Re: [PATCH RFC 01/14] mm: memcg: subpage charging API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190917183308.GA9776@castle>
Date:   Tue, 17 Sep 2019 18:33:15 +0000
From:   Roman Gushchin <guro@...com>
To:     Johannes Weiner <hannes@...xchg.org>
CC:     "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Michal Hocko <mhocko@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "Shakeel Butt" <shakeelb@...gle.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Waiman Long <longman@...hat.com>
Subject: Re: [PATCH RFC 01/14] mm: memcg: subpage charging API

On Tue, Sep 17, 2019 at 10:50:04AM +0200, Johannes Weiner wrote:
> On Tue, Sep 17, 2019 at 02:27:19AM +0000, Roman Gushchin wrote:
> > On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> > > On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > > > Introduce an API to charge subpage objects to the memory cgroup.
> > > > The API will be used by the new slab memory controller. Later it
> > > > can also be used to implement percpu memory accounting.
> > > > In both cases, a single page can be shared between multiple cgroups
> > > > (and in percpu case a single allocation is split over multiple pages),
> > > > so it's not possible to use page-based accounting.
> > > > 
> > > > The implementation is based on percpu stocks. Memory cgroups are still
> > > > charged in pages, and the residue is stored in perpcu stock, or on the
> > > > memcg itself, when it's necessary to flush the stock.
> > > 
> > > Did you just implement a slab allocator for page_counter to track
> > > memory consumed by the slab allocator?
> > 
> > :)
> > 
> > > 
> > > > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> > > >  }
> > > >  
> > > >  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > -		      unsigned int nr_pages)
> > > > +		      unsigned int amount, bool subpage)
> > > >  {
> > > > +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> > > >  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> > > >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > >  	struct mem_cgroup *mem_over_limit;
> > > > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  	if (mem_cgroup_is_root(memcg))
> > > >  		return 0;
> > > >  retry:
> > > > -	if (consume_stock(memcg, nr_pages))
> > > > +	if (subpage && consume_subpage_stock(memcg, amount))
> > > > +		return 0;
> > > > +	else if (!subpage && consume_stock(memcg, nr_pages))
> > > >  		return 0;
> > > 
> > > The layering here isn't clean. We have an existing per-cpu cache to
> > > batch-charge the page counter. Why does the new subpage allocator not
> > > sit on *top* of this, instead of wedged in between?
> > > 
> > > I think what it should be is a try_charge_bytes() that simply gets one
> > > page from try_charge() and then does its byte tracking, regardless of
> > > how try_charge() chooses to implement its own page tracking.
> > > 
> > > That would avoid the awkward @amount + @subpage multiplexing, as well
> > > as annotating all existing callsites of try_charge() with a
> > > non-descript "false" parameter.
> > > 
> > > You can still reuse the stock data structures, use the lower bits of
> > > stock->nr_bytes for a different cgroup etc., but the charge API should
> > > really be separate.
> > 
> > Hm, I kinda like the idea, however there is a complication: for the subpage
> > accounting the css reference management is done in a different way, so that
> > all existing code should avoid changing the css refcounter. So I'd need
> > to pass a boolean argument anyway.
> 
> Can you elaborate on the refcounting scheme? I don't quite understand
> how there would be complications with that.
> 
> Generally, references are held for each page that is allocated in the
> page_counter. try_charge() allocates a batch of css references,
> returns one and keeps the rest in stock.
> 
> So couldn't the following work?
> 
> When somebody allocates a subpage, the css reference returned by
> try_charge() is shared by the allocated subpage object and the
> remainder that is kept via stock->subpage_cache and stock->nr_bytes
> (or memcg->nr_stocked_bytes when the percpu cache is reset).

Because individual objects are a subject of reparenting and can outlive
the origin memory cgroup, they shouldn't hold a direct reference to the
memory cgroup. Instead they hold a reference to the mem_cgroup_ptr object,
and this objects holds a single reference to the memory cgroup.
Underlying pages shouldn't hold a reference too.

Btw, it's already true, just kmem_cache plays the role of such intermediate
object, and we do an explicit transfer of charge (look at memcg_charge_slab()).
So we initially associate a page with the memcg, and almost immediately
after break this association and insert kmem_cache in between.

But with subpage accounting it's not possible, as a page is shared between
multiple cgroups, and it can't be attributed to any specific cgroup at
any time.

> 
> When the subpage objects are freed, you'll eventually have a full page
> again in stock->nr_bytes, at which point you page_counter_uncharge()
> paired with css_put(_many) as per usual.
> 
> A remainder left in old->nr_stocked_bytes would continue to hold on to
> one css reference. (I don't quite understand who is protecting this
> remainder in your current version, actually. A bug?)
> 
> Instead of doing your own batched page_counter uncharging in
> refill_subpage_stock() -> drain_subpage_stock(), you should be able to
> call refill_stock() when stock->nr_bytes adds up to a whole page again.
> 
> Again, IMO this would be much cleaner architecture if there was a
> try_charge_bytes() byte allocator that would sit on top of a cleanly
> abstracted try_charge() page allocator, just like the slab allocator
> is sitting on top of the page allocator - instead of breaking through
> the abstraction layer of the underlying page allocator.
> 

As I said, I like the idea to put it on top, but it can't be put on top
without changes in css refcounting (or I don't see how). I don't know
how to mix stocks which are holding css references and which are not,
so I might end up with two stocks as in current implementation. Then
the idea of having another layer of caching on top looks slightly less
appealing, but maybe still worth a try.

Thanks!