linux-kernel - Re: [PATCH v6 06/19] mm: memcg/slab: obj

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200619213810.GA237539@carbon.dhcp.thefacebook.com>
Date:   Fri, 19 Jun 2020 14:38:10 -0700
From:   Roman Gushchin <guro@...com>
To:     Shakeel Butt <shakeelb@...gle.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Christoph Lameter <cl@...ux.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Linux MM <linux-mm@...ck.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Kernel Team <kernel-team@...com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API

On Fri, Jun 19, 2020 at 08:42:34AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@...com> wrote:
> >
> > Obj_cgroup API provides an ability to account sub-page sized kernel
> > objects, which potentially outlive the original memory cgroup.
> >
> > The top-level API consists of the following functions:
> >   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
> >   void obj_cgroup_get(struct obj_cgroup *objcg);
> >   void obj_cgroup_put(struct obj_cgroup *objcg);
> >
> >   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
> >   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> >
> >   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> >   struct obj_cgroup *get_obj_cgroup_from_current(void);
> >
> > Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> > reference counter. It substitutes a memory cgroup in places where
> > it's necessary to charge a custom amount of bytes instead of pages.
> >
> > All charged memory rounded down to pages is charged to the
> > corresponding memory cgroup using __memcg_kmem_charge().
> >
> > It implements reparenting: on memcg offlining it's getting reattached
> > to the parent memory cgroup. Each online memory cgroup has an
> > associated active object cgroup to handle new allocations and the list
> > of all attached object cgroups. On offlining of a cgroup this list is
> > reparented and for each object cgroup in the list the memcg pointer is
> > swapped to the parent memory cgroup. It prevents long-living objects
> > from pinning the original memory cgroup in the memory.
> >
> > The implementation is based on byte-sized per-cpu stocks. A sub-page
> > sized leftover is stored in an atomic field, which is a part of
> > obj_cgroup object. So on cgroup offlining the leftover is automatically
> > reparented.
> >
> > memcg->objcg is rcu protected.
> > objcg->memcg is a raw pointer, which is always pointing at a memory
> > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > the caller
> 
> What type of caller? The allocator?

Basically whoever uses the pointer. Is it better to s/caller/user?

> 
> > must ensure the lifetime of the cgroup, e.g. grab
> > rcu_read_lock or css_set_lock.
> >
> > Suggested-by: Johannes Weiner <hannes@...xchg.org>
> > Signed-off-by: Roman Gushchin <guro@...com>
> > ---
> >  include/linux/memcontrol.h |  51 +++++++
> >  mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 338 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 93dbc7f9d8b8..c69e66fe4f12 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -23,6 +23,7 @@
> >  #include <linux/page-flags.h>
> >
> >  struct mem_cgroup;
> > +struct obj_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  struct kmem_cache;
> > @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
> >         struct wb_completion done;      /* tracks in-flight foreign writebacks */
> >  };
> >
> > +/*
> > + * Bucket for arbitrarily byte-sized objects charged to a memory
> > + * cgroup. The bucket can be reparented in one piece when the cgroup
> > + * is destroyed, without having to round up the individual references
> > + * of all live memory objects in the wild.
> > + */
> > +struct obj_cgroup {
> > +       struct percpu_ref refcnt;
> > +       struct mem_cgroup *memcg;
> > +       atomic_t nr_charged_bytes;
> 
> So, we still charge the mem page counter in pages but keep the
> remaining sub-page slack charge in nr_charge_bytes, right?

Kind of. The remainder is usually kept in a per-cpu stock,
but if the stock has to be flushed, it's getting flushed to nr_charge_bytes.

> 
> > +       union {
> > +               struct list_head list;
> > +               struct rcu_head rcu;
> > +       };
> > +};
> > +
> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -301,6 +318,8 @@ struct mem_cgroup {
> >         int kmemcg_id;
> >         enum memcg_kmem_state kmem_state;
> >         struct list_head kmem_caches;
> > +       struct obj_cgroup __rcu *objcg;
> > +       struct list_head objcg_list;
> >  #endif
> >
> [snip]
> > +
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > +                                 struct mem_cgroup *parent)
> > +{
> > +       struct obj_cgroup *objcg, *iter;
> > +
> > +       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> > +
> > +       spin_lock_irq(&css_set_lock);
> > +
> > +       /* Move active objcg to the parent's list */
> > +       xchg(&objcg->memcg, parent);
> > +       css_get(&parent->css);
> > +       list_add(&objcg->list, &parent->objcg_list);
> 
> So, memcg->objcs_list will always only contain the offlined
> descendants objcgs. I would recommend to rename objcg_list to clearly
> show that. Maybe offlined_objcg_list or descendants_objcg_list or
> something else.

Right. Let me add a comment for now and think of a better name.

> 
> > +
> > +       /* Move already reparented objcgs to the parent's list */
> > +       list_for_each_entry(iter, &memcg->objcg_list, list) {
> > +               css_get(&parent->css);
> > +               xchg(&iter->memcg, parent);
> > +               css_put(&memcg->css);
> > +       }
> > +       list_splice(&memcg->objcg_list, &parent->objcg_list);
> > +
> > +       spin_unlock_irq(&css_set_lock);
> > +
> > +       percpu_ref_kill(&objcg->refcnt);
> > +}
> > +
> >  /*
> [snip]
> >
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (unlikely(!current->mm))
> > +               return NULL;
> 
> I have not seen the users of this function yet but shouldn't the above
> check be (!current->mm && !current->active_memcg)?

Yes, good catch, it might save a couple of cycles if
current->mm == current->active_memcg == NULL. Adding.

> 
> Do we need a mem_cgroup_disabled() check as well?

As now both call sides are guarded by memcg_kmem_enabled(),
so we don't need it.

But maybe it's a good target for some refactorings,
e.g. moving !current->mm and !current->active_memcg checks out
of memcg_kmem_bypass(). And _maybe_ it's better to move memcg_kmem_enabled()
here, but I'm not sure.

> 
> > +
> > +       rcu_read_lock();
> > +       if (unlikely(current->active_memcg))
> > +               memcg = rcu_dereference(current->active_memcg);
> > +       else
> > +               memcg = mem_cgroup_from_task(current);
> > +
> > +       for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +               objcg = rcu_dereference(memcg->objcg);
> > +               if (objcg && obj_cgroup_tryget(objcg))
> > +                       break;
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       return objcg;
> > +}
> > +
> [...]
> > +
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > +{
> > +       struct memcg_stock_pcp *stock;
> > +       unsigned long flags;
> > +
> > +       local_irq_save(flags);
> > +
> > +       stock = this_cpu_ptr(&memcg_stock);
> > +       if (stock->cached_objcg != objcg) { /* reset if necessary */
> > +               drain_obj_stock(stock);
> > +               obj_cgroup_get(objcg);
> > +               stock->cached_objcg = objcg;
> > +               stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> > +       }
> > +       stock->nr_bytes += nr_bytes;
> > +
> > +       if (stock->nr_bytes > PAGE_SIZE)
> > +               drain_obj_stock(stock);
> 
> The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> just PAGE_SIZE is too less for obj stock.

It works on top of the current stock of 32 pages, so it can grab these
32 pages without any atomic operations. And it should be easy to increase
this limit if we'll see any benefits.

Thank you for looking into the patchset!

Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
Thanks!

--

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ed3af71a6fb..2499f78cf32d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,7 +326,7 @@ struct mem_cgroup {
        int kmemcg_id;
        enum memcg_kmem_state kmem_state;
        struct obj_cgroup __rcu *objcg;
-       struct list_head objcg_list;
+       struct list_head objcg_list; /* list of inherited objcgs */
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70cd44b28db1..9f14b91700d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2843,7 +2843,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
        struct obj_cgroup *objcg = NULL;
        struct mem_cgroup *memcg;
 
-       if (unlikely(!current->mm))
+       if (unlikely(!current->mm && !current->active_memcg))
                return NULL;
 
        rcu_read_lock();