Message-ID: <CALvZod4oOddDvuvuXyp=p2Dq=h354a-D72daagfya_Ewp_ggSA@mail.gmail.com>
Date:   Thu, 20 Jun 2019 07:39:36 -0700
From:   Shakeel Butt <shakeelb@...gle.com>
To:     Waiman Long <longman@...hat.com>
Cc:     Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Michal Hocko <mhocko@...nel.org>, Roman Gushchin <guro@...com>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>
Subject: Re: [PATCH v2] mm, memcg: Add a memcg_slabinfo debugfs file

On Thu, Jun 20, 2019 at 7:24 AM Waiman Long <longman@...hat.com> wrote:
>
> On 6/19/19 7:48 PM, Shakeel Butt wrote:
> > Hi Waiman,
> >
> > On Wed, Jun 19, 2019 at 10:16 AM Waiman Long <longman@...hat.com> wrote:
> >> There are concerns about memory leaks from extensive use of memory
> >> cgroups as each memory cgroup creates its own set of kmem caches. There
> >> is a possibility that the memcg kmem caches may remain even after the
> >> memory cgroups have been offlined. Therefore, it will be useful to show
> >> the status of each of memcg kmem caches.
> >>
> >> This patch introduces a new <debugfs>/memcg_slabinfo file which is
> >> somewhat similar to /proc/slabinfo in format, but lists only information
> >> about kmem caches that have child memcg kmem caches. Information
> >> available in /proc/slabinfo is not repeated in memcg_slabinfo.
> >>
> >> A portion of a sample output of the file was:
> >>
> >>   # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
> >>   rpc_inode_cache   root          13     51      1      1
> >>   rpc_inode_cache     48           0      0      0      0
> >>   fat_inode_cache   root           1     45      1      1
> >>   fat_inode_cache     41           2     45      1      1
> >>   xfs_inode         root         770    816     24     24
> >>   xfs_inode           92          22     34      1      1
> >>   xfs_inode           88:dead      1     34      1      1
> >>   xfs_inode           89:dead     23     34      1      1
> >>   xfs_inode           85           4     34      1      1
> >>   xfs_inode           84           9     34      1      1
> >>
> >> The css id of the memcg is also listed. If a memcg is not online,
> >> the tag ":dead" will be attached as shown above.
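
The row format described above can be consumed from userspace. The following is a minimal sketch, assuming rows look exactly like the sample output; the struct `memcg_slab_row` and the function `parse_memcg_slab_row()` are hypothetical names for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One parsed row of memcg_slabinfo (hypothetical userspace struct). */
struct memcg_slab_row {
	char name[64];
	bool is_root;          /* css_id column was "root" */
	bool is_dead;          /* css_id column carried the ":dead" tag */
	unsigned long css_id;  /* meaningful only when !is_root */
	unsigned long active_objs, num_objs, active_slabs, num_slabs;
};

/* Returns 0 on success, -1 for header, blank, or malformed lines. */
static int parse_memcg_slab_row(const char *line, struct memcg_slab_row *row)
{
	char id[32];
	char *tag, *end;

	assert(line && row);

	/* Skip leading whitespace, then reject the "# <name> ..." header. */
	while (*line == ' ' || *line == '\t')
		line++;
	if (*line == '#' || *line == '\0' || *line == '\n')
		return -1;

	memset(row, 0, sizeof(*row));
	if (sscanf(line, "%63s %31s %lu %lu %lu %lu",
		   row->name, id, &row->active_objs, &row->num_objs,
		   &row->active_slabs, &row->num_slabs) != 6)
		return -1;

	if (strcmp(id, "root") == 0) {
		row->is_root = true;
		return 0;
	}

	/* A child row is "<css_id>" or "<css_id>:dead". */
	tag = strchr(id, ':');
	if (tag) {
		if (strcmp(tag, ":dead") != 0)
			return -1;
		row->is_dead = true;
		*tag = '\0';
	}

	row->css_id = strtoul(id, &end, 10);
	if (end == id || *end != '\0')
		return -1;
	return 0;
}
```

A caller could read the file line by line with fgets() and count the rows where is_dead is set to estimate how many caches belong to offlined memcgs.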
> >>
> >> Suggested-by: Shakeel Butt <shakeelb@...gle.com>
> >> Signed-off-by: Waiman Long <longman@...hat.com>
> >> ---
> >>  mm/slab_common.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 57 insertions(+)
> >>
> >> diff --git a/mm/slab_common.c b/mm/slab_common.c
> >> index 58251ba63e4a..2bca1558a722 100644
> >> --- a/mm/slab_common.c
> >> +++ b/mm/slab_common.c
> >> @@ -17,6 +17,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/seq_file.h>
> >>  #include <linux/proc_fs.h>
> >> +#include <linux/debugfs.h>
> >>  #include <asm/cacheflush.h>
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/page.h>
> >> @@ -1498,6 +1499,62 @@ static int __init slab_proc_init(void)
> >>         return 0;
> >>  }
> >>  module_init(slab_proc_init);
> >> +
> >> +#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
> >> +/*
> >> + * Display information about kmem caches that have child memcg caches.
> >> + */
> >> +static int memcg_slabinfo_show(struct seq_file *m, void *unused)
> >> +{
> >> +       struct kmem_cache *s, *c;
> >> +       struct slabinfo sinfo;
> >> +
> >> +       mutex_lock(&slab_mutex);
> > On large machines there can be thousands of memcgs and potentially
> > each memcg can have hundreds of kmem caches. So, the slab_mutex can be
> > held for a very long time.
>
> But that is also what /proc/slabinfo does by doing mutex_lock() at
> slab_start() and mutex_unlock() at slab_stop(). So the same problem will
> happen when /proc/slabinfo is being read.
>
> When you are in a situation where reading /proc/slabinfo takes a long time
> because of the large number of memcgs, the system is in some kind of
> trouble anyway. I am not saying that we should not improve the scalability
> of this patch. It is just that some nasty race conditions may pop up if
> we release the lock and re-acquire it later. That will greatly
> complicate the code to handle all those edge cases.
>

We have been using that interface and implementation for a couple of
years and have not seen any race condition. However, I am fine with
what you have here for now. We can always come back if we think we
need to improve it.

> > Our internal implementation traverses the memcg tree and then
> > traverses 'memcg->kmem_caches' within the slab_mutex (and
> > cond_resched() after unlock).
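
The locking pattern described above (take the mutex for one group's caches, drop it, then yield) can be illustrated in userspace. This is only a sketch under stated assumptions: pthread_mutex_lock() stands in for slab_mutex, sched_yield() stands in for cond_resched(), and `struct group`, `struct cache`, `registry_lock` and `sum_active_objs()` are hypothetical names. It also assumes the array of groups does not change while the lock is dropped, which is exactly the kind of race the kernel code would have to handle.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-ins for a kmem cache and a memcg's cache list. */
struct cache {
	unsigned long active_objs;
};

struct group {
	struct cache *caches;
	size_t nr_caches;
};

/* Stand-in for slab_mutex. */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Sum active objects over all groups. The lock is held only while one
 * group's caches are walked, so the hold time is bounded by the size of
 * a single group rather than by the total number of caches.
 */
static unsigned long sum_active_objs(const struct group *groups,
				     size_t nr_groups)
{
	unsigned long total = 0;

	assert(groups || nr_groups == 0);

	for (size_t g = 0; g < nr_groups; g++) {
		pthread_mutex_lock(&registry_lock);
		for (size_t i = 0; i < groups[g].nr_caches; i++)
			total += groups[g].caches[i].active_objs;
		pthread_mutex_unlock(&registry_lock);

		/* Let other threads run before taking the lock again. */
		sched_yield();
	}
	return total;
}
```

The trade-off is that the result is no longer a consistent snapshot across groups, since other threads may modify caches between two lock acquisitions.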
> For cgroup v1, the setting of the CONFIG_SLUB_DEBUG option will allow
> you to iterate and display slabinfo just for that particular memcg. I am
> thinking of extending the debug controller to do a similar thing for
> cgroup v2.

I was also planning to look into that and it seems like you are
already on it. Do CC me the patches.

> >> +       seq_puts(m, "# <name> <css_id[:dead]> <active_objs> <num_objs>");
> >> +       seq_puts(m, " <active_slabs> <num_slabs>\n");
> >> +       list_for_each_entry(s, &slab_root_caches, root_caches_node) {
> >> +               /*
> >> +                * Skip kmem caches that don't have any memcg children.
> >> +                */
> >> +               if (list_empty(&s->memcg_params.children))
> >> +                       continue;
> >> +
> >> +               memset(&sinfo, 0, sizeof(sinfo));
> >> +               get_slabinfo(s, &sinfo);
> >> +               seq_printf(m, "%-17s root      %6lu %6lu %6lu %6lu\n",
> >> +                          cache_name(s), sinfo.active_objs, sinfo.num_objs,
> >> +                          sinfo.active_slabs, sinfo.num_slabs);
> >> +
> >> +               for_each_memcg_cache(c, s) {
> >> +                       struct cgroup_subsys_state *css;
> >> +                       char *dead = "";
> >> +
> >> +                       css = &c->memcg_params.memcg->css;
> >> +                       if (!(css->flags & CSS_ONLINE))
> >> +                               dead = ":dead";
> > Please note that Roman's kmem cache reparenting patch series has made
> > kmem caches of zombie memcgs a bit tricky. On memcg offlining, the
> > memcg kmem caches are reparented and the css->id can get recycled. So,
> > we want to know that a kmem cache is reparented and which memcg it
> > belonged to initially. To determine whether a kmem cache is reparented,
> > we can store a flag on the kmem cache, and for the previous memcg we
> > can use fhandle. However, to not make this more complicated, for now, we
> > can just have the info that the kmem cache was reparented, i.e. belongs
> > to an offlined memcg.
>
> I need to play with Roman's kmem cache reparenting patch a bit more to
> see how to properly recognize a reparent'ed kmem cache. What I have
> noticed is that the dead kmem caches that I saw at boot up were gone
> after applying his patch. So that is a good thing.
>

By gone, do you mean the kmem cache got freed, or that the kmem cache is
now part of an online parent memcg and thus no longer a dead kmem cache?

> For now, I think the current patch is good enough for its purpose. I may
> send a follow-up if I see something that can be improved.
>

I would like to see the recognition of reparented kmem caches in this
patch. However, if others are OK with the current state of the patch,
then I will not stand in the way.

thanks,
Shakeel