linux-kernel - Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal


Message-ID: <YItf3GIUs2skeuyi@carbon.dhcp.thefacebook.com>
Date:   Thu, 29 Apr 2021 18:39:40 -0700
From:   Roman Gushchin <guro@...com>
To:     Dave Chinner <david@...morbit.com>
CC:     Muchun Song <songmuchun@...edance.com>, <willy@...radead.org>,
        <akpm@...ux-foundation.org>, <hannes@...xchg.org>,
        <mhocko@...nel.org>, <vdavydov.dev@...il.com>,
        <shakeelb@...gle.com>, <shy828301@...il.com>, <alexs@...nel.org>,
        <alexander.h.duyck@...ux.intel.com>, <richard.weiyang@...il.com>,
        <linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <linux-mm@...ck.org>
Subject: Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal

On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > On our server, we found a suspected memory leak. The kmalloc-32 slab
> > cache consumes more than 6GB of memory, while the other kmem_caches
> > consume less than 2GB.
> > 
> > After in-depth analysis, we found that the memory consumption of the
> > kmalloc-32 slab cache is caused by list_lru_one allocations.
> > 
> >   crash> p memcg_nr_cache_ids
> >   memcg_nr_cache_ids = $2 = 24574
> > 
> > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > can be calculated with the following formula.
> > 
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > 
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> > 
> >   crash> list super_blocks | wc -l
> >   952
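
As a sanity check on the figures quoted above, the arithmetic can be worked
through in a few lines of Python. This is only a sketch: the constant and
function names are made up for illustration, and the factor of two assumes
that each superblock carries two memcg-aware list_lrus (one for dentries, one
for inodes), which the report itself does not state.

```python
# Figures taken from the quoted report.
NUM_NUMA_NODES = 4
MEMCG_NR_CACHE_IDS = 24574
SLOT_BYTES = 32          # one kmalloc-32 object per list_lru_one
NUM_SUPERBLOCKS = 952    # line count of "list super_blocks"

def list_lru_bytes(nodes, cache_ids, slot_bytes=SLOT_BYTES):
    """Bytes of per-memcg slots held by a single list_lru."""
    return nodes * cache_ids * slot_bytes

per_lru = list_lru_bytes(NUM_NUMA_NODES, MEMCG_NR_CACHE_IDS)

# Assumption (not stated in the report): two memcg-aware list_lrus
# per superblock, one for dentries and one for inodes.
LRUS_PER_SUPERBLOCK = 2
total = per_lru * LRUS_PER_SUPERBLOCK * NUM_SUPERBLOCKS

print(per_lru)   # 3145472 bytes, i.e. the ~3MB per list_lru quoted above
print(total)     # 5988978688 bytes
```

Under that assumption the total comes out just under 6GB (decimal), which is in
the same range as the kmalloc-32 usage reported above.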
> 
> The more I see people trying to work around this, the more I think
> that the way memcgs have been grafted into the list_lru is back to
> front.
> 
> We currently allocate scope for every memcg to be able to be tracked
> on every node on every superblock instantiated in the system,
> regardless of whether that superblock is even accessible to that memcg.
> 
> These huge memcg counts come from container hosts where memcgs are
> confined to just a small subset of the total number of superblocks
> that are instantiated at any given point in time.
> 
> IOWs, for these systems with huge container counts, list_lru does
> not need the capability of tracking every memcg on every superblock.
> 
> What it comes down to is that the list_lru is only needed for a
> given memcg if that memcg is instantiating and freeing objects on a
> given list_lru.
> 
> Which makes me think we should be moving more towards an "add the
> memcg to the list_lru at the first insert" model rather than
> "instantiate all at memcg init time just in case". The model we
> originally came up with for supporting memcgs is really starting to
> show its limits, and we should address those limitations rather than
> hack more complexity into the system that does nothing to remove the
> limitations that are causing the problems in the first place.

I totally agree.

It looks like the initial implementation of the whole kernel memory accounting
and memcg-aware shrinkers was based on the idea that the number of memory
cgroups is relatively small and stable. With systemd creating a separate cgroup
for everything, including short-lived processes, this is simply not true anymore.
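
To make the difference between the two models concrete, here is a toy Python
model of the bookkeeping only. It is not kernel code and mirrors no real
list_lru API: the class and method names are hypothetical, and locking, memcg
teardown, and reparenting are ignored entirely.

```python
class EagerListLRU:
    """'Instantiate all at init time': one slot per (node, memcg id)."""

    def __init__(self, nodes, nr_cache_ids):
        self.lists = [[[] for _ in range(nr_cache_ids)]
                      for _ in range(nodes)]

    def add(self, node, memcg_id, item):
        self.lists[node][memcg_id].append(item)

    def allocated_slots(self):
        return sum(len(per_node) for per_node in self.lists)


class LazyListLRU:
    """'Add the memcg at the first insert': slots appear on demand."""

    def __init__(self, nodes):
        self.lists = [{} for _ in range(nodes)]

    def add(self, node, memcg_id, item):
        # The per-memcg list is created only when that memcg first
        # inserts an object on this node.
        self.lists[node].setdefault(memcg_id, []).append(item)

    def allocated_slots(self):
        return sum(len(per_node) for per_node in self.lists)


# Same geometry as in the report: 4 nodes, 24574 cache ids.
eager = EagerListLRU(4, 24574)
lazy = LazyListLRU(4)

# Only three memcgs ever touch this (hypothetical) superblock.
for memcg_id in (7, 42, 1000):
    eager.add(0, memcg_id, "dentry")
    lazy.add(0, memcg_id, "dentry")

print(eager.allocated_slots())  # 98296
print(lazy.allocated_slots())   # 3
```

In the eager model the slot count depends only on the number of nodes and
cache ids, while in the lazy model it follows the memcgs that actually use
the list.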

Thanks!