Message-ID: <Ydx+BWQp18hjdO32@carbon.dhcp.thefacebook.com>
Date:   Mon, 10 Jan 2022 10:42:13 -0800
From:   Roman Gushchin <guro@...com>
To:     Muchun Song <songmuchun@...edance.com>
CC:     Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Yang Shi <shy828301@...il.com>, Alex Shi <alexs@...nel.org>,
        Wei Yang <richard.weiyang@...il.com>,
        Dave Chinner <david@...morbit.com>,
        <trond.myklebust@...merspace.com>, <anna.schumaker@...app.com>,
        <jaegeuk@...nel.org>, <chao@...nel.org>,
        Kari Argillander <kari.argillander@...il.com>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux Memory Management List <linux-mm@...ck.org>,
        <linux-nfs@...r.kernel.org>, Qi Zheng <zhengqi.arch@...edance.com>,
        Xiongchun duan <duanxiongchun@...edance.com>,
        Fam Zheng <fam.zheng@...edance.com>,
        Muchun Song <smuchun@...il.com>
Subject: Re: [PATCH v5 01/16] mm: list_lru: optimize memory consumption of
 arrays of per cgroup lists

On Sun, Jan 09, 2022 at 12:49:56PM +0800, Muchun Song wrote:
> On Fri, Jan 7, 2022 at 8:05 AM Roman Gushchin <guro@...com> wrote:
> >
> > On Mon, Dec 20, 2021 at 04:56:34PM +0800, Muchun Song wrote:
> > > The list_lru uses an array (list_lru_memcg->lru) to store pointers
> > > to the list_lru_one structures, and the array is per memcg per node.
> > > Therefore, the size of the arrays will be 10K * number_of_nodes * 8
> > > (the pointer size on 64-bit systems) when we run 10k containers in
> > > the system. The memory consumption of the arrays becomes significant.
> > > The more NUMA nodes there are, the more memory it consumes.
> > >
> > > I have done a simple test which creates 10K memcgs and 10K mount
> > > points on a two-node system. The memory consumption of the list_lru
> > > is 24464MB. After converting the array from per memcg per node to
> > > per memcg, the memory consumption becomes 21957MB, a reduction of
> > > about 2.5GB. On our AMD servers with 8 NUMA nodes, the memory
> > > consumption could be even more significant. The savings come from
> > > the list_lru_one heads, and the change also simplifies the
> > > alloc/dealloc path.
> > >
> > > The new scheme looks like the following.
> > >
> > >   +----------+   mlrus   +----------------+   mlru   +----------------------+
> > >   | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
> > >   +----------+           +----------------+          +----------------------+
> > >                                                      |  list_lru_per_memcg  |
> > >                                                      +----------------------+
> > >                                                      |          ...         |
> > >                           +--------------+   node    +----------------------+
> > >                           | list_lru_one |<----------+  list_lru_per_memcg  |
> > >                           +--------------+           +----------------------+
> > >                           | list_lru_one |
> > >                           +--------------+
> > >                           |      ...     |
> > >                           +--------------+
> > >                           | list_lru_one |
> > >                           +--------------+
> > >
> > > Signed-off-by: Muchun Song <songmuchun@...edance.com>
> > > Acked-by: Johannes Weiner <hannes@...xchg.org>
> >
> > As much as I like the code changes (there is indeed a significant simplification!),
> > I don't like the commit message and title, because from them I wasn't able to
> > understand what the patch is doing, and some parts look simply questionable. Overall
> > it sounds like you reduce the number of list_lru_one structures, which is not true.
> >
> > How about something like this?
> >
> > --
> > mm: list_lru: transpose the array of per-node per-memcg lru lists
> >
> > The current scheme of maintaining per-node per-memcg lru lists looks like:
> >   struct list_lru {
> >     struct list_lru_node *node;           (for each node)
> >       struct list_lru_memcg *memcg_lrus;
> >         struct list_lru_one *lru[];       (for each memcg)
> >   }
> >
> > By effectively transposing the two-dimensional array of list_lru_one structures
> > (per-node per-memcg => per-memcg per-node) it's possible to save some memory
> > and simplify the alloc/dealloc paths. The new scheme looks like:
> >   struct list_lru {
> >     struct list_lru_memcg *mlrus;
> >       struct list_lru_per_memcg *mlru[];  (for each memcg)
> >         struct list_lru_one node[0];      (for each node)
> >   }
> >
> > Memory savings come from having fewer list_lru_memcg structures, each of which
> > carries an extra struct rcu_head to handle the destruction process.
> 
> My bad English. Actually, the savings come not only from 'struct rcu_head'
> but also from the pointer arrays used to store pointers to 'struct list_lru_one'.
> There is one such array per node, and its size is 8 (a pointer) * num_memcgs.

Nice! Please, add this to the commit log.

> So the total
> size of the arrays is 8 * num_nodes * memcg_nr_cache_ids. After this patch,
> the size becomes 8 * memcg_nr_cache_ids. So the saving per list_lru is
> 
>    8 * (num_nodes - 1) * memcg_nr_cache_ids.
> 
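
To make that concrete, in the old scheme every node carries its own
list_lru_memcg, roughly like this (a sketch based on the description above;
field details, locking and annotations are approximate, not copied from the
tree):

    /* old scheme: allocated once per node for every list_lru */
    struct list_lru_memcg {
            struct rcu_head         rcu;
            /* array of pointers, indexed by memcg_cache_id */
            struct list_lru_one     *lru[];
    };

    struct list_lru_node {
            struct list_lru_memcg __rcu *memcg_lrus;
            /* ... per-node lock and counters ... */
    };

So both the rcu_head and the 8 * memcg_nr_cache_ids pointer array are
duplicated num_nodes times per list_lru; after the patch there is a single
list_lru_memcg (one rcu_head, one array of list_lru_per_memcg pointers) per
list_lru.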
> > --
> >
> > But what worries me is that the memory savings numbers you posted don't add up.
> > In theory we can save
> > 16 (size of struct rcu_head) * 10000 (number of cgroups) * 2 (number of numa nodes) = 320k
> > per slab cache. Did you have a ton of mount points? Otherwise I don't understand
> > where these 2.5GB are coming from.
> 
> memcg_nr_cache_ids is 12286 when creating 10k memcgs. So the saving
> from the arrays of one list_lru is 8 * 1 (num_nodes - 1) * 12286 = ~96KB.
> There will be 2 * 10k list_lrus when mounting 10k mount points. So the
> total saving is 96KB * 2 * 10k = ~1920MB.

So, there are 10k cgroups _and_ 10k mount points. Please make that obvious in
the commit log. Most users don't have that many mount points (or, likely, that
many cgroups), so they shouldn't expect savings on the order of gigabytes.
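
For the record, here is that arithmetic as a tiny standalone sketch (the
constants are just the ones quoted in this thread; the two list_lrus per
mount point follow from the numbers above, everything else is illustrative):

    #include <stdio.h>

    int main(void)
    {
            long ptr = 8;            /* pointer size on a 64-bit system */
            long nodes = 2;          /* two-node test machine */
            long ids = 12286;        /* memcg_nr_cache_ids for 10k memcgs */
            long lrus = 2 * 10000;   /* two list_lrus per mount, 10k mounts */

            /* per-node pointer arrays collapse into a single per-memcg one */
            long per_lru = ptr * (nodes - 1) * ids;  /* bytes per list_lru */
            long total = per_lru * lrus;             /* bytes overall */

            printf("per list_lru: %ld bytes (~96 KB)\n", per_lru);
            printf("total:        %ld bytes (~1.9 GB)\n", total);
            return 0;
    }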

Thanks!

PS I hope to review the rest of the patchset by the end of this week.
