[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZqLg30nOUVlerBh1@google.com>
Date: Thu, 25 Jul 2024 23:33:51 +0000
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Michal Hocko <mhocko@...nel.org>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH v2 3/5] mm: memcg: merge multiple page_counters into a
single structure
On Thu, Jul 25, 2024 at 05:42:27PM -0400, Johannes Weiner wrote:
> On Wed, Jul 24, 2024 at 08:21:01PM +0000, Roman Gushchin wrote:
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -5,14 +5,71 @@
> > #include <linux/atomic.h>
> > #include <linux/cache.h>
> > #include <linux/limits.h>
> > +#include <linux/mm_types.h>
> > #include <asm/page.h>
> >
> > +/*
> > + * Page counters are used by memory and hugetlb cgroups.
> > + * Memory cgroups are using up to 4 separate counters:
> > + * memory, swap (memory + swap on cgroup v1), kmem and tcpmem.
> > + * Hugetlb cgroups are using 2 arrays of counters with HUGE_MAX_HSTATE
> > + * in each to track the usage and reservations of each supported
> > + * hugepage size.
> > + *
> > + * Protection (min/low) is supported only for the first counter
> > + * with idx 0 and only if the counter was initialized with the protection
> > + * support.
> > + */
> > +
> > +enum mem_counter_type {
> > +#ifdef CONFIG_MEMCG
> > + /* Unified memory counter */
> > + MCT_MEM,
> > +
> > + /* Swap */
> > + MCT_SWAP,
> > +
> > + /* Memory + swap */
> > + MCT_MEMSW = MCT_SWAP,
> > +
> > +#ifdef CONFIG_MEMCG_V1
> > + /* Kernel memory */
> > + MCT_KMEM,
> > +
> > + /* Tcp memory */
> > + MCT_TCPMEM,
> > +#endif /* CONFIG_MEMCG_V1 */
> > +#endif /* CONFIG_MEMCG */
> > +
> > + /* Maximum number of memcg counters */
> > + MCT_NR_MEMCG_ITEMS,
> > +};
> > +
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > +#ifdef HUGE_MAX_HSTATE
> > +#define MCT_NR_HUGETLB_ITEMS HUGE_MAX_HSTATE
> > +#else
> > +#define MCT_NR_HUGETLB_ITEMS 1
> > +#endif
> > +
> > +/*
> > + * max() can't be used here: even though __builtin_choose_expr() evaluates
> > + * to true, the false clause generates a compiler error:
> > + * error: braced-group within expression allowed only inside a function .
> > + */
> > +#define MCT_NR_ITEMS (__builtin_choose_expr(MCT_NR_MEMCG_ITEMS > MCT_NR_HUGETLB_ITEMS, \
> > + MCT_NR_MEMCG_ITEMS, MCT_NR_HUGETLB_ITEMS))
> > +
> > +#else /* CONFIG_CGROUP_HUGETLB */
> > +#define MCT_NR_ITEMS MCT_NR_MEMCG_ITEMS
> > +#endif /* CONFIG_CGROUP_HUGETLB */
> > +
> > struct page_counter {
> > /*
> > * Make sure 'usage' does not share cacheline with any other field. The
> > * memcg->memory.usage is a hot member of struct mem_cgroup.
> > */
> > - atomic_long_t usage;
> > + atomic_long_t usage[MCT_NR_ITEMS];
> > CACHELINE_PADDING(_pad1_);
> >
> > /* effective memory.min and memory.min usage tracking */
> > @@ -25,9 +82,9 @@ struct page_counter {
> > atomic_long_t low_usage;
> > atomic_long_t children_low_usage;
> >
> > - unsigned long watermark;
> > - unsigned long local_watermark; /* track min of fd-local resets */
> > - unsigned long failcnt;
> > + unsigned long watermark[MCT_NR_ITEMS];
> > + unsigned long local_watermark[MCT_NR_ITEMS]; /* track min of fd-local resets */
> > + unsigned long failcnt[MCT_NR_ITEMS];
> >
> > /* Keep all the read most fields in a separete cacheline. */
> > CACHELINE_PADDING(_pad2_);
> > @@ -35,8 +92,9 @@ struct page_counter {
> > bool protection_support;
> > unsigned long min;
> > unsigned long low;
> > - unsigned long high;
> > - unsigned long max;
> > + unsigned long high[MCT_NR_ITEMS];
> > + unsigned long max[MCT_NR_ITEMS];
> > +
> > struct page_counter *parent;
> > } ____cacheline_internodealigned_in_smp;
>
> This hardcodes way too much user specifics into what should be a
> self-contained piece of abstraction. Should anybody else try to use
> the 'struct page_counter' for a single resource elsewhere, they'd end
> up with up to 5 counters, watermarks, failcnts, highs and maxs etc.
>
> It's clear that the hierarchical aspect of the page_counter no longer
> works. It was okay-ish when all we had was a simple hard limit. Now we
> carry to much stuff in it that is only useful to a small set of users.
> Instead of doubling down and repeating the mistakes we made for struct
> page, it would be better to move user specific stuff out of it
> entirely. I think there are two patterns that would be more natural:
> a) pass a callback function and have a priv pointer in page_counter
> for user-specifics; or b) remove the hierarchy aspect from the page
> counter entirely, let the *caller* walk the tree, call the page
> counter code to trycharge (and log watermark) only that level, and
> handle everything caller specific on the caller side.
>
> All users already have parent linkage in the css, so this would
> eliminate the parent pointer in page_counter altogether.
>
> The protection stuff could be moved to memcg proper, eliminating yet
> more fields that aren't interesting to any other user.
Hm, Idk, I do agree with what you're saying about the self-contained
piece of abstraction (and I had very similar thoughts in the process),
but there are also some complications.
First, funny enough, the protection calculation code was just moved to
mm/page_counter.c by a8585ac68621 ("mm/page_counter: move calculating protection
values to page_counter"). The commit log says that it's gonna be used by the drm
cgroup controller, but the code is not (yet?) upstream apparently. I don't have
any insights here. If there will be the second user for the protection
functionality, moving it back to memcontrol.c is not feasible.
Second, I agree that it would be nice to get rid of the parent pointer in
struct page_cgroup entirely and use one in css. But Idk how to do it
without making the code way more messy or duplicate a lot of tree walks.
In C++ (or another language with generics) we could make struct page_counter
taking the number of counters and the set of features as template parameters.
I feel like with memcg1 code being factored out the benefit of this reorg
lowered, so how about me resending 2 first patches separately and spending
more time on thinking on the best long-term strategy? I have some ideas
here and I thought about this work as a preparatory step, but maybe you're
right and it makes sense to approach it more comprehensively.
Thanks!
Powered by blists - more mailing lists