linux-kernel - Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJj2-QEvrgQ+R-nc3LZ-cBfnzjakxfSgmNbqDa-RFBVOpdVaAQ@mail.gmail.com>
Date: Fri, 6 Feb 2026 16:47:31 -0600
From: Yuanchu Xie <yuanchu@...gle.com>
To: Chen Ridong <chenridong@...weicloud.com>
Cc: akpm@...ux-foundation.org, axelrasmussen@...gle.com, weixugc@...gle.com, 
	david@...nel.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, 
	vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com, 
	corbet@....net, skhan@...uxfoundation.org, hannes@...xchg.org, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, 
	zhengqi.arch@...edance.com, linux-mm@...ck.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, lujialin4@...wei.com, 
	ryncsn@...il.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim

Hi Ridong,

Thanks for working to reconcile the gaps between the LRU implementations.

On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong <chenridong@...weicloud.com> wrote:
>
> From: Chen Ridong <chenridong@...wei.com>
>
> The memcg LRU was originally introduced to improve scalability during
> global reclaim. However, it is complex and only works with gen lru
> global reclaim. Moreover, its implementation complexity has led to
> performance regressions when handling a large number of memory cgroups [1].
>
> This patch introduces a per-memcg heat level for reclaim, aiming to unify
> gen lru and traditional LRU global reclaim. The core idea is to track
> per-node per-memcg reclaim state, including heat, last_decay, and
> last_refault. The last_refault records the total reclaimed data from the
> previous memcg reclaim. The last_decay is a time-based parameter; the heat
> level decays over time if the memcg is not reclaimed again. Both last_decay
> and last_refault are used to calculate the current heat level when reclaim
> starts.
>
> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
> reclaimed first; only if cold memcgs cannot reclaim enough pages, warm
> memcgs become eligible for reclaim. Hot memcgs are reclaimed last.
>
> While this design can be applied to all memcg reclaim scenarios, this patch
> is conservative and only introduces heat levels for traditional LRU global
> reclaim. Subsequent patches will replace the memcg LRU with
> heat-level-based reclaim.
>
> Based on tests provided by YU Zhao, traditional LRU global reclaim shows
> significant performance improvement with heat-level reclaim enabled.
>
> The results below are from a 2-hour run of the test [2].
>
> Throughput (number of requests)         before     after        Change
> Total                                   1734169    2353717      +35%
>
> Tail latency (number of requests)       before     after        Change
> [128s, inf)                             1231       1057         -14%
> [64s, 128s)                             586        444          -24%
> [32s, 64s)                              1658       1061         -36%
> [16s, 32s)                              4611       2863         -38%

Do you have any numbers comparing heat-based reclaim to memcg LRU?  I
know Johannes suggested removing memcg LRU, and what you have here
applies to more reclaim scenarios.

>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>
> Signed-off-by: Chen Ridong <chenridong@...wei.com>
> ---
>  include/linux/memcontrol.h |   7 ++
>  mm/memcontrol.c            |   3 +
>  mm/vmscan.c                | 227 +++++++++++++++++++++++++++++--------
>  3 files changed, 192 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index af352cabedba..b293caf70034 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>  struct lruvec_stats_percpu;
>  struct lruvec_stats;
>
> +struct memcg_reclaim_state {
> +       atomic_long_t heat;
> +       unsigned long last_decay;
> +       atomic_long_t last_refault;
> +};
> +
>  struct mem_cgroup_reclaim_iter {
>         struct mem_cgroup *position;
>         /* scan generation, increased every round-trip */
> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>         CACHELINE_PADDING(_pad2_);
>         unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>         struct mem_cgroup_reclaim_iter  iter;
> +       struct memcg_reclaim_state      reclaim;
>
>  #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>         /* slab stats for nmi context */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f2b87e02574e..675d49ad7e2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>
>         lruvec_init(&pn->lruvec);
>         pn->memcg = memcg;
> +       atomic_long_set(&pn->reclaim.heat, 0);
> +       pn->reclaim.last_decay = jiffies;
> +       atomic_long_set(&pn->reclaim.last_refault, 0);
>
>         memcg->nodeinfo[node] = pn;
>         return true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4aa73f125772..3759cd52c336 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>         return inactive_lru_pages > pages_for_compaction;
>  }
>
> +enum memcg_scan_level {
> +       MEMCG_LEVEL_COLD,
> +       MEMCG_LEVEL_WARM,
> +       MEMCG_LEVEL_HOT,
> +       MEMCG_LEVEL_MAX,
> +};
> +
> +#define MEMCG_HEAT_WARM                4
> +#define MEMCG_HEAT_HOT         8
> +#define MEMCG_HEAT_MAX         12
> +#define MEMCG_HEAT_DECAY_STEP  1
> +#define MEMCG_HEAT_DECAY_INTERVAL      (1 * HZ)
I agree with Kairui; I'm somewhat concerned about this fixed decay
interval and how it behaves with many memcgs or heavy pressure.

> +
> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
> +{
> +       long heat, new_heat;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       heat = atomic_long_read(&pn->reclaim.heat);
> +       do {
> +               new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
> +               if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
> +                       break;
> +               heat = atomic_long_read(&pn->reclaim.heat);
> +       } while (1);
> +}
> +
> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
> +{
> +       unsigned long last;
> +       unsigned long now = jiffies;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       last = READ_ONCE(pn->reclaim.last_decay);
> +       if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
> +               return;
> +
> +       if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
> +               return;
> +
> +       memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
> +}
> +
> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
> +{
> +       long heat;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return MEMCG_LEVEL_COLD;
> +
> +       memcg_decay_heat(pn);
The decay here is somewhat counterintuitive given the name memcg_heat_level.

> +       heat = atomic_long_read(&pn->reclaim.heat);
> +
> +       if (heat >= MEMCG_HEAT_HOT)
> +               return MEMCG_LEVEL_HOT;
> +       if (heat >= MEMCG_HEAT_WARM)
> +               return MEMCG_LEVEL_WARM;
> +       return MEMCG_LEVEL_COLD;
> +}
> +
> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
> +                                       struct lruvec *lruvec,
> +                                       unsigned long scanned,
> +                                       unsigned long reclaimed)
> +{
> +       long delta;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       memcg_decay_heat(pn);
Could you combine the decay and adjust later in this function?

> +
> +       /*
> +        * Memory cgroup heat adjustment algorithm:
> +        * - If scanned == 0: mark as hottest (+MAX_HEAT)
> +        * - If reclaimed >= 50% * scanned: strong cool (-2)
> +        * - If reclaimed >= 25% * scanned: mild cool (-1)
> +        * - Otherwise:  warm up (+1)
> +        */
> +       if (!scanned)
> +               delta = MEMCG_HEAT_MAX;
> +       else if (reclaimed * 2 >= scanned)
> +               delta = -2;
> +       else if (reclaimed * 4 >= scanned)
> +               delta = -1;
> +       else
> +               delta = 1;
> +
> +       /*
> +        * Refault-based heat adjustment:
> +        * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
> +        * - If no refaults and currently warm:     cool down (allow more reclaim)
> +        * This prevents thrashing by backing off when refaults indicate over-reclaim.
> +        */
> +       if (lruvec) {
> +               unsigned long total_refaults;
> +               unsigned long prev;
> +               long refault_delta;
> +
> +               total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
> +               total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
> +
> +               prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
> +               refault_delta = total_refaults - prev;
> +
> +               if (refault_delta > reclaimed)
> +                       delta++;
> +               else if (!refault_delta && delta > 0)
> +                       delta--;
> +       }

I think this metric is based more on the memcg's reclaimability than
on heat. Though the memcgs are grouped based on absolute metrics and
not relative to others.

> +
> +       memcg_adjust_heat(pn, delta);
> +}
> +
>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
> ...snip
>  }

Thanks,
Yuanchu