Message-ID: <6ad1fb5d-a859-4611-8af9-aa4d37aeeb38@huaweicloud.com>
Date: Mon, 9 Feb 2026 16:17:10 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Yuanchu Xie <yuanchu@...gle.com>
Cc: akpm@...ux-foundation.org, axelrasmussen@...gle.com, weixugc@...gle.com,
david@...nel.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
corbet@....net, skhan@...uxfoundation.org, hannes@...xchg.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
zhengqi.arch@...edance.com, linux-mm@...ck.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, lujialin4@...wei.com,
ryncsn@...il.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
Hi Yuanchu,
On 2026/2/7 6:47, Yuanchu Xie wrote:
> Hi Ridong,
>
> Thanks for working to reconcile the gaps between the LRU implementations.
>
> On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong <chenridong@...weicloud.com> wrote:
>>
>> From: Chen Ridong <chenridong@...wei.com>
>>
>> The memcg LRU was originally introduced to improve scalability during
>> global reclaim. However, it is complex and only works with gen lru
>> global reclaim. Moreover, its implementation complexity has led to
>> performance regressions when handling a large number of memory cgroups [1].
>>
>> This patch introduces a per-memcg heat level for reclaim, aiming to unify
>> gen lru and traditional LRU global reclaim. The core idea is to track
>> per-node per-memcg reclaim state, including heat, last_decay, and
>> last_refault. The last_refault records the total reclaimed data from the
>> previous memcg reclaim. The last_decay is a time-based parameter; the heat
>> level decays over time if the memcg is not reclaimed again. Both last_decay
>> and last_refault are used to calculate the current heat level when reclaim
>> starts.
>>
>> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
>> reclaimed first; only if cold memcgs cannot reclaim enough pages, warm
>> memcgs become eligible for reclaim. Hot memcgs are reclaimed last.
>>
>> While this design can be applied to all memcg reclaim scenarios, this patch
>> is conservative and only introduces heat levels for traditional LRU global
>> reclaim. Subsequent patches will replace the memcg LRU with
>> heat-level-based reclaim.
>>
>> Based on tests provided by Yu Zhao, traditional LRU global reclaim shows
>> significant performance improvement with heat-level reclaim enabled.
>>
>> The results below are from a 2-hour run of the test [2].
>>
>> Throughput (number of requests)      before    after     Change
>> Total                                1734169   2353717   +35%
>>
>> Tail latency (number of requests)    before    after     Change
>> [128s, inf)                          1231      1057      -14%
>> [64s, 128s)                          586       444       -24%
>> [32s, 64s)                           1658      1061      -36%
>> [16s, 32s)                           4611      2863      -38%
>
> Do you have any numbers comparing heat-based reclaim to memcg LRU? I
> know Johannes suggested removing memcg LRU, and what you have here
> applies to more reclaim scenarios.
>
Yes, the test data is provided in patch 5/7.
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>>
>> Signed-off-by: Chen Ridong <chenridong@...wei.com>
>> ---
>> include/linux/memcontrol.h | 7 ++
>> mm/memcontrol.c | 3 +
>> mm/vmscan.c | 227 +++++++++++++++++++++++++++++--------
>> 3 files changed, 192 insertions(+), 45 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af352cabedba..b293caf70034 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>> struct lruvec_stats_percpu;
>> struct lruvec_stats;
>>
>> +struct memcg_reclaim_state {
>> +        atomic_long_t heat;
>> +        unsigned long last_decay;
>> +        atomic_long_t last_refault;
>> +};
>> +
>> struct mem_cgroup_reclaim_iter {
>>         struct mem_cgroup *position;
>>         /* scan generation, increased every round-trip */
>> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>>         CACHELINE_PADDING(_pad2_);
>>         unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>         struct mem_cgroup_reclaim_iter iter;
>> +        struct memcg_reclaim_state reclaim;
>>
>> #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>>         /* slab stats for nmi context */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f2b87e02574e..675d49ad7e2c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>
>>         lruvec_init(&pn->lruvec);
>>         pn->memcg = memcg;
>> +        atomic_long_set(&pn->reclaim.heat, 0);
>> +        pn->reclaim.last_decay = jiffies;
>> +        atomic_long_set(&pn->reclaim.last_refault, 0);
>>
>>         memcg->nodeinfo[node] = pn;
>>         return true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4aa73f125772..3759cd52c336 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>>         return inactive_lru_pages > pages_for_compaction;
>> }
>>
>> +enum memcg_scan_level {
>> +        MEMCG_LEVEL_COLD,
>> +        MEMCG_LEVEL_WARM,
>> +        MEMCG_LEVEL_HOT,
>> +        MEMCG_LEVEL_MAX,
>> +};
>> +
>> +#define MEMCG_HEAT_WARM 4
>> +#define MEMCG_HEAT_HOT 8
>> +#define MEMCG_HEAT_MAX 12
>> +#define MEMCG_HEAT_DECAY_STEP 1
>> +#define MEMCG_HEAT_DECAY_INTERVAL (1 * HZ)
> I agree with Kairui; I'm somewhat concerned about this fixed decay
> interval and how it behaves with many memcgs or heavy pressure.
>
Yes, a fixed decay interval may not be optimal for every scenario; it serves as
a foundational baseline, and perhaps we could expose a BPF hook here for more
flexible tuning. Note that the referenced benchmark [2] already exercises heavy
pressure (continuously triggering global reclaim) with a large number of memory
cgroups.
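To make the decay behavior concrete, here is a small userspace sketch of the
lazy decay (not the patch itself: `maybe_decay` is a hypothetical stand-in for
memcg_decay_heat(), jiffies is replaced by a plain counter, and HZ, the step,
and the interval are assumed values):

```c
#include <assert.h>

/* Assumed constants, mirroring MEMCG_HEAT_DECAY_* in the patch. */
#define HZ             100
#define DECAY_STEP     1
#define DECAY_INTERVAL (1 * HZ)

/*
 * Apply at most one decay step if the interval has elapsed since *last;
 * updates *last and returns the (possibly) decayed heat. This mirrors
 * the single cmpxchg of last_decay in memcg_decay_heat().
 */
static long maybe_decay(long heat, unsigned long *last, unsigned long now)
{
	if (now - *last < DECAY_INTERVAL)
		return heat;
	*last = now;
	return heat > DECAY_STEP ? heat - DECAY_STEP : 0;
}
```

Note that even after a long idle gap a single check removes only one step, so
the effective decay rate is bounded by how often reclaim consults the heat
level, which is exactly what the fixed-interval question above is probing.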
>> +
>> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
>> +{
>> +        long heat, new_heat;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        heat = atomic_long_read(&pn->reclaim.heat);
>> +        do {
>> +                new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
>> +                if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
>> +                        break;
>> +                heat = atomic_long_read(&pn->reclaim.heat);
>> +        } while (1);
>> +}
>> +
>> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
>> +{
>> +        unsigned long last;
>> +        unsigned long now = jiffies;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        last = READ_ONCE(pn->reclaim.last_decay);
>> +        if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
>> +                return;
>> +
>> +        if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
>> +                return;
>> +
>> +        memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
>> +}
>> +
>> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
>> +{
>> +        long heat;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return MEMCG_LEVEL_COLD;
>> +
>> +        memcg_decay_heat(pn);
> The decay here is somewhat counterintuitive given the name memcg_heat_level.
>
The decay is folded into the level retrieval: whenever memcg_heat_level() is
called, we check whether more than MEMCG_HEAT_DECAY_INTERVAL has elapsed since
the last decay, and if so, apply one decay step before classifying.
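That read-side ordering can be modeled in a few lines of userspace C
(hypothetical `heat_state` struct; the thresholds mirror the patch's constants,
jiffies is replaced by a plain counter, and HZ is assumed to be 100):

```c
#include <assert.h>

enum { LEVEL_COLD, LEVEL_WARM, LEVEL_HOT };

/* Assumed constants mirroring MEMCG_HEAT_* in the patch. */
#define HEAT_WARM      4
#define HEAT_HOT       8
#define DECAY_STEP     1
#define DECAY_INTERVAL 100	/* one second at an assumed HZ of 100 */

struct heat_state {
	long heat;
	unsigned long last_decay;
};

/*
 * Classify a memcg's heat, applying one lazy decay step first if the
 * decay interval has elapsed -- the same order as memcg_heat_level().
 */
static int heat_level(struct heat_state *s, unsigned long now)
{
	if (now - s->last_decay >= DECAY_INTERVAL) {
		s->last_decay = now;
		if (s->heat > DECAY_STEP)
			s->heat -= DECAY_STEP;
		else
			s->heat = 0;
	}
	if (s->heat >= HEAT_HOT)
		return LEVEL_HOT;
	if (s->heat >= HEAT_WARM)
		return LEVEL_WARM;
	return LEVEL_COLD;
}
```

So a memcg sitting exactly at the hot threshold can drop back to warm on the
very next level query once the interval passes, without any reclaim activity.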
>> +        heat = atomic_long_read(&pn->reclaim.heat);
>> +
>> +        if (heat >= MEMCG_HEAT_HOT)
>> +                return MEMCG_LEVEL_HOT;
>> +        if (heat >= MEMCG_HEAT_WARM)
>> +                return MEMCG_LEVEL_WARM;
>> +        return MEMCG_LEVEL_COLD;
>> +}
>> +
>> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
>> +                                        struct lruvec *lruvec,
>> +                                        unsigned long scanned,
>> +                                        unsigned long reclaimed)
>> +{
>> +        long delta;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        memcg_decay_heat(pn);
> Could you combine the decay and adjust later in this function?
>
Sure.
>> +
>> +        /*
>> +         * Memory cgroup heat adjustment algorithm:
>> +         * - If scanned == 0: mark as hottest (+MAX_HEAT)
>> +         * - If reclaimed >= 50% * scanned: strong cool (-2)
>> +         * - If reclaimed >= 25% * scanned: mild cool (-1)
>> +         * - Otherwise: warm up (+1)
>> +         */
>> +        if (!scanned)
>> +                delta = MEMCG_HEAT_MAX;
>> +        else if (reclaimed * 2 >= scanned)
>> +                delta = -2;
>> +        else if (reclaimed * 4 >= scanned)
>> +                delta = -1;
>> +        else
>> +                delta = 1;
>> +
>> +        /*
>> +         * Refault-based heat adjustment:
>> +         * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
>> +         * - If no refaults and currently warm: cool down (allow more reclaim)
>> +         * This prevents thrashing by backing off when refaults indicate over-reclaim.
>> +         */
>> +        if (lruvec) {
>> +                unsigned long total_refaults;
>> +                unsigned long prev;
>> +                long refault_delta;
>> +
>> +                total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
>> +                total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
>> +
>> +                prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
>> +                refault_delta = total_refaults - prev;
>> +
>> +                if (refault_delta > reclaimed)
>> +                        delta++;
>> +                else if (!refault_delta && delta > 0)
>> +                        delta--;
>> +        }
>
> I think this metric is based more on the memcg's reclaimability than
> on heat. Though the memcgs are grouped based on absolute metrics and
> not relative to others.
>
I might be misunderstanding your comment. Could you elaborate?
As designed, the heat level is indeed derived from the memcg's own
reclaimability (reclaimed/scanned) and refault behavior. In essence, it
quantifies the difficulty or “heat” of reclaiming memory from that specific
cgroup. This metric directly correlates to whether a memcg can release memory
easily or not.
>> +
>> +        memcg_adjust_heat(pn, delta);
>> +}
>> +
>> static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>> {
>> ...snip
>> }
>
> Thanks,
> Yuanchu
--
Best regards,
Ridong