Message-ID: <6ad1fb5d-a859-4611-8af9-aa4d37aeeb38@huaweicloud.com>
Date: Mon, 9 Feb 2026 16:17:10 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Yuanchu Xie <yuanchu@...gle.com>
Cc: akpm@...ux-foundation.org, axelrasmussen@...gle.com, weixugc@...gle.com,
david@...nel.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
corbet@....net, skhan@...uxfoundation.org, hannes@...xchg.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
zhengqi.arch@...edance.com, linux-mm@...ck.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, lujialin4@...wei.com,
ryncsn@...il.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
Hi Yuanchu,
On 2026/2/7 6:47, Yuanchu Xie wrote:
> Hi Ridong,
>
> Thanks for working to reconcile the gaps between the LRU implementations.
>
> On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong <chenridong@...weicloud.com> wrote:
>>
>> From: Chen Ridong <chenridong@...wei.com>
>>
>> The memcg LRU was originally introduced to improve scalability during
>> global reclaim. However, it is complex and only works with gen lru
>> global reclaim. Moreover, its implementation complexity has led to
>> performance regressions when handling a large number of memory cgroups [1].
>>
>> This patch introduces a per-memcg heat level for reclaim, aiming to unify
>> gen lru and traditional LRU global reclaim. The core idea is to track
>> per-node per-memcg reclaim state, including heat, last_decay, and
>> last_refault. The last_refault records the total reclaimed data from the
>> previous memcg reclaim. The last_decay is a time-based parameter; the heat
>> level decays over time if the memcg is not reclaimed again. Both last_decay
>> and last_refault are used to calculate the current heat level when reclaim
>> starts.
>>
>> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
>> reclaimed first; only if cold memcgs cannot reclaim enough pages, warm
>> memcgs become eligible for reclaim. Hot memcgs are reclaimed last.
>>
>> While this design can be applied to all memcg reclaim scenarios, this patch
>> is conservative and only introduces heat levels for traditional LRU global
>> reclaim. Subsequent patches will replace the memcg LRU with
>> heat-level-based reclaim.
>>
>> Based on tests provided by Yu Zhao, traditional LRU global reclaim shows
>> significant performance improvement with heat-level reclaim enabled.
>>
>> The results below are from a 2-hour run of the test [2].
>>
>> Throughput (number of requests)      before    after     Change
>> Total                                1734169   2353717   +35%
>>
>> Tail latency (number of requests)    before    after     Change
>> [128s, inf)                          1231      1057      -14%
>> [64s, 128s)                          586       444       -24%
>> [32s, 64s)                           1658      1061      -36%
>> [16s, 32s)                           4611      2863      -38%
>
> Do you have any numbers comparing heat-based reclaim to memcg LRU? I
> know Johannes suggested removing memcg LRU, and what you have here
> applies to more reclaim scenarios.
>
Yes, the test data is provided in patch 5/7.
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>>
>> Signed-off-by: Chen Ridong <chenridong@...wei.com>
>> ---
>> include/linux/memcontrol.h | 7 ++
>> mm/memcontrol.c | 3 +
>> mm/vmscan.c | 227 +++++++++++++++++++++++++++++--------
>> 3 files changed, 192 insertions(+), 45 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af352cabedba..b293caf70034 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>> struct lruvec_stats_percpu;
>> struct lruvec_stats;
>>
>> +struct memcg_reclaim_state {
>> +        atomic_long_t heat;
>> +        unsigned long last_decay;
>> +        atomic_long_t last_refault;
>> +};
>> +
>> struct mem_cgroup_reclaim_iter {
>>         struct mem_cgroup *position;
>>         /* scan generation, increased every round-trip */
>> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>>         CACHELINE_PADDING(_pad2_);
>>         unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>         struct mem_cgroup_reclaim_iter iter;
>> +        struct memcg_reclaim_state reclaim;
>>
>> #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>>         /* slab stats for nmi context */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f2b87e02574e..675d49ad7e2c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>
>>         lruvec_init(&pn->lruvec);
>>         pn->memcg = memcg;
>> +        atomic_long_set(&pn->reclaim.heat, 0);
>> +        pn->reclaim.last_decay = jiffies;
>> +        atomic_long_set(&pn->reclaim.last_refault, 0);
>>
>>         memcg->nodeinfo[node] = pn;
>>         return true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4aa73f125772..3759cd52c336 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>>         return inactive_lru_pages > pages_for_compaction;
>> }
>>
>> +enum memcg_scan_level {
>> +        MEMCG_LEVEL_COLD,
>> +        MEMCG_LEVEL_WARM,
>> +        MEMCG_LEVEL_HOT,
>> +        MEMCG_LEVEL_MAX,
>> +};
>> +
>> +#define MEMCG_HEAT_WARM 4
>> +#define MEMCG_HEAT_HOT 8
>> +#define MEMCG_HEAT_MAX 12
>> +#define MEMCG_HEAT_DECAY_STEP 1
>> +#define MEMCG_HEAT_DECAY_INTERVAL (1 * HZ)
> I agree with Kairui; I'm somewhat concerned about this fixed decay
> interval and how it behaves with many memcgs or heavy pressure.
>
Yes, a fixed decay interval may not be optimal for every scenario; it serves as
a foundational baseline, and perhaps we could expose a BPF hook here for more
flexible tuning. Note that the referenced benchmark [2] already exercises heavy
pressure (continuously triggering global reclaim) with a large number of memory
cgroups.
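To make the decay behavior concrete, here is a small userspace sketch of the
lazy decay (not the patch itself: `maybe_decay` is a hypothetical stand-in for
memcg_decay_heat(), jiffies is replaced by a plain counter, and HZ, the step,
and the interval are assumed values):

```c
#include <assert.h>

/* Assumed constants, mirroring MEMCG_HEAT_DECAY_* in the patch. */
#define HZ             100
#define DECAY_STEP     1
#define DECAY_INTERVAL (1 * HZ)

/*
 * Apply at most one decay step if the interval has elapsed since *last;
 * updates *last and returns the (possibly) decayed heat. This mirrors
 * the single cmpxchg of last_decay in memcg_decay_heat().
 */
static long maybe_decay(long heat, unsigned long *last, unsigned long now)
{
	if (now - *last < DECAY_INTERVAL)
		return heat;
	*last = now;
	return heat > DECAY_STEP ? heat - DECAY_STEP : 0;
}
```

Note that even after a long idle gap a single check removes only one step, so
the effective decay rate is bounded by how often reclaim consults the heat
level, which is exactly what the fixed-interval question above is probing.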
>> +
>> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
>> +{
>> +        long heat, new_heat;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        heat = atomic_long_read(&pn->reclaim.heat);
>> +        do {
>> +                new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
>> +                if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
>> +                        break;
>> +                heat = atomic_long_read(&pn->reclaim.heat);
>> +        } while (1);
>> +}
>> +
>> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
>> +{
>> +        unsigned long last;
>> +        unsigned long now = jiffies;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        last = READ_ONCE(pn->reclaim.last_decay);
>> +        if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
>> +                return;
>> +
>> +        if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
>> +                return;
>> +
>> +        memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
>> +}
>> +
>> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
>> +{
>> +        long heat;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return MEMCG_LEVEL_COLD;
>> +
>> +        memcg_decay_heat(pn);
> The decay here is somewhat counterintuitive given the name memcg_heat_level.
>
The decay is folded into the level retrieval: whenever memcg_heat_level() is
called, we check whether more than MEMCG_HEAT_DECAY_INTERVAL has elapsed since
the last decay, and if so, apply one decay step before classifying.
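That read-side ordering can be modeled in a few lines of userspace C
(hypothetical `heat_state` struct; the thresholds mirror the patch's constants,
jiffies is replaced by a plain counter, and HZ is assumed to be 100):

```c
#include <assert.h>

enum { LEVEL_COLD, LEVEL_WARM, LEVEL_HOT };

/* Assumed constants mirroring MEMCG_HEAT_* in the patch. */
#define HEAT_WARM      4
#define HEAT_HOT       8
#define DECAY_STEP     1
#define DECAY_INTERVAL 100	/* one second at an assumed HZ of 100 */

struct heat_state {
	long heat;
	unsigned long last_decay;
};

/*
 * Classify a memcg's heat, applying one lazy decay step first if the
 * decay interval has elapsed -- the same order as memcg_heat_level().
 */
static int heat_level(struct heat_state *s, unsigned long now)
{
	if (now - s->last_decay >= DECAY_INTERVAL) {
		s->last_decay = now;
		if (s->heat > DECAY_STEP)
			s->heat -= DECAY_STEP;
		else
			s->heat = 0;
	}
	if (s->heat >= HEAT_HOT)
		return LEVEL_HOT;
	if (s->heat >= HEAT_WARM)
		return LEVEL_WARM;
	return LEVEL_COLD;
}
```

So a memcg sitting exactly at the hot threshold can drop back to warm on the
very next level query once the interval passes, without any reclaim activity.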
>> +        heat = atomic_long_read(&pn->reclaim.heat);
>> +
>> +        if (heat >= MEMCG_HEAT_HOT)
>> +                return MEMCG_LEVEL_HOT;
>> +        if (heat >= MEMCG_HEAT_WARM)
>> +                return MEMCG_LEVEL_WARM;
>> +        return MEMCG_LEVEL_COLD;
>> +}
>> +
>> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
>> +                                        struct lruvec *lruvec,
>> +                                        unsigned long scanned,
>> +                                        unsigned long reclaimed)
>> +{
>> +        long delta;
>> +
>> +        if (mem_cgroup_is_root(pn->memcg))
>> +                return;
>> +
>> +        memcg_decay_heat(pn);
> Could you combine the decay and adjust later in this function?
>
Sure.
>> +
>> +        /*
>> +         * Memory cgroup heat adjustment algorithm:
>> +         * - If scanned == 0: mark as hottest (+MAX_HEAT)
>> +         * - If reclaimed >= 50% * scanned: strong cool (-2)
>> +         * - If reclaimed >= 25% * scanned: mild cool (-1)
>> +         * - Otherwise: warm up (+1)
>> +         */
>> +        if (!scanned)
>> +                delta = MEMCG_HEAT_MAX;
>> +        else if (reclaimed * 2 >= scanned)
>> +                delta = -2;
>> +        else if (reclaimed * 4 >= scanned)
>> +                delta = -1;
>> +        else
>> +                delta = 1;
>> +
>> +        /*
>> +         * Refault-based heat adjustment:
>> +         * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
>> +         * - If no refaults and currently warm: cool down (allow more reclaim)
>> +         * This prevents thrashing by backing off when refaults indicate over-reclaim.
>> +         */
>> +        if (lruvec) {
>> +                unsigned long total_refaults;
>> +                unsigned long prev;
>> +                long refault_delta;
>> +
>> +                total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
>> +                total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
>> +
>> +                prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
>> +                refault_delta = total_refaults - prev;
>> +
>> +                if (refault_delta > reclaimed)
>> +                        delta++;
>> +                else if (!refault_delta && delta > 0)
>> +                        delta--;
>> +        }
>
> I think this metric is based more on the memcg's reclaimability than
> on heat. Though the memcgs are grouped based on absolute metrics and
> not relative to others.
>
I might be misunderstanding your comment. Could you elaborate?
As designed, the heat level is indeed derived from the memcg's own
reclaimability (reclaimed/scanned) and refault behavior. In essence, it
quantifies the difficulty or “heat” of reclaiming memory from that specific
cgroup. This metric directly correlates to whether a memcg can release memory
easily or not.
>> +
>> +        memcg_adjust_heat(pn, delta);
>> +}
>> +
>> static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>> {
>> ...snip
>> }
>
> Thanks,
> Yuanchu
--
Best regards,
Ridong