Message-ID: <b13fc805-728a-494e-93ea-f2dea351eb00@amd.com>
Date: Mon, 6 Oct 2025 11:27:21 +0530
From: Bharata B Rao <bharata@....com>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
CC: <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<dave.hansen@...el.com>, <gourry@...rry.net>, <hannes@...xchg.org>,
<mgorman@...hsingularity.net>, <mingo@...hat.com>, <peterz@...radead.org>,
<raghavendra.kt@....com>, <riel@...riel.com>, <rientjes@...gle.com>,
<sj@...nel.org>, <weixugc@...gle.com>, <willy@...radead.org>,
<ying.huang@...ux.alibaba.com>, <ziy@...dia.com>, <dave@...olabs.net>,
<nifan.cxl@...il.com>, <xuezhengchu@...wei.com>, <yiannis@...corp.com>,
<akpm@...ux-foundation.org>, <david@...hat.com>, <byungchul@...com>,
<kinseyho@...gle.com>, <joshua.hahnjy@...il.com>, <yuanchu@...gle.com>,
<balbirs@...dia.com>, <alok.rathore@...sung.com>
Subject: Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from
NUMAB=2 to kpromoted
On 03-Oct-25 6:08 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:53 +0530
> Bharata B Rao <bharata@....com> wrote:
>
>> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
>> mode of NUMA Balancing) does hot page detection (via hint faults),
>> hot page classification and eventual promotion, all by itself and
>> sits within the scheduler.
>>
>> With the new hot page tracking and promotion mechanism being
>> available, NUMA Balancing can limit itself to detection of
>> hot pages (via hint faults) and off-load rest of the
>> functionality to the common hot page tracking system.
>>
>> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
>> hot page info. In addition, the migration rate limiting and
>> dynamic threshold logic are moved to kpromoted so that the same
>> can be used for hot pages reported by other sources too.
>>
>> Signed-off-by: Bharata B Rao <bharata@....com>
>
> Making a direct replacement without any fallback to previous method
> is going to need a lot of data to show there are no important regressions.
>
> So bold move if that's the intent!
Firstly, I am only moving the existing hot page heuristics that are part of
NUMAB=2 to kpromoted, so that the same heuristics can be applied to hot
pages identified by other sources. The hint fault mechanism that is
inherent to NUMAB=2 still remains.
In fact, the kscand effort started as a potential replacement for the existing
hot page promotion mechanism, by getting rid of hint faults and moving the
page table scanning out of process context.
In any case, I will start including numbers from the next post.
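To make the intended split a little more concrete: after this patch the
hint fault path only reports the access and leaves classification and
promotion to kpromoted, along the lines of the sketch below (illustrative
only; the helper name and the exact pghot_record_access() arguments shown
here may not match what the series finally uses):

	/*
	 * Sketch: NUMAB=2 keeps detecting accesses via hint faults but
	 * merely reports them; classification and promotion happen later
	 * in kpromoted. The pghot_record_access() arguments are assumed,
	 * not taken from the series.
	 */
	static void numab_report_hint_fault(struct folio *folio, int nid)
	{
		/* Only pages on slower tiers are promotion candidates */
		if (node_is_toptier(folio_nid(folio)))
			return;

		pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT);
	}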
>>
>> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>>
>> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
>> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
>
> If the comment correlates with the value, this is 64 GiB/s? That seems
> unlikely, though I guess it's possible.
IIUC, the existing logic tries to limit the promotion rate to 64 GiB/s by
limiting the number of candidate pages that are promoted within the
1s observation interval.
Are you saying that achieving a rate of 64 GiB/s is not possible, or just
unlikely?
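FWIW, 65536 MB/s is the same default value as the existing NUMAB=2
promote rate limit sysctl that this patch moves. Internally it becomes a
per-node page budget for the 1s observation window, roughly like the
sketch below (this mirrors the current NUMAB=2 rate limiting; the pgdat
field names are illustrative and may differ after the move):

	static bool kpromoted_rate_limited(struct pglist_data *pgdat,
					   unsigned long nr_pages)
	{
		unsigned long rate_limit, nr_cand;
		unsigned int now, start;

		/* MB/s -> pages allowed per 1s observation window */
		rate_limit = (unsigned long)sysctl_pghot_promote_rate_limit
					<< (20 - PAGE_SHIFT);

		/* Account these pages as promotion candidates */
		mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr_pages);
		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);

		/* Restart the observation window once a second */
		now = jiffies_to_msecs(jiffies);
		start = pgdat->nbp_rl_start;
		if (now - start > MSEC_PER_SEC &&
		    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
			pgdat->nbp_rl_nr_cand = nr_cand;

		/* True once this window's budget is exhausted */
		return nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit;
	}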
>
>> +
>> #ifdef CONFIG_SYSCTL
>> static const struct ctl_table pghot_sysctls[] = {
>> {
>> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
>> .proc_handler = proc_dointvec_minmax,
>> .extra1 = SYSCTL_ZERO,
>> },
>> + {
>> + .procname = "pghot_promote_rate_limit_MBps",
>> + .data = &sysctl_pghot_promote_rate_limit,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ZERO,
>> + },
>> };
>> #endif
>> +
> Put that in earlier patch to reduce noise here.
This patch moves the hot page heuristics to kpromoted, and hence the
related sysctl is also being moved in this patch.
>
>> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
>> {
>> return (*(struct pghot_info **)lhs)->frequency >
>> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
>> return true;
>> }
>>
>> +/*
>> + * For memory tiering mode, if there are enough free pages (more than
>> + * enough watermark defined here) in fast memory node, to take full
>
> I'd use enough_wmark, just because "more than enough" is a common
> English phrase and I at least tripped over that sentence as a result!
Ah, I see that, but as you note later, I am currently only doing the code
movement.
>
>> + * advantage of fast memory capacity, all recently accessed slow
>> + * memory pages will be migrated to fast memory node without
>> + * considering hot threshold.
>> + */
>> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>> +{
>> + int z;
>> + unsigned long enough_wmark;
>> +
>> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
>> + pgdat->node_present_pages >> 4);
>> + for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>> + struct zone *zone = pgdat->node_zones + z;
>> +
>> + if (!populated_zone(zone))
>> + continue;
>> +
>> + if (zone_watermark_ok(zone, 0,
>> + promo_wmark_pages(zone) + enough_wmark,
>> + ZONE_MOVABLE, 0))
>> + return true;
>> + }
>> + return false;
>> +}
>
>> +
>> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
>
> Needs documentation of the algorithm and the reasons for various choices.
>
> I see it is a code move though so maybe that's a job for another day.
Sure.
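Until then, here is a rough description of the heuristic as it exists in
the NUMAB=2 code being moved (from my reading of the current code, so the
exact constants should be double-checked):

	/*
	 * Once every scan_period_max interval, compare the number of new
	 * promotion candidates seen in that interval (diff_cand) against
	 * the number the rate limit would allow (ref_cand):
	 *
	 *  - diff_cand > ~110% of ref_cand: too many candidates, so lower
	 *    the hot threshold by one step and fewer pages will qualify;
	 *  - diff_cand < ~90% of ref_cand: too few, so raise the threshold
	 *    by one step, capped at twice the reference threshold.
	 *
	 * The step size is ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS.
	 */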
Regards,
Bharata.