Message-ID: <20251201122912.348142-1-zhanghongru@xiaomi.com>
Date: Mon, 1 Dec 2025 20:29:12 +0800
From: Hongru Zhang <zhanghongru06@...il.com>
To: 21cnbao@...il.com,
zhongjinji@...or.com
Cc: Liam.Howlett@...cle.com,
akpm@...ux-foundation.org,
axelrasmussen@...gle.com,
david@...nel.org,
hannes@...xchg.org,
jackmanb@...gle.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
lorenzo.stoakes@...cle.com,
mhocko@...e.com,
rppt@...nel.org,
surenb@...gle.com,
vbabka@...e.cz,
weixugc@...gle.com,
yuanchu@...gle.com,
zhanghongru06@...il.com,
zhanghongru@...omi.com,
ziy@...dia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count
> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.
Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitations:
"Accuracy. Both implementations have accuracy limitations. The previous
implementation required acquiring and releasing the zone lock for counting
each order and migratetype, making it potentially inaccurate. Under high
memory pressure, accuracy would further degrade due to zone lock
contention or fragmentation. The new implementation collects data within a
short time window, which helps maintain relatively small errors, and is
unaffected by memory pressure. Furthermore, user-space memory management
components inherently experience decision latency - by the time they
process the collected data and execute actions, the memory state has
already changed. This means that even perfectly accurate data at
collection time becomes stale by decision time. Considering these factors,
the accuracy trade-off introduced by the new implementation should be
acceptable for practical use cases, offering a balance between performance
and accuracy requirements."
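For context, the "previous implementation" mentioned above is roughly the
following counting loop (condensed from the pre-patch
pagetypeinfo_showfree_print() in mm/vmstat.c; details vary by kernel
version). The lock drop/retake between orders is why the per-order counts
can come from different points in time:

	for (order = 0; order < NR_PAGE_ORDERS; ++order) {
		unsigned long freecount = 0;
		struct list_head *curr;

		/*
		 * Walk the free list under zone->lock, capped so a huge
		 * list cannot trigger the hard lockup detector.
		 */
		list_for_each(curr, &zone->free_area[order].free_list[mtype])
			if (++freecount >= 100000)
				break;

		seq_printf(m, "%6lu ", freecount);
		spin_unlock_irq(&zone->lock);
		cond_resched();
		spin_lock_irq(&zone->lock);
	}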
Additional data:
1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
calls is 4.67 us
2. the average latency is 125 ns if seq_printf() is taken out of the loop
Example code:
+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	ktime_t start;
+	u64 lat;
+	unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+	start = ktime_get();
+	for (order = 0; order < NR_PAGE_ORDERS; ++order)
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+			freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+	lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+	total_count++;
+	total_lat += lat;

 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 		seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 			bool overflow = false;

 			/* Keep the same output format for user-space tools compatibility */
-			freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+			freecount = freecounts[order][mtype];
 			if (freecount >= 100000) {
 				overflow = true;
 				freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
 #endif /* CONFIG_PAGE_OWNER */
 }
I think both represent a small time window (if IRQs are disabled, the
latency is even more deterministic; see the sketch below).
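For illustration, an IRQ-disabled variant of the snapshot loop above could
look like this (just a sketch, not something the patch does):

	unsigned long flags;

	local_irq_save(flags);	/* keep IRQs out of the snapshot window */
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
			freecounts[order][mtype] =
				READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
	local_irq_restore(flags);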
> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.
I know this does not guarantee correctness of the overall result. The
READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
store tearing and load tearing caused by compiler optimizations.
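To illustrate the intent (the writer side is not in the hunk above, so
treat this as a sketch and mt_nr_free_add() as a made-up helper name):

	static inline void mt_nr_free_add(struct free_area *area,
					  int migratetype, long delta)
	{
		/*
		 * The plain load is fine if all writers serialize on
		 * zone->lock; WRITE_ONCE() only keeps the compiler from
		 * splitting or re-emitting the store that lockless
		 * readers observe via READ_ONCE().
		 */
		WRITE_ONCE(area->mt_nr_free[migratetype],
			   area->mt_nr_free[migratetype] + delta);
	}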
In fact, I had already noticed that /proc/buddyinfo collects its data
under the zone lock and uses data_race() to avoid KCSAN reports. But I'm
wondering if we could remove its zone lock as well, for the same reasons
as /proc/pagetypeinfo.
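For reference, the buddyinfo printer looks roughly like this in current
kernels (condensed from frag_show_print() in mm/vmstat.c):

	/* Called with zone->lock held via walk_zones_in_node(). */
	static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
				    struct zone *zone)
	{
		int order;

		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
		for (order = 0; order < NR_PAGE_ORDERS; ++order)
			/* nr_free is only printed; data_race() silences KCSAN. */
			seq_printf(m, "%6lu ",
				   data_race(zone->free_area[order].nr_free));
		seq_putc(m, '\n');
	}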