Message-ID: <20251201122912.348142-1-zhanghongru@xiaomi.com>
Date: Mon,  1 Dec 2025 20:29:12 +0800
From: Hongru Zhang <zhanghongru06@...il.com>
To: 21cnbao@...il.com,
	zhongjinji@...or.com
Cc: Liam.Howlett@...cle.com,
	akpm@...ux-foundation.org,
	axelrasmussen@...gle.com,
	david@...nel.org,
	hannes@...xchg.org,
	jackmanb@...gle.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	lorenzo.stoakes@...cle.com,
	mhocko@...e.com,
	rppt@...nel.org,
	surenb@...gle.com,
	vbabka@...e.cz,
	weixugc@...gle.com,
	yuanchu@...gle.com,
	zhanghongru06@...il.com,
	zhanghongru@...omi.com,
	ziy@...dia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count

> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.

Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitations:

    "Accuracy. Both implementations have accuracy limitations. The previous
    implementation required acquiring and releasing the zone lock for counting
    each order and migratetype, making it potentially inaccurate. Under high
    memory pressure, accuracy would further degrade due to zone lock
    contention or fragmentation. The new implementation collects data within a
    short time window, which helps maintain relatively small errors, and is
    unaffected by memory pressure. Furthermore, user-space memory management
    components inherently experience decision latency - by the time they
    process the collected data and execute actions, the memory state has
    already changed. This means that even perfectly accurate data at
    collection time becomes stale by decision time. Considering these factors,
    the accuracy trade-off introduced by the new implementation should be
    acceptable for practical use cases, offering a balance between performance
    and accuracy requirements."

Additional data:
1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
   runs is 4.67 us

2. the average latency is 125 ns if seq_printf() is taken out of the loop

Example code:

+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
                                        pg_data_t *pgdat, struct zone *zone)
 {
        int order, mtype;
+       ktime_t start;
+       u64 lat;
+       unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+       start = ktime_get();
+       for (order = 0; order < NR_PAGE_ORDERS; ++order)
+               for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+                       freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+       lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+       total_count++;
+       total_lat += lat;
 
        for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
                seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
                        bool overflow = false;
 
                        /* Keep the same output format for user-space tools compatibility */
-                       freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+                       freecount = freecounts[order][mtype];
                        if (freecount >= 100000) {
                                overflow = true;
                                freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
 #endif /* CONFIG_PAGE_OWNER */
 }

I think both of these are small time windows (and if IRQs are disabled,
the latency becomes even more deterministic).
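
For example, to make that window hard-bounded, the snapshot loop could
run with local IRQs off. A minimal sketch (the local_irq_save() /
local_irq_restore() pair here is only my illustration, not part of the
posted patch):

    unsigned long flags;

    /* keep interrupts from stretching the snapshot window */
    local_irq_save(flags);
    for (order = 0; order < NR_PAGE_ORDERS; ++order)
            for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
                    freecounts[order][mtype] =
                            READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
    local_irq_restore(flags);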

> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.

I know this does not guarantee correctness of the overall result. The
READ_ONCE() and WRITE_ONCE() annotations in this patch are there to
avoid potential store tearing and load tearing caused by compiler
optimizations.
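
To illustrate the concern: without the annotations the compiler may
split or repeat a plain word-sized access; with them, each side is a
single access, so a lockless reader sees either the old or the new
value of a naturally aligned counter, never a half-written one. A rough
sketch (the "area" pointer is just shorthand here):

    struct free_area *area = &zone->free_area[order];

    /* writer side, under the allocator's existing serialization */
    WRITE_ONCE(area->mt_nr_free[mtype], area->mt_nr_free[mtype] + 1);

    /* lockless reader side */
    freecount = READ_ONCE(area->mt_nr_free[mtype]);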

In fact, I had already noticed that /proc/buddyinfo collects its data
under the zone lock and uses data_race() to avoid KCSAN reports. I'm
wondering whether we could remove its zone lock as well, for the same
reasons as /proc/pagetypeinfo.
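
Roughly, the lockless variant could mirror what this series does for
/proc/pagetypeinfo; an untested sketch against the frag_show_print()
loop:

    static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
                                            struct zone *zone)
    {
            int order;

            seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
            for (order = 0; order < NR_PAGE_ORDERS; ++order)
                    /* lockless snapshot; data_race() keeps KCSAN quiet */
                    seq_printf(m, "%6lu ",
                               data_race(zone->free_area[order].nr_free));
            seq_putc(m, '\n');
    }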
