Message-ID: <20251201122912.348142-1-zhanghongru@xiaomi.com>
Date: Mon,  1 Dec 2025 20:29:12 +0800
From: Hongru Zhang <zhanghongru06@...il.com>
To: 21cnbao@...il.com,
	zhongjinji@...or.com
Cc: Liam.Howlett@...cle.com,
	akpm@...ux-foundation.org,
	axelrasmussen@...gle.com,
	david@...nel.org,
	hannes@...xchg.org,
	jackmanb@...gle.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	lorenzo.stoakes@...cle.com,
	mhocko@...e.com,
	rppt@...nel.org,
	surenb@...gle.com,
	vbabka@...e.cz,
	weixugc@...gle.com,
	yuanchu@...gle.com,
	zhanghongru06@...il.com,
	zhanghongru@...omi.com,
	ziy@...dia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count

> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.

Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitations:

    "Accuracy. Both implementations have accuracy limitations. The previous
    implementation required acquiring and releasing the zone lock for counting
    each order and migratetype, making it potentially inaccurate. Under high
    memory pressure, accuracy would further degrade due to zone lock
    contention or fragmentation. The new implementation collects data within a
    short time window, which helps maintain relatively small errors, and is
    unaffected by memory pressure. Furthermore, user-space memory management
    components inherently experience decision latency - by the time they
    process the collected data and execute actions, the memory state has
    already changed. This means that even perfectly accurate data at
    collection time becomes stale by decision time. Considering these factors,
    the accuracy trade-off introduced by the new implementation should be
    acceptable for practical use cases, offering a balance between performance
    and accuracy requirements."

Additional data:
1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
   runs is 4.67 us

2. the average latency is 125 ns if seq_printf() is taken out of the loop

Example code:

+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
                                        pg_data_t *pgdat, struct zone *zone)
 {
        int order, mtype;
+       ktime_t start;
+       u64 lat;
+       unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+       start = ktime_get();
+       for (order = 0; order < NR_PAGE_ORDERS; ++order)
+               for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+                       freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+       lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+       total_count++;
+       total_lat += lat;
 
        for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
                seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
                        bool overflow = false;
 
                        /* Keep the same output format for user-space tools compatibility */
-                       freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+                       freecount = freecounts[order][mtype];
                        if (freecount >= 100000) {
                                overflow = true;
                                freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
 #endif /* CONFIG_PAGE_OWNER */
 }

I think both of these are small time windows (and if IRQs are disabled,
the latency becomes even more deterministic).
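
For example, to make that window hard-bounded, the snapshot loop could
run with local IRQs off. A minimal sketch (the local_irq_save() /
local_irq_restore() pair here is only my illustration, not part of the
posted patch):

    unsigned long flags;

    /* keep interrupts from stretching the snapshot window */
    local_irq_save(flags);
    for (order = 0; order < NR_PAGE_ORDERS; ++order)
            for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
                    freecounts[order][mtype] =
                            READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
    local_irq_restore(flags);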

> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.

I know this does not guarantee correctness of the overall result. The
READ_ONCE() and WRITE_ONCE() annotations in this patch are there to
avoid potential store tearing and load tearing caused by compiler
optimizations.
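
To illustrate the concern: without the annotations the compiler may
split or repeat a plain word-sized access; with them, each side is a
single access, so a lockless reader sees either the old or the new
value of a naturally aligned counter, never a half-written one. A rough
sketch (the "area" pointer is just shorthand here):

    struct free_area *area = &zone->free_area[order];

    /* writer side, under the allocator's existing serialization */
    WRITE_ONCE(area->mt_nr_free[mtype], area->mt_nr_free[mtype] + 1);

    /* lockless reader side */
    freecount = READ_ONCE(area->mt_nr_free[mtype]);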

In fact, I had already noticed that /proc/buddyinfo collects its data
under the zone lock and uses data_race() to avoid KCSAN reports. I'm
wondering whether we could remove its zone lock as well, for the same
reasons as /proc/pagetypeinfo.
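
Roughly, the lockless variant could mirror what this series does for
/proc/pagetypeinfo; an untested sketch against the frag_show_print()
loop:

    static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
                                            struct zone *zone)
    {
            int order;

            seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
            for (order = 0; order < NR_PAGE_ORDERS; ++order)
                    /* lockless snapshot; data_race() keeps KCSAN quiet */
                    seq_printf(m, "%6lu ",
                               data_race(zone->free_area[order].nr_free));
            seq_putc(m, '\n');
    }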
