Message-ID: <CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com>
Date: Tue, 2 Dec 2025 02:54:28 +0800
From: Barry Song <21cnbao@...il.com>
To: Hongru Zhang <zhanghongru06@...il.com>
Cc: zhongjinji@...or.com, Liam.Howlett@...cle.com, akpm@...ux-foundation.org,
axelrasmussen@...gle.com, david@...nel.org, hannes@...xchg.org,
jackmanb@...gle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
lorenzo.stoakes@...cle.com, mhocko@...e.com, rppt@...nel.org,
surenb@...gle.com, vbabka@...e.cz, weixugc@...gle.com, yuanchu@...gle.com,
zhanghongru@...omi.com, ziy@...dia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from
per-migratetype count
On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@...il.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has accuracy
> limitations:
>
> "Accuracy. Both implementations have accuracy limitations. The previous
> implementation required acquiring and releasing the zone lock for counting
> each order and migratetype, making it potentially inaccurate. Under high
> memory pressure, accuracy would further degrade due to zone lock
> contention or fragmentation. The new implementation collects data within a
> short time window, which helps maintain relatively small errors, and is
> unaffected by memory pressure. Furthermore, user-space memory management
> components inherently experience decision latency - by the time they
> process the collected data and execute actions, the memory state has
> already changed. This means that even perfectly accurate data at
> collection time becomes stale by decision time. Considering these factors,
> the accuracy trade-off introduced by the new implementation should be
> acceptable for practical use cases, offering a balance between performance
> and accuracy requirements."
>
> Additional data:
> 1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
> calls is 4.67 us
>
> 2. the average latency is 125 ns if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
>  static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	ktime_t start;
> +	u64 lat;
> +	unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> +	start = ktime_get();
> +	for (order = 0; order < NR_PAGE_ORDERS; ++order)
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> +			freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> +	lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> +	total_count++;
> +	total_lat += lat;
>
>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			bool overflow = false;
>
>  			/* Keep the same output format for user-space tools compatibility */
> -			freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +			freecount = freecounts[order][mtype];
>  			if (freecount >= 100000) {
>  				overflow = true;
>  				freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
>  #endif /* CONFIG_PAGE_OWNER */
>  }
>
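Side note, purely for completeness: the averages quoted above are simply
total_lat / total_count. A hypothetical readout (the function name is my own
invention, not part of the patch) could be as trivial as:

	static int pagetypeinfo_lat_show(struct seq_file *m, void *v)
	{
		/* average latency in ns over all sampled calls */
		seq_printf(m, "samples %lu, avg %lu ns\n",
			   total_count,
			   total_count ? total_lat / total_count : 0);
		return 0;
	}
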
> I think both collect data within a small time window (if IRQs are disabled,
> the latency is more deterministic).
>
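For illustration, a minimal sketch (my own, not part of the posted patch) of
the IRQ-disabled variant mentioned above: wrapping the snapshot loop in
local_irq_save()/local_irq_restore() keeps the collection window short and
makes its latency more deterministic on the local CPU:

	unsigned long flags;

	/* take the whole snapshot in one short, non-interruptible window */
	local_irq_save(flags);
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
			freecounts[order][mtype] =
				READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
	local_irq_restore(flags);
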
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.
Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.
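
To spell out the distinction with a toy example: each READ_ONCE()/WRITE_ONCE()
is tear-free on its own, but a series of independent reads is not an atomic
snapshot across counters. A minimal userspace sketch (purely illustrative,
every name below is mine, not kernel code):

/*
 * Two counters whose sum is invariant (64) under the writer's lock. The
 * lockless reader's two relaxed loads are each tear-free, yet the pair can
 * still observe an intermediate state where the sum is off by one.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define READ_ONCE(x)		atomic_load_explicit(&(x), memory_order_relaxed)
#define WRITE_ONCE(x, v)	atomic_store_explicit(&(x), (v), memory_order_relaxed)

static _Atomic unsigned long nr_free_movable = 64;
static _Atomic unsigned long nr_free_unmovable;
static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

/* "allocator": moves one block back and forth between the two migratetypes */
static void *allocator(void *arg)
{
	for (int i = 0; i < 10000000; i++) {
		pthread_mutex_lock(&zone_lock);
		WRITE_ONCE(nr_free_movable, READ_ONCE(nr_free_movable) - 1);
		WRITE_ONCE(nr_free_unmovable, READ_ONCE(nr_free_unmovable) + 1);
		WRITE_ONCE(nr_free_unmovable, READ_ONCE(nr_free_unmovable) - 1);
		WRITE_ONCE(nr_free_movable, READ_ONCE(nr_free_movable) + 1);
		pthread_mutex_unlock(&zone_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	unsigned long inconsistent = 0;

	pthread_create(&tid, NULL, allocator, NULL);

	/* "reader": lockless per-counter snapshot, like the proposed path */
	for (int i = 0; i < 10000000; i++) {
		unsigned long movable = READ_ONCE(nr_free_movable);
		unsigned long unmovable = READ_ONCE(nr_free_unmovable);

		if (movable + unmovable != 64)
			inconsistent++;
	}

	pthread_join(tid, NULL);
	printf("inconsistent snapshots: %lu\n", inconsistent);
	return 0;
}

The per-access atomicity holds, but the cross-counter invariant visibly does
not, which is exactly the difference being discussed.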
>
> In fact, I have already looked at /proc/buddyinfo, which collects data under
> the zone lock and uses data_race() to avoid KCSAN reports. But I'm wondering
> whether we could drop its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.
That might be correct. However, if it doesn't significantly affect performance,
and buddyinfo is read far less often than the buddy lists are touched, we may
just leave it as is.
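
For reference, the data_race() pattern being referred to looks roughly like
this (a sketch from memory, not a verbatim quote of mm/vmstat.c):

	/*
	 * nr_free is read only for printing; data_race() marks the access as
	 * an intentional race so KCSAN does not report it.
	 */
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		seq_printf(m, "%6lu ", data_race(zone->free_area[order].nr_free));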
[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/
Thanks
Barry