Message-ID: <CAGsJ_4wvwH8oMYhbbhnbJAdX-fqBG9z-hXYqfshEXU0ChRpFyg@mail.gmail.com>
Date: Tue, 2 Dec 2025 02:54:28 +0800
From: Barry Song <21cnbao@...il.com>
To: Hongru Zhang <zhanghongru06@...il.com>
Cc: zhongjinji@...or.com, Liam.Howlett@...cle.com, akpm@...ux-foundation.org,
axelrasmussen@...gle.com, david@...nel.org, hannes@...xchg.org,
jackmanb@...gle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
lorenzo.stoakes@...cle.com, mhocko@...e.com, rppt@...nel.org,
surenb@...gle.com, vbabka@...e.cz, weixugc@...gle.com, yuanchu@...gle.com,
zhanghongru@...omi.com, ziy@...dia.com
Subject: Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from
per-migratetype count
On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@...il.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has accuracy
> limitations:
>
> "Accuracy. Both implementations have accuracy limitations. The previous
> implementation required acquiring and releasing the zone lock for counting
> each order and migratetype, making it potentially inaccurate. Under high
> memory pressure, accuracy would further degrade due to zone lock
> contention or fragmentation. The new implementation collects data within a
> short time window, which helps maintain relatively small errors, and is
> unaffected by memory pressure. Furthermore, user-space memory management
> components inherently experience decision latency - by the time they
> process the collected data and execute actions, the memory state has
> already changed. This means that even perfectly accurate data at
> collection time becomes stale by decision time. Considering these factors,
> the accuracy trade-off introduced by the new implementation should be
> acceptable for practical use cases, offering a balance between performance
> and accuracy requirements."
>
> Additional data:
> 1. the average latency of pagetypeinfo_showfree_print() over 1,000,000
> calls is 4.67 us
>
> 2. the average latency is 125 ns if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
>  static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	ktime_t start;
> +	u64 lat;
> +	unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> +	start = ktime_get();
> +	for (order = 0; order < NR_PAGE_ORDERS; ++order)
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> +			freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> +	lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> +	total_count++;
> +	total_lat += lat;
>
>  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  			bool overflow = false;
>
>  			/* Keep the same output format for user-space tools compatibility */
> -			freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +			freecount = freecounts[order][mtype];
>  			if (freecount >= 100000) {
>  				overflow = true;
>  				freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
>  #endif /* CONFIG_PAGE_OWNER */
>  }
>
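Side note, purely for completeness: the averages quoted above are simply
total_lat / total_count. A hypothetical readout (the function name is my own
invention, not part of the patch) could be as trivial as:

	static int pagetypeinfo_lat_show(struct seq_file *m, void *v)
	{
		/* average latency in ns over all sampled calls */
		seq_printf(m, "samples %lu, avg %lu ns\n",
			   total_count,
			   total_count ? total_lat / total_count : 0);
		return 0;
	}
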
> I think both collect data within a small time window (if IRQs are disabled,
> the latency is more deterministic).
>
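For illustration, a minimal sketch (my own, not part of the posted patch) of
the IRQ-disabled variant mentioned above: wrapping the snapshot loop in
local_irq_save()/local_irq_restore() keeps the collection window short and
makes its latency more deterministic on the local CPU:

	unsigned long flags;

	/* take the whole snapshot in one short, non-interruptible window */
	local_irq_save(flags);
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
			freecounts[order][mtype] =
				READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
	local_irq_restore(flags);
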
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.
Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.
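
To spell out the distinction with a toy example: each READ_ONCE()/WRITE_ONCE()
is tear-free on its own, but a series of independent reads is not an atomic
snapshot across counters. A minimal userspace sketch (purely illustrative,
every name below is mine, not kernel code):

/*
 * Two counters whose sum is invariant (64) under the writer's lock. The
 * lockless reader's two relaxed loads are each tear-free, yet the pair can
 * still observe an intermediate state where the sum is off by one.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define READ_ONCE(x)		atomic_load_explicit(&(x), memory_order_relaxed)
#define WRITE_ONCE(x, v)	atomic_store_explicit(&(x), (v), memory_order_relaxed)

static _Atomic unsigned long nr_free_movable = 64;
static _Atomic unsigned long nr_free_unmovable;
static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

/* "allocator": moves one block back and forth between the two migratetypes */
static void *allocator(void *arg)
{
	for (int i = 0; i < 10000000; i++) {
		pthread_mutex_lock(&zone_lock);
		WRITE_ONCE(nr_free_movable, READ_ONCE(nr_free_movable) - 1);
		WRITE_ONCE(nr_free_unmovable, READ_ONCE(nr_free_unmovable) + 1);
		WRITE_ONCE(nr_free_unmovable, READ_ONCE(nr_free_unmovable) - 1);
		WRITE_ONCE(nr_free_movable, READ_ONCE(nr_free_movable) + 1);
		pthread_mutex_unlock(&zone_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	unsigned long inconsistent = 0;

	pthread_create(&tid, NULL, allocator, NULL);

	/* "reader": lockless per-counter snapshot, like the proposed path */
	for (int i = 0; i < 10000000; i++) {
		unsigned long movable = READ_ONCE(nr_free_movable);
		unsigned long unmovable = READ_ONCE(nr_free_unmovable);

		if (movable + unmovable != 64)
			inconsistent++;
	}

	pthread_join(tid, NULL);
	printf("inconsistent snapshots: %lu\n", inconsistent);
	return 0;
}

The per-access atomicity holds, but the cross-counter invariant visibly does
not, which is exactly the difference being discussed.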
>
> In fact, I have already looked at /proc/buddyinfo, which collects data under
> the zone lock and uses data_race() to avoid KCSAN reports. But I'm wondering
> whether we could drop its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.
That might be correct. However, if it doesn't significantly affect performance,
and buddyinfo is read far less often than the buddy lists are touched, we may
just leave it as is.
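
For reference, the data_race() pattern being referred to looks roughly like
this (a sketch from memory, not a verbatim quote of mm/vmstat.c):

	/*
	 * nr_free is read only for printing; data_race() marks the access as
	 * an intentional race so KCSAN does not report it.
	 */
	for (order = 0; order < NR_PAGE_ORDERS; ++order)
		seq_printf(m, "%6lu ", data_race(zone->free_area[order].nr_free));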
[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/
Thanks
Barry