Message-ID: <fwthn4zl6uppdjdckjkmglxwnby42x2rd57i3m22pbqamjzaxy@aso4l7xyvhek>
Date: Mon, 10 Nov 2025 20:19:04 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Leon Huang Fu <leon.huangfu@...pee.com>
Cc: shakeel.butt@...ux.dev, akpm@...ux-foundation.org, 
	cgroups@...r.kernel.org, corbet@....net, hannes@...xchg.org, inwardvessel@...il.com, 
	jack@...e.cz, joel.granados@...nel.org, kyle.meyer@....com, 
	lance.yang@...ux.dev, laoar.shao@...il.com, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, mclapinski@...gle.com, mhocko@...nel.org, 
	muchun.song@...ux.dev, roman.gushchin@...ux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file

On Mon, Nov 10, 2025 at 02:37:57PM +0800, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 7:56 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Thu, Nov 06, 2025 at 11:30:45AM +0800, Leon Huang Fu wrote:
> > > On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > >
> > > > +Yosry, JP
> > > >
> > > > On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > > > > On high-core count systems, memory cgroup statistics can become stale
> > > > > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > > > > management applications sometimes need guaranteed up-to-date statistics
> > > > > at specific points in time to make accurate decisions.
> > > >
> > > > Can you explain a bit more on your environment where you are seeing
> > > > stale stats? More specifically, how often the management applications
> > > > are reading the memcg stats and if these applications are reading memcg
> > > > stats for each nodes of the cgroup tree.
> > > >
> > > > We force-flush all the memcg stats at the root level every 2 seconds,
> > > > but it seems like that is not enough for your case. I am fine with an
> > > > explicit way for users to flush the memcg stats. That way, only the
> > > > users who want it have to pay the flush cost.
> > > >
> > >
> > > Thanks for the feedback. I encountered this issue while running the LTP
> > > memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> > > where it consistently failed.
> > >
> > > I was aware that Yosry had improved the memory statistics refresh mechanism
> > > in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> > > backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> > > kernel with those improvements, the test still fails intermittently on XFS.
> > >
> > > I've created a simplified reproducer that mirrors the LTP test behavior. The
> > > test allocates 50 MiB of page cache and then verifies that memory.current and
> > > memory.stat's "file" field are approximately equal (within 5% tolerance).
> > >
> > > The failure pattern looks like:
> > >
> > >   After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
> > >   Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL
> > >
> > > Here's the reproducer code and test script (attached below for reference).
> > >
> > > To reproduce on XFS:
> > >   sudo ./run.sh --xfs
> > >   for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
> > >   sudo ./run.sh --cleanup
> > >
> > > The test fails sporadically, typically a few times out of 100 runs, confirming
> > > that the improved flush isn't sufficient for this workload pattern.
> >
> > I was hoping that you had a real-world workload or scenario that is
> > facing this issue. For the test, a simple 'sleep 2' would be enough.
> > Anyway, that is not an argument against adding an interface for flushing.
> >
> 
> Fair point. I haven't encountered a production issue yet - this came up during
> our kernel testing phase on high-core count servers (224-256 cores) before
> deploying to production.
> 
> The LTP test failure was the indicator that prompted investigation. While
> adding 'sleep 2' would fix the test, it highlights a broader concern: on these
> high-core systems, the batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus)
> can accumulate 14K-16K events before auto-flush, potentially causing significant
> staleness for workloads that need timely statistics.

The thresholding is implemented as a tradeoff between expensive flushing
and accurate stats, and it aims to at least provide deterministic
behavior in terms of how much the stats can deviate.
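
For concreteness, the check behind that bound (sketched here from the
post-6.8 subtree flushing code; the exact form in the current tree may
differ) is along these lines:

/* Sketch only: flushing is skipped until enough updates accumulate. */
static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
	return atomic64_read(&vmstats->stats_updates) >
		MEMCG_CHARGE_BATCH * num_online_cpus();
}

With MEMCG_CHARGE_BATCH at 64, that is 64 * 224 = 14336 and
64 * 256 = 16384 pending updates on the machines mentioned above, which
is where the 14K-16K figure comes from.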

That being said, it's understandable that some use cases require even
higher accuracy and are willing to pay the price, although I share
Shakeel's frustration that the driving motivation is tests, which could
simply sleep for 2 seconds or be altered to allow some bounded deviation.

The two alternatives I can think of are a synchronous flushing interface
and some sort of tunable that determines the needed accuracy. The latter
sounds like it would be difficult to design properly and may end up with
some of the same problems as swappiness, so I think the synchronous
flushing interface is probably the way to go. This was also brought up
before, when the thresholding was implemented.
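
Just to make the shape concrete, a write handler for such a file could be
as small as the sketch below (illustrative only; the handler name and
whether to reuse mem_cgroup_flush_stats() or do an unconditional rstat
flush are details for the actual patch):

/*
 * Sketch of a possible handler: any write flushes this memcg's subtree
 * stats. Note that mem_cgroup_flush_stats() as it exists today is
 * threshold-gated, so a real patch would likely want an unconditional
 * flush here instead.
 */
static ssize_t memory_stat_refresh_write(struct kernfs_open_file *of,
					 char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));

	mem_cgroup_flush_stats(memcg);
	return nbytes;
}

plus a "stat_refresh" entry in the cgroup v2 memory cftype table to
expose it.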

If we ever change the stats implementation completely and lose the
concept of flushes/refreshes, the interface can just be a noop, and we
can document that writes are useless (or even print something in dmesg).

So no objections from me.

> 
> We're planning to deploy container workloads on these servers where memory
> statistics drive placement and resource management decisions. Having an explicit
> flush interface would give us confidence that when precision matters (e.g.,
> admission control, OOM decisions), we can get accurate stats on demand rather
> than relying on timing or hoping the 2-second periodic flush happens when needed.
> 
> I understand this is more of a "preparing for future needs" situation than
> a "fixing current production breakage" one. However, given that the
> interface provides opt-in control at no cost to users who don't need it,
> I believe it's a reasonable addition. I'll prepare a v3 with the dedicated
> memory.stat_refresh file as suggested.
> 
> Thanks,
> Leon
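
From userspace, the intended pattern for the proposed file would
presumably be "write anything, then read" (hypothetical cgroup path, and
assuming memory.stat_refresh accepts any write), roughly:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[8192];
	ssize_t n;
	int fd;

	/* Poke memory.stat_refresh to request a synchronous flush. */
	fd = open("/sys/fs/cgroup/test/memory.stat_refresh", O_WRONLY);
	if (fd < 0 || write(fd, "1", 1) < 0) {
		perror("stat_refresh");
		return 1;
	}
	close(fd);

	/* memory.stat should now reflect freshly flushed counters. */
	fd = open("/sys/fs/cgroup/test/memory.stat", O_RDONLY);
	if (fd < 0 || (n = read(fd, buf, sizeof(buf) - 1)) < 0) {
		perror("memory.stat");
		return 1;
	}
	buf[n] = '\0';
	close(fd);

	fputs(buf, stdout);
	return 0;
}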
