Message-ID: <20251110063757.86725-1-leon.huangfu@shopee.com>
Date: Mon, 10 Nov 2025 14:37:57 +0800
From: Leon Huang Fu <leon.huangfu@...pee.com>
To: shakeel.butt@...ux.dev
Cc: akpm@...ux-foundation.org,
cgroups@...r.kernel.org,
corbet@....net,
hannes@...xchg.org,
inwardvessel@...il.com,
jack@...e.cz,
joel.granados@...nel.org,
kyle.meyer@....com,
lance.yang@...ux.dev,
laoar.shao@...il.com,
leon.huangfu@...pee.com,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
mclapinski@...gle.com,
mhocko@...nel.org,
muchun.song@...ux.dev,
roman.gushchin@...ux.dev,
yosry.ahmed@...ux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file

On Fri, Nov 7, 2025 at 7:56 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Thu, Nov 06, 2025 at 11:30:45AM +0800, Leon Huang Fu wrote:
> > On Thu, Nov 6, 2025 at 9:19 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > +Yosry, JP
> > >
> > > On Wed, Nov 05, 2025 at 03:49:16PM +0800, Leon Huang Fu wrote:
> > > > On high-core count systems, memory cgroup statistics can become stale
> > > > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > > > management applications sometimes need guaranteed up-to-date statistics
> > > > at specific points in time to make accurate decisions.
> > >
> > > Can you explain a bit more on your environment where you are seeing
> > > stale stats? More specifically, how often the management applications
> > > are reading the memcg stats and if these applications are reading memcg
> > > stats for each node of the cgroup tree.
> > >
> > > We force flush all the memcg stats at root level every 2 seconds but it
> > > seems like that is not enough for your case. I am fine with an explicit
> > > way for users to flush the memcg stats. That way, only the users who want
> > > it have to pay the flush cost.
> > >
> >
> > Thanks for the feedback. I encountered this issue while running the LTP
> > memcontrol02 test case [1] on a 256-core server with the 6.6.y kernel on XFS,
> > where it consistently failed.
> >
> > I was aware that Yosry had improved the memory statistics refresh mechanism
> > in "mm: memcg: subtree stats flushing and thresholds" [2], so I attempted to
> > backport that patchset to 6.6.y [3]. However, even on the 6.15.0-061500-generic
> > kernel with those improvements, the test still fails intermittently on XFS.
> >
> > I've created a simplified reproducer that mirrors the LTP test behavior. The
> > test allocates 50 MiB of page cache and then verifies that memory.current and
> > memory.stat's "file" field are approximately equal (within 5% tolerance).
> >
> > The failure pattern looks like:
> >
> > After alloc: memory.current=52690944, memory.stat.file=48496640, size=52428800
> > Checks: current>=size=OK, file>0=OK, current~=file(5%)=FAIL
> >
> > Here's the reproducer code and test script (attached below for reference).
> >
> > To reproduce on XFS:
> > sudo ./run.sh --xfs
> > for i in {1..100}; do sudo ./run.sh --run; echo "==="; sleep 0.1; done
> > sudo ./run.sh --cleanup
> >
> > The test fails sporadically, typically a few times out of 100 runs, confirming
> > that the improved flush isn't sufficient for this workload pattern.
>
> I was hoping that you had a real-world workload/scenario which is
> facing this issue. For the test, a simple 'sleep 2' would be enough.
> Anyway, that is not an argument against adding an interface for flushing.
>
Fair point. I haven't encountered a production issue yet; this came up during
our kernel testing on high-core-count servers (224-256 cores) before deploying
them to production.

The LTP test failure was what prompted the investigation. While adding a
'sleep 2' would make the test pass, the failure points to a broader concern:
on these high-core systems, the batching threshold (MEMCG_CHARGE_BATCH *
num_online_cpus()) can let 14K-16K events accumulate before an automatic
flush, potentially leaving the stats significantly stale for workloads that
need timely statistics.
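
For concreteness (assuming MEMCG_CHARGE_BATCH is still 64, as in current
mainline), that threshold works out to roughly:

    64 * 224 CPUs = 14,336 events
    64 * 256 CPUs = 16,384 events

which is where the 14K-16K figure above comes from.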

We're planning to deploy container workloads on these servers, where memory
statistics drive placement and resource-management decisions. An explicit
flush interface would give us confidence that, when precision matters (e.g.
admission control, OOM decisions), we can get accurate stats on demand rather
than relying on the 2-second periodic flush landing at the right moment.

I understand this is more a case of preparing for future needs than of fixing
current production breakage. However, since the interface is opt-in and costs
nothing for users who don't need it, I believe it's a reasonable addition.
I'll prepare a v3 with the dedicated memory.stat_refresh file as suggested.
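
As a rough sketch of how we would use it (the file name and write semantics
here are just the proposal from this thread, not a merged interface):

    # force an immediate flush of this cgroup's stats, then read them
    echo 1 > /sys/fs/cgroup/<cgroup>/memory.stat_refresh
    cat /sys/fs/cgroup/<cgroup>/memory.stat
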
Thanks,
Leon