Message-ID: <fd9d20c3-1bc0-4bad-bc5e-7d9549ddf8fa@gmail.com>
Date: Thu, 29 Jan 2026 22:49:26 +0000
From: Usama Arif <usamaarif642@...il.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Johannes Weiner <hannes@...xchg.org>, Rik van Riel <riel@...riel.com>,
Song Liu <songliubraving@...com>, Kiryl Shutsemau <kas@...nel.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>, Nico Pache
<npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Lance Yang <lance.yang@...ux.dev>, Meta kernel team <kernel-team@...a.com>,
linux-mm@...ck.org, cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in
collapse_file()
On 29/01/2026 18:40, Shakeel Butt wrote:
> In META's fleet, we are seeing high level cgroups with zero file memcg
> stat but their descendants have non-zero file stat. This should not be
> possible. On further inspection by looking at kernel data structures
> through drgn, it was revealed that the high level cgroups have negative
> file stat which was aggregated from their children.
>
> Another interesting point was that this specific issue started happening
> more often as we started deploying thp-always more widely which
> indicates some correlation between file memory and THPs and indeed it
> was found that file memcg stat accounting is buggy in the collapse code
> path from the start.
>
> When collapse_file() replaces small folios with a large THP, it fails to
> properly update the NR_FILE_PAGES memcg stat for both the old folios
> being freed and the new THP being added. It assumes the old and new
> folios belong to the same cgroup. However this assumption breaks in
> a couple of scenarios:
>
> 1. Binary (executable) package downloader running in a different cgroup
> than the actual job executing the downloaded package.
>
> 2. File shared and mapped by processes running in different cgroups. One
> process reads in the file and the second process collapses it, either
> through madvise(MADV_COLLAPSE) or through khugepaged acting on its
> behalf.
>
> So, the current code has two bugs:
>
> 1. For non-shmem files, NR_FILE_PAGES is never incremented for the new
> THP because nr_none is always 0 for non-shmem, and the stat update is
> inside the "if (nr_none)" block.
>
> 2. When freeing old folios, NR_FILE_PAGES is never decremented because
> folio->mapping is set to NULL directly without calling
> filemap_unaccount_folio().
>
> These bugs cause incorrect per-memcg accounting when the process
> triggering the collapse (MADV_COLLAPSE or khugepaged) belongs to a
> different memcg than the process that originally faulted in the pages:
>
> - Process A (memcg X) reads file, creating 512 small page cache folios
> charged to memcg X (NR_FILE_PAGES += 512 for memcg X)
>
> - Process B (memcg Y) triggers collapse via MADV_COLLAPSE, or khugepaged
> scans B's mm. The new THP is charged to memcg Y.
>
> - Old folios freed: NR_FILE_PAGES not decremented (bug)
> New THP added: NR_FILE_PAGES not incremented (bug)
>
> - Later, THP removed from page cache: NR_FILE_PAGES -= 512 for memcg Y
>
> Result: memcg X has +512 inflated pages, memcg Y has -512 (negative!)
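As a sanity check on the sequence above, here is a toy Python model of the
per-memcg counters. It is only a sketch of the arithmetic, not kernel code:
the dictionary, the function name and the "X"/"Y" keys are made up for
illustration, and 512 is the HPAGE_PMD_NR value from the example above.

```python
from collections import defaultdict

HPAGE_PMD_NR = 512  # small pages per THP, as in the example above


def simulate_buggy_collapse():
    """Toy model of per-memcg NR_FILE_PAGES across a cross-cgroup collapse."""
    nr_file_pages = defaultdict(int)

    # Process A (memcg X) reads the file: 512 small folios enter the page cache.
    nr_file_pages["X"] += HPAGE_PMD_NR

    # Process B (memcg Y) triggers the collapse. With the bug, neither the
    # removal of the old folios nor the insertion of the new THP touches
    # the stat, so nothing changes here.

    # Later the THP, charged to memcg Y, is removed from the page cache.
    nr_file_pages["Y"] -= HPAGE_PMD_NR

    return dict(nr_file_pages)


print(simulate_buggy_collapse())  # {'X': 512, 'Y': -512}
```

The two errors cancel when summed at a common ancestor, but each cgroup's own
counter is wrong, which matches the negative values seen through drgn.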
>
> Fix this by:
> 1. Always incrementing NR_FILE_PAGES by HPAGE_PMD_NR for the new THP
> 2. Decrementing NR_FILE_PAGES for each old folio before clearing its
> mapping pointer
>
> For shmem with holes (nr_none > 0), the net change is still +nr_none
> since we decrement (HPAGE_PMD_NR - nr_none) old pages and increment
> HPAGE_PMD_NR new pages.
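The net change with the fix applied can be checked with the same kind of toy
model. This is a sketch under assumptions, not the patch itself: the function
name is hypothetical, and it assumes all old folios are charged to a single
memcg, which need not hold in practice.

```python
from collections import defaultdict

HPAGE_PMD_NR = 512  # small pages per THP, as in the example above


def fixed_collapse_deltas(old_memcg, new_memcg, nr_none=0):
    """Per-memcg NR_FILE_PAGES deltas for one collapse with the fix applied.

    Assumes every old folio is charged to old_memcg (illustration only).
    nr_none is the number of holes, which is nonzero only for shmem.
    """
    deltas = defaultdict(int)
    # Each old folio is unaccounted before its mapping pointer is cleared.
    deltas[old_memcg] -= HPAGE_PMD_NR - nr_none
    # The new THP is always accounted in full.
    deltas[new_memcg] += HPAGE_PMD_NR
    return dict(deltas)


# Same memcg, shmem with 3 holes: net change is +nr_none.
print(fixed_collapse_deltas("X", "X", nr_none=3))  # {'X': 3}
# Different memcgs, no holes: the pages move from X to Y.
print(fixed_collapse_deltas("X", "Y"))  # {'X': -512, 'Y': 512}
```

In the cross-cgroup case the later removal of the THP then subtracts 512 from
memcg Y and brings both counters back to zero.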
>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
Acked-by: Usama Arif <usamaarif642@...il.com>