Message-ID: <a55beb72-4288-4356-9642-76ab35a2a07c@lucifer.local>
Date: Fri, 20 Jun 2025 12:50:43 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: syzbot <syzbot+a74a028d848147bc5931@...kaller.appspotmail.com>
Cc: akpm@...ux-foundation.org, cgroups@...r.kernel.org, hannes@...xchg.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, mhocko@...nel.org,
muchun.song@...ux.dev, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [cgroups?] [mm?] WARNING in folio_lruvec_lock
OK, I think this might well be me - apologies. I definitely see a suspicious-looking
bug. TL;DR - will fix, it's not upstream yet.

Thanks to Andrew for forwarding this to me - some useful insight there!
So it looks like in [0] we mistakenly do the KSM flag update _before_ the
.mmap() hook, which is... not good.
This results in the correct checks not being applied to the VMA, because
e.g. VM_HUGETLB will not be set until after the .mmap() hook has been completed
(I'm working on converting the hooks to .mmap_prepare() but we're not there
yet...)
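(For context: KSM eligibility is gated on VMA flags, something like the below -
paraphrased from memory, so the exact flag list in vma_ksm_compatible() may
differ - but VM_HUGETLB is one of the exclusions, and hugetlbfs only sets that
flag from its .mmap() hook (hugetlbfs_file_mmap()), so a check performed before
the hook runs cannot see it:

static bool vma_ksm_compatible(struct vm_area_struct *vma)
{
        /* Rough sketch from memory - not the exact upstream flag list. */
        if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP | VM_IO |
                             VM_DONTEXPAND | VM_HUGETLB | VM_MIXEDMAP))
                return false;

        ...
        return true;
}

So if we mark the VMA VM_MERGEABLE before the .mmap() hook has set VM_HUGETLB,
a hugetlb mapping can end up KSM-eligible when it never should be.)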
[0]: https://lore.kernel.org/all/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com/
I will send a fix there.
Thanks, Lorenzo
On Thu, Jun 19, 2025 at 05:02:31AM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: bc6e0ba6c9ba Add linux-next specific files for 20250613
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1126090c580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=2f7a2e4d17ed458f
> dashboard link: https://syzkaller.appspot.com/bug?extid=a74a028d848147bc5931
> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/2430bb0465cc/disk-bc6e0ba6.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/436a39deef0a/vmlinux-bc6e0ba6.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/e314ca5b1eb3/bzImage-bc6e0ba6.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+a74a028d848147bc5931@...kaller.appspotmail.com
>
> handle_mm_fault+0x740/0x8e0 mm/memory.c:6397
I mean this is:
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
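(i.e. we took the hugetlb fault path, which only happens because VM_HUGETLB is
set on the VMA - from memory the dispatch in handle_mm_fault() is roughly:

/* Sketch from memory of the relevant dispatch in handle_mm_fault(). */
if (unlikely(is_vm_hugetlb_page(vma)))
        ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
        ret = __handle_mm_fault(vma, address, flags);
)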
Interestingly, I see in mem_cgroup_charge_hugetlb():
/*
 * Even memcg does not account for hugetlb, we still want to update
 * system-level stats via lruvec_stat_mod_folio. Return 0, and skip
 * charging the memcg.
 */
if (mem_cgroup_disabled() || !memcg_accounts_hugetlb() ||
    !memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
        goto out;

if (charge_memcg(folio, memcg, gfp))
        ret = -ENOMEM;
So maybe KSM is somehow touching hugetlb (which it shouldn't be doing...) and
hitting an uncharged folio...?
This aligns with us having set KSM flags at the wrong time on a hugetlb mapping.
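(Note that hugetlb folios are only charged to the memcg at all when the
memory_hugetlb_accounting cgroup mount option is set. If memory serves,
memcg_accounts_hugetlb() is just:

/* Sketch from memory - true only with the memory_hugetlb_accounting mount option. */
static bool memcg_accounts_hugetlb(void)
{
        return cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
}

So in the common case we take the 'goto out' above before charge_memcg(),
folio->memcg_data is never set, and folio_memcg() will later return NULL for
the hugetlb folio - which fits the warning below.)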
> faultin_page mm/gup.c:1186 [inline]
> __get_user_pages+0x1aef/0x30b0 mm/gup.c:1488
> populate_vma_page_range+0x29f/0x3a0 mm/gup.c:1922
> __mm_populate+0x24c/0x380 mm/gup.c:2025
> mm_populate include/linux/mm.h:3354 [inline]
> vm_mmap_pgoff+0x3f0/0x4c0 mm/util.c:584
> ksys_mmap_pgoff+0x587/0x760 mm/mmap.c:607
> do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> page_owner free stack trace missing
I'm guessing this is the process stack of the repro (even though syzkaller can't repro :P)
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 38 at ./include/linux/memcontrol.h:732 folio_lruvec include/linux/memcontrol.h:732 [inline]
This is:
static inline struct lruvec *folio_lruvec(struct folio *folio)
{
        struct mem_cgroup *memcg = folio_memcg(folio);

        VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled(), folio); <---- here
        return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}
Meaning folio_memcg() is failing to find a memcg for the folio.
I'm not really that familiar with the cgroup implementation, but:

static inline struct mem_cgroup *folio_memcg(struct folio *folio)
{
        if (folio_memcg_kmem(folio))
                return obj_cgroup_memcg(__folio_objcg(folio));
        return __folio_memcg(folio); <--- seems this is what is returning NULL?
}
I guess it's __folio_memcg() that's returning NULL as apparently
obj_cgroup_memcg() should always return something non-NULL.
And this is:
static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
{
        unsigned long memcg_data = folio->memcg_data;

        ...

        return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
So if folio->memcg_data is zero, or is zero once the flag bits are masked off,
this will return NULL.
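(For reference, the low bits of memcg_data are flag bits - something like the
below, if I'm remembering the encoding right; the exact names/values may
differ - and ~OBJEXTS_FLAGS_MASK just strips those off:

/* Paraphrased from memory - the exact flag names/values may differ. */
enum page_memcg_data_flags {
        /* memcg_data points at a slabobj_ext vector, not a memcg */
        MEMCG_DATA_OBJEXTS      = 1UL << 0,
        /* the page was accounted as a non-slab kernel allocation */
        MEMCG_DATA_KMEM         = 1UL << 1,
};

So a memcg_data value consisting only of flag bits also decodes to a NULL memcg.)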
I see this is set to NULL (or rather 0) in mem_cgroup_migrate(), in
__memcg_kmem_uncharge_page() (but this isn't kmem, is it?), and in uncharge_folio().

We also set the memcg in charge_memcg() -> commit_charge(), so perhaps a charge
was expected here that somehow didn't happen?
This again aligns with a mis-flagged hugetlb folio.
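(For reference, commit_charge() boils down to just storing the memcg pointer -
roughly, from memory:

static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
{
        /* Rough sketch - the real thing also asserts the folio isn't already charged. */
        folio->memcg_data = (unsigned long)memcg;
}

So if that charge never happens for a folio, memcg_data stays 0 and
folio_memcg() returns NULL, which is exactly what the warning is complaining
about.)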
> WARNING: CPU: 0 PID: 38 at ./include/linux/memcontrol.h:732 folio_lruvec_lock+0x150/0x1a0 mm/memcontrol.c:1211
This is:

struct lruvec *folio_lruvec_lock(struct folio *folio)
{
        struct lruvec *lruvec = folio_lruvec(folio); <---- here

        spin_lock(&lruvec->lru_lock);
        lruvec_memcg_debug(lruvec, folio);

        return lruvec;
}
> Modules linked in:
> CPU: 0 UID: 0 PID: 38 Comm: ksmd Not tainted 6.16.0-rc1-next-20250613-syzkaller #0 PREEMPT(full)
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
> RIP: 0010:folio_lruvec include/linux/memcontrol.h:732 [inline]
> RIP: 0010:folio_lruvec_lock+0x150/0x1a0 mm/memcontrol.c:1211
> Code: 7c 25 00 00 74 08 4c 89 ff e8 7c 66 f8 ff 4d 89 2f eb c4 48 89 df 48 c7 c6 60 4f 98 8b e8 58 9b dc ff c6 05 01 85 5f 0d 01 90 <0f> 0b 90 e9 d5 fe ff ff 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 4d
> RSP: 0018:ffffc90000ae7660 EFLAGS: 00010046
> RAX: b21d845e3554e000 RBX: ffffea0002108000 RCX: b21d845e3554e000
> RDX: 0000000000000002 RSI: ffffffff8db792e4 RDI: ffff88801de83c00
> RBP: ffffea0002108000 R08: 0000000000000003 R09: 0000000000000004
> R10: dffffc0000000000 R11: fffffbfff1bfaa14 R12: ffffea0002108000
> R13: ffffea0002108008 R14: 0000000000000000 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff888125c41000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f475c15ef98 CR3: 000000005f95a000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __split_unmapped_folio+0x42e/0x2cb0 mm/huge_memory.c:3487
This is:
static int __split_unmapped_folio(struct folio *folio, int new_order,
                struct page *split_at, struct page *lock_at,
                struct list_head *list, pgoff_t end,
                struct xa_state *xas, struct address_space *mapping,
                bool uniform_split)
{
        ...

        /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
        lruvec = folio_lruvec_lock(folio); <--- here

        ...
}
So we're splitting an unmapped folio that is locked, non-LRU and frozen
(refcount == 0).

Interestingly, __split_folio_to_order() sets (new_)folio->memcg_data, but that
is called _after_ this folio_lruvec_lock().
> __folio_split+0xf78/0x1300 mm/huge_memory.c:3891
This is:
ret = __split_unmapped_folio(folio, new_order,
                split_at, lock_at, list, end, &xas, mapping,
                uniform_split);
> cmp_and_merge_page mm/ksm.c:2358 [inline]
So we have tried to merge two pages:
kfolio = try_to_merge_two_pages(rmap_item, page,
                                tree_rmap_item, tree_page);
But failed:
/*
 * If both pages we tried to merge belong to the same compound
 * page, then we actually ended up increasing the reference
 * count of the same compound page twice, and split_huge_page
 * failed.
 * Here we set a flag if that happened, and we use it later to
 * try split_huge_page again. Since we call put_page right
 * afterwards, the reference count will be correct and
 * split_huge_page should succeed.
 */
split = PageTransCompound(page)
        && compound_head(page) == compound_head(tree_page);

if (kfolio) {
        ...
} else if (split) {
        /*
         * We are here if we tried to merge two pages and
         * failed because they both belonged to the same
         * compound page. We will split the page now, but no
         * merging will take place.
         * We do not want to add the cost of a full lock; if
         * the page is locked, it is better to skip it and
         * perhaps try again later.
         */
        if (!trylock_page(page))
                return;
        split_huge_page(page); <---- this is where the failure occurs.
        unlock_page(page);
}
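(Which shouldn't be reachable for hugetlb at all - the only gate on the scan
side is VM_MERGEABLE. Paraphrasing the VMA walk in scan_get_next_rmap_item()
from memory, it's roughly:

/* Rough sketch - the real loop also handles ksm_scan state, mmap locking etc. */
for_each_vma(vmi, vma) {
        if (!(vma->vm_flags & VM_MERGEABLE))
                continue;
        /* ... pages in [vma->vm_start, vma->vm_end) get fed to cmp_and_merge_page() ... */
}

So if we wrongly set VM_MERGEABLE on a hugetlb VMA, per the mis-ordering
described above, ksmd will happily feed hugetlb pages into this compound-page
split path.)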
> ksm_do_scan+0x499b/0x6530 mm/ksm.c:2665
> ksm_scan_thread+0x10b/0x4b0 mm/ksm.c:2687
> kthread+0x711/0x8a0 kernel/kthread.c:464
> ret_from_fork+0x3f9/0x770 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@...glegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
>
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
>
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
>
> If you want to undo deduplication, reply with:
> #syz undup
>
>