linux-kernel - Re: [syzbot] [mm?] INFO: rcu detected stall in sys

Message-ID: <ourgzm4wai237cpcef3ypdn67hspjgw4u7fee4hyouj2hn3gwx@c322noqn4kzq>
Date: Wed, 27 Aug 2025 17:29:02 -0700
From: Josh Poimboeuf <jpoimboe@...nel.org>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	syzbot <syzbot+8785aaf121cfb2141e0d@...kaller.appspotmail.com>, Liam.Howlett@...cle.com, akpm@...ux-foundation.org, 
	jannh@...gle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	pfalcato@...e.de, syzkaller-bugs@...glegroups.com, vbabka@...e.cz, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, peterz@...radead.org
Subject: Re: [syzbot] [mm?] INFO: rcu detected stall in sys_munmap (2)

On Fri, Aug 22, 2025 at 10:55:10PM +0900, Harry Yoo wrote:
> On Fri, Aug 22, 2025 at 01:08:02PM +0100, Lorenzo Stoakes wrote:
> > +cc Sebastian for RCU ORC change...
> > 
> > +cc Harry for slab side.
> 
> +cc Josh and Peter for stack unwinding stuff.
> 
> > Pinging Jann for the CONFIG_SLUB_RCU_DEBUG element.
> > 
> > Jann - could this possibly be related to CONFIG_SLUB_RCU_DEBUG? As it seems
> > the stack is within KASAN, but no KASAN report so maybe it's KASAN itself that's
> > having an issue?
> > 
> > Though I'm thinking maybe it's the orc unwinder itself that could be problematic
> > here (yet invoked by CONFIG_SLUB_RCU_DEBUG though)... and yeah kinda suspicious
> > because:
> > 
> > - We have two threads freeing VMAs using SLAB_TYPESAFE_BY_RCU
> > - CONFIG_SLUB_RCU_DEBUG means that we use KASAN to save an aux stack, which
> >   makes us do an unwind via ORC, which then takes an RCU read lock on
> >   unwind_next_frame(), and both are doing this unwinding at the time of report.
> > - ???
> > - Somehow things get locked up?
> > 
> > I'm not an RCU expert (clearly :) so I'm not sure exactly how this could result
> > in a stall, but it's suspicious.
> 
> Can this be because of misleading ORC data or logical error in ORC unwinder
> that makes it fall into an infinite loop (unwind_done() never returning
> true in arch_stack_walk())?
> 
> ...because the reported line number doesn't really make sense
> as a cause of stalls.

There shouldn't be any way for ORC to hit an infinite loop.  Worst case
it would stop after the caller's buffer fills up.  ORC has always been
solid, and the RCU usage looks fine to me.  I tend to doubt ORC is at
fault here.
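
To illustrate the "stops when the buffer fills up" point, here is a rough
userspace model, not the kernel code.  Every name in it (trace_buf,
consume_entry, fake_unwind_*, walk_stack) is hypothetical and only loosely
follows the shape of a callback-driven stack walk.  The fake unwinder is
deliberately broken: its frames form a cycle and it never reports the end of
the stack.  The walk still terminates, because the consumer callback returns
false once the caller's buffer is full.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a caller-supplied trace buffer. */
struct trace_buf {
	unsigned long *store;	/* caller-supplied array */
	size_t size;		/* capacity, in entries */
	size_t len;		/* entries stored so far */
};

/* Store one address; return false once no more entries can be taken. */
static bool consume_entry(void *cookie, unsigned long addr)
{
	struct trace_buf *buf = cookie;

	if (buf->len >= buf->size)
		return false;
	buf->store[buf->len++] = addr;
	return buf->len < buf->size;
}

/* Deliberately broken unwinder state: only a fake program counter. */
struct fake_unwind_state {
	unsigned long pc;
};

/* Simulates bad unwind data: the end of the stack is never reached. */
static bool fake_unwind_done(const struct fake_unwind_state *state)
{
	(void)state;
	return false;
}

/* Frames cycle through 0x1000, 0x1008, 0x1010, 0x1018 forever. */
static void fake_unwind_next(struct fake_unwind_state *state)
{
	state->pc = 0x1000 + ((state->pc - 0x1000 + 8) % 32);
}

/* Walk the fake stack; return how many frames were visited. */
static size_t walk_stack(struct trace_buf *buf)
{
	struct fake_unwind_state state = { .pc = 0x1000 };
	size_t steps = 0;

	for (; !fake_unwind_done(&state); fake_unwind_next(&state)) {
		steps++;
		/* The consumer, not the unwinder, bounds the loop. */
		if (!consume_entry(buf, state.pc))
			break;
	}
	return steps;
}
```

With a 16-entry buffer, walk_stack() visits 16 frames and returns, even
though fake_unwind_done() never becomes true.  This only models the buffer
bound; it says nothing about how long an individual unwind step takes with
heavy debugging options enabled.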

Maybe some interaction higher up the stack is causing things to run in a
tight loop.

All those debugging options (e.g., DEBUG_VM_MAPLE_TREE, LOCKDEP, KASAN,
SLUB_RCU_DEBUG...) could be a factor in slowing things down to a crawl.

-- 
Josh