linux-kernel - Re: [syzbot] [mm?] INFO: rcu detected stall in sys

Message-ID: <vxxrlgnkujdpjl7jv5alkr5kjatq4f4jb6fw45t5son6qwzkuj@hgwye3dhqthm>
Date: Wed, 27 Aug 2025 23:35:37 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>, Harry Yoo <harry.yoo@...cle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        syzbot <syzbot+8785aaf121cfb2141e0d@...kaller.appspotmail.com>,
        akpm@...ux-foundation.org, jannh@...gle.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, pfalcato@...e.de,
        syzkaller-bugs@...glegroups.com, vbabka@...e.cz,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        peterz@...radead.org
Subject: Re: [syzbot] [mm?] INFO: rcu detected stall in sys_munmap (2)

* Liam R. Howlett <Liam.Howlett@...cle.com> [250827 21:58]:
> * Josh Poimboeuf <jpoimboe@...nel.org> [250827 20:29]:
> > On Fri, Aug 22, 2025 at 10:55:10PM +0900, Harry Yoo wrote:
> > > On Fri, Aug 22, 2025 at 01:08:02PM +0100, Lorenzo Stoakes wrote:
> > > > +cc Sebastian for RCU ORC change...
> > > > 
> > > > +cc Harry for slab side.
> > > 
> > > +cc Josh and Peter for stack unwinding stuff.
> > > 
> > > > Pinging Jann for the CONFIG_SLUB_RCU_DEBUG element.
> > > > 
> > > > Jann - could this possibly be related to CONFIG_SLUB_RCU_DEBUG? It seems the
> > > > stack is within KASAN, but there is no KASAN report, so maybe it's KASAN itself
> > > > that's having an issue?
> > > > 
> > > > Though I'm thinking maybe it's the orc unwinder itself that could be problematic
> > > > here (yet invoked by CONFIG_SLUB_RCU_DEBUG though)... and yeah kinda suspicious
> > > > because:
> > > > 
> > > > - We have two threads freeing VMAs using SLAB_TYPESAFE_BY_RCU
> > > > - CONFIG_SLUB_RCU_DEBUG means that we use KASAN to save an aux stack, which
> > > >   makes us do an unwind via ORC, which then takes an RCU read lock on
> > > >   unwind_next_frame(), and both are doing this unwinding at the time of report.
> > > > - ???
> > > > - Somehow things get locked up?
> > > > 
> > > > I'm not an RCU expert (clearly :) so I'm not sure exactly how this could result
> > > > in a stall, but it's suspicious.
> > > 
> > > Can this be because of misleading ORC data or logical error in ORC unwinder
> > > that makes it fall into an infinite loop (unwind_done() never returning
> > > true in arch_stack_walk())?
> > > 
> > > ...because the reported line number doesn't really make sense
> > > as a cause of stalls.
> > 
> > There shouldn't be any way for ORC to hit an infinite loop.  Worst case
> > it would stop after the caller's buffer fills up.  ORC has always been
> > solid, and the RCU usage looks fine to me.  I tend to doubt ORC is at
> > fault here.
> > 
> > Maybe some interaction higher up the stack is causing things to run in a
> > tight loop.
> > 
> > All those debugging options (e.g., DEBUG_VM_MAPLE_TREE, LOCKDEP, KASAN,
> > SLUB_RCU_DEBUG...) could be a factor in slowing things down to a crawl.
> 
> DEBUG_VM_MAPLE_TREE is super heavy, but that comes from validate_mm()
> which would be the last thing to happen before returning, usually.
> 
> I mean surely that would show up in the logs.
> 
> Okay it's in the second log on the dashboard..
> 
> Yeah, I think it's debug options eventually causing failure.  Apparently
> there's a reproducer for syz now but without the validate_mm().

I don't think it's the debugging options as removing the validate_mm()
did not help.

We may want to wait for a C reproducer.
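
For reference, Josh's point quoted above (worst case, the unwind stops once
the caller's buffer fills up) can be illustrated with a small userspace
sketch. This is not the kernel's code: the toy_* names, the struct layout and
the "corrupt" flag are all hypothetical stand-ins, assumed only for
illustration. It shows a walk loop whose termination does not depend on the
"done" check ever becoming true, because the store size caps the iterations.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for an unwinder state. This is NOT the kernel's
 * struct unwind_state; every name here is made up for illustration.
 */
struct toy_unwind_state {
	const unsigned long *frames;	/* fake return addresses */
	size_t nr_frames;		/* must be > 0 */
	size_t pos;			/* frames visited so far */
	int corrupt;			/* simulate unwind data that never ends */
};

/* With "corrupt" set, this never reports completion. */
static int toy_unwind_done(const struct toy_unwind_state *s)
{
	return !s->corrupt && s->pos >= s->nr_frames;
}

/* Return the current fake address and advance; wraps around if corrupt. */
static unsigned long toy_unwind_next(struct toy_unwind_state *s)
{
	unsigned long addr = s->frames[s->pos % s->nr_frames];

	s->pos++;
	return addr;
}

/*
 * Walk the fake stack into a caller-supplied buffer. The loop ends either
 * when the walk is done or when the buffer is full, so at most "size"
 * entries are ever produced, even with corrupt data.
 */
static size_t toy_stack_save(struct toy_unwind_state *s,
			     unsigned long *store, size_t size)
{
	size_t len = 0;

	while (!toy_unwind_done(s)) {
		if (len >= size)
			break;	/* caller's buffer is full: stop walking */
		store[len++] = toy_unwind_next(s);
	}
	return len;
}
```

With well-formed data the walk returns the real depth; with the "corrupt"
flag set it returns exactly the buffer size and stops. If the real unwinder
behaves like this, a stall would have to come from how often the walk is
invoked, not from a single walk spinning.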