linux-kernel - Re: [syzbot] [mm?] INFO: rcu detected stall in sys

Message-ID: <tvtzgibiy5fvmf7rms4jeyim3lx4nas7qmgv36oryizdvwaujh@bsxqbd3nii55>
Date: Wed, 27 Aug 2025 21:57:59 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: Harry Yoo <harry.yoo@...cle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        syzbot <syzbot+8785aaf121cfb2141e0d@...kaller.appspotmail.com>,
        akpm@...ux-foundation.org, jannh@...gle.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, pfalcato@...e.de,
        syzkaller-bugs@...glegroups.com, vbabka@...e.cz,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        peterz@...radead.org
Subject: Re: [syzbot] [mm?] INFO: rcu detected stall in sys_munmap (2)

* Josh Poimboeuf <jpoimboe@...nel.org> [250827 20:29]:
> On Fri, Aug 22, 2025 at 10:55:10PM +0900, Harry Yoo wrote:
> > On Fri, Aug 22, 2025 at 01:08:02PM +0100, Lorenzo Stoakes wrote:
> > > +cc Sebastian for RCU ORC change...
> > > 
> > > +cc Harry for slab side.
> > 
> > +cc Josh and Peter for stack unwinding stuff.
> > 
> > > Pinging Jann for the CONFIG_SLUB_RCU_DEBUG element.
> > > 
> > > Jann - could this possibly be related to CONFIG_SLUB_RCU_DEBUG? As it seems
> > > the stack is within KASAN, but no KASAN report so maybe it's KASAN itself that's
> > > having an issue?
> > > 
> > > Though I'm thinking maybe it's the orc unwinder itself that could be problematic
> > > here (yet invoked by CONFIG_SLUB_RCU_DEBUG though)... and yeah kinda suspicious
> > > because:
> > > 
> > > - We have two threads freeing VMAs using SLAB_TYPESAFE_BY_RCU
> > > - CONFIG_SLUB_RCU_DEBUG means that we use KASAN to save an aux stack, which
> > >   makes us do an unwind via ORC, which then takes an RCU read lock on
> > >   unwind_next_frame(), and both are doing this unwinding at the time of report.
> > > - ???
> > > - Somehow things get locked up?
> > > 
> > > I'm not an RCU expert (clearly :) so I'm not sure exactly how this could result
> > > in a stall, but it's suspicious.
> > 
> > Can this be because of misleading ORC data or logical error in ORC unwinder
> > that makes it fall into an infinite loop (unwind_done() never returning
> > true in arch_stack_walk())?
> > 
> > ...because the reported line number doesn't really make sense
> > as a cause of stalls.
> 
> There shouldn't be any way for ORC to hit an infinite loop.  Worst case
> it would stop after the caller's buffer fills up.  ORC has always been
> solid, and the RCU usage looks fine to me.  I tend to doubt ORC is at
> fault here.
> 
> Maybe some interaction higher up the stack is causing things to run in a
> tight loop.
> 
> All those debugging options (e.g., DEBUG_VM_MAPLE_TREE, LOCKDEP, KASAN,
> SLUB_RCU_DEBUG...) could be a factor in slowing things down to a crawl.

DEBUG_VM_MAPLE_TREE is super heavy, but that comes from validate_mm()
which would be the last thing to happen before returning, usually.

I mean surely that would show up in the logs.

Okay, it's in the second log on the dashboard.

Yeah, I think it's debug options eventually causing failure.  Apparently
there's a reproducer for syz now but without the validate_mm().