Message-ID: <tvtzgibiy5fvmf7rms4jeyim3lx4nas7qmgv36oryizdvwaujh@bsxqbd3nii55>
Date: Wed, 27 Aug 2025 21:57:59 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: Harry Yoo <harry.yoo@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
syzbot <syzbot+8785aaf121cfb2141e0d@...kaller.appspotmail.com>,
akpm@...ux-foundation.org, jannh@...gle.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, pfalcato@...e.de,
syzkaller-bugs@...glegroups.com, vbabka@...e.cz,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
peterz@...radead.org
Subject: Re: [syzbot] [mm?] INFO: rcu detected stall in sys_munmap (2)

* Josh Poimboeuf <jpoimboe@...nel.org> [250827 20:29]:
> On Fri, Aug 22, 2025 at 10:55:10PM +0900, Harry Yoo wrote:
> > On Fri, Aug 22, 2025 at 01:08:02PM +0100, Lorenzo Stoakes wrote:
> > > +cc Sebastian for RCU ORC change...
> > >
> > > +cc Harry for slab side.
> >
> > +cc Josh and Peter for stack unwinding stuff.
> >
> > > Pinging Jann for the CONFIG_SLUB_RCU_DEBUG element.
> > >
> > > Jann - could this possibly be related to CONFIG_SLUB_RCU_DEBUG? It seems the
> > > stack is within KASAN, but there is no KASAN report, so maybe it's KASAN itself
> > > that's having an issue?
> > >
> > > Though I'm thinking maybe it's the orc unwinder itself that could be problematic
> > > here (yet invoked by CONFIG_SLUB_RCU_DEBUG though)... and yeah kinda suspicious
> > > because:
> > >
> > > - We have two threads freeing VMAs using SLAB_TYPESAFE_BY_RCU
> > > - CONFIG_SLUB_RCU_DEBUG means that we use KASAN to save an aux stack, which
> > > makes us do an unwind via ORC, which then takes an RCU read lock on
> > > unwind_next_frame(), and both are doing this unwinding at the time of report.
> > > - ???
> > > - Somehow things get locked up?
> > >
> > > I'm not an RCU expert (clearly :) so I'm not sure exactly how this could result
> > > in a stall, but it's suspicious.
> >
> > Can this be because of misleading ORC data or logical error in ORC unwinder
> > that makes it fall into an infinite loop (unwind_done() never returning
> > true in arch_stack_walk())?
> >
> > ...because the reported line number doesn't really make sense
> > as a cause of stalls.
>
> There shouldn't be any way for ORC to hit an infinite loop. Worst case
> it would stop after the caller's buffer fills up. ORC has always been
> solid, and the RCU usage looks fine to me. I tend to doubt ORC is at
> fault here.
>
> Maybe some interaction higher up the stack is causing things to run in a
> tight loop.
>
> All those debugging options (e.g., DEBUG_VM_MAPLE_TREE, LOCKDEP, KASAN,
> SLUB_RCU_DEBUG...) could be a factor in slowing things down to a crawl.

DEBUG_VM_MAPLE_TREE is super heavy, but that comes from validate_mm(),
which would usually be the last thing to happen before returning.

I mean, surely that would show up in the logs.

Okay, it's in the second log on the dashboard.

Yeah, I think it's the debug options eventually causing the failure.
Apparently there's a reproducer for syz now, but without the validate_mm().