linux-kernel - Re: kdump always hangs in rcu_barrier() -> wait_for


Message-ID: <20201126214226.GS1437@paulmck-ThinkPad-P72>
Date:   Thu, 26 Nov 2020 13:42:26 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Dexuan Cui <decui@...rosoft.com>
Cc:     "boqun.feng@...il.com" <boqun.feng@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        "rcu@...r.kernel.org" <rcu@...r.kernel.org>,
        vkuznets <vkuznets@...hat.com>,
        Michael Kelley <mikelley@...rosoft.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: kdump always hangs in rcu_barrier() -> wait_for_completion()

On Thu, Nov 26, 2020 at 09:25:28PM +0000, Dexuan Cui wrote:
> > From: Paul E. McKenney <paulmck@...nel.org>
> > Sent: Thursday, November 26, 2020 7:47 AM
> >  ...
> > The rcu_segcblist_n_cbs() function returns non-zero because something
> > invoked call_rcu() some time previously.  The ftrace facility (or just
> > a printk) should help you work out where that call_rcu() is located.
> 
> call_rcu() is indeed called multiple times, but as you said, this should
> be normal.

Good to know, thank you!

> > My best guess is that the underlying bug is that you are invoking
> > rcu_barrier() before the RCU grace-period kthread has been created.
> > This means that RCU grace periods cannot complete, which in turn means
> > that if there has been even one invocation of call_rcu() since boot,
> > rcu_barrier() cannot complete, which is what you are in fact seeing.
> > Please note that it is perfectly legal to invoke call_rcu() very early in
> > the boot process, as in even before the call to rcu_init().  Therefore,
> > if this is the case, the bug is the early call to rcu_barrier(), not
> > the early calls to call_rcu().
> >
> > To check this, at the beginning of rcu_barrier(), check the value of
> > rcu_state.gp_kthread.  If my guess is correct, it will be NULL.
> 
> Unluckily, it's not NULL here. :-)

You can't have everything!  ;-)
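For anyone following along, the reasoning behind that check can be
written down as a small user-space model. This is only a sketch: the
struct and enum names below are made up for illustration and are not
the kernel's definitions, and the "no callbacks" case is simplified.
In the kernel itself the check would be a printk of
rcu_state.gp_kthread at the top of rcu_barrier().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the kernel's real types. */
struct task_struct_model { int pid; };

struct rcu_state_model {
	struct task_struct_model *gp_kthread; /* NULL until the GP kthread exists */
	long n_cbs;                           /* callbacks queued by call_rcu() */
};

enum barrier_verdict {
	BARRIER_NOTHING_QUEUED, /* no callbacks, so nothing to wait for */
	BARRIER_TOO_EARLY,      /* callbacks queued, but no GP kthread yet */
	BARRIER_LOOK_ELSEWHERE  /* kthread exists; something else stalls GPs */
};

/* Classify a hung rcu_barrier() following the reasoning in this thread. */
static enum barrier_verdict classify_barrier(const struct rcu_state_model *rs)
{
	assert(rs != NULL);
	if (rs->n_cbs == 0)
		return BARRIER_NOTHING_QUEUED;
	if (rs->gp_kthread == NULL)
		return BARRIER_TOO_EARLY;
	return BARRIER_LOOK_ELSEWHERE;
}
```

Dexuan's report (callbacks queued, gp_kthread non-NULL) lands in the
last case, which is why the rest of this message looks at scheduling
and interrupts instead.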

> > Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> > something else is preventing RCU grace periods from completing, but in
> 
> It looks like somehow the scheduling is not working here: in
> rcu_barrier(), if I replace the wait_for_completion() with
> wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), the
> issue persists.

Have you tried using sysrq-t to see what the various tasks are doing?

One way that this can happen is if whatever task is currently running
has managed to enter a long loop with interrupts disabled.
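If a shell is still reachable, the SysRq-t task dump can also be
requested by writing 't' to /proc/sysrq-trigger as root. A minimal
sketch follows; the helper name trigger_sysrq() is made up for
illustration, and in a kdump kernel that hangs this early there may be
no user space to run it from, in which case the console SysRq key
sequence is the only option.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a single SysRq command character to the given trigger file.
 * On a real system the path would be "/proc/sysrq-trigger" and the
 * caller needs root. Returns 0 on success, -1 on failure.
 */
static int trigger_sysrq(const char *path, char key)
{
	assert(path != NULL);

	FILE *f = fopen(path, "w");
	if (f == NULL)
		return -1;

	int ok = (fputc(key, f) != EOF);
	if (fclose(f) != 0)	/* fclose() flushes, so write errors show up here */
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling trigger_sysrq("/proc/sysrq-trigger", 't') should cause the
kernel to dump the state of all tasks to the kernel log, which dmesg
or the serial console will then show.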

> > that case you should see RCU CPU stall warnings.  Unless of course they
> > have been disabled.
> > 							Thanx, Paul
> 
> I guess I didn't disable the warnings (I don't even know how to do that :)

Having interrupts disabled on all CPUs would have the effect of disabling
the RCU CPU stall warnings.

The intended method is in Documentation/admin-guide/kernel-parameters.txt.
Search for rcu_cpu_stall_suppress.  Not that it seems important at this
point.
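For completeness, the current setting can be read back at run time.
The sketch below assumes the parameter is exposed at
/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress, which is where
I would expect it on a kernel of this vintage; the helper name
read_int_param() is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer from a sysfs-style parameter file.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int read_int_param(const char *path, int *out)
{
	assert(path != NULL && out != NULL);

	FILE *f = fopen(path, "r");
	if (f == NULL)
		return -1;

	int value;
	int matched = fscanf(f, "%d", &value);
	fclose(f);

	if (matched != 1)
		return -1;

	*out = value;
	return 0;
}
```

A non-zero value read from that file would mean that stall warnings
have been suppressed. On the boot command line, the corresponding
setting is rcupdate.rcu_cpu_stall_suppress=1.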

							Thanx, Paul

> grep RCU .config
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_NOCB_CPU=y
> # end of RCU Subsystem
> CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
> # RCU Debugging
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=30
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y
> # end of RCU Debugging
> 
> Thanks,
> -- Dexuan
>