Message-ID: <f0fdd57e-50ba-21c1-a1d8-18cf38d0ac2f@windriver.com>
Date: Sun, 16 May 2021 20:33:22 +0800
From: "Xu, Yanfei" <yanfei.xu@...driver.com>
To: paulmck@...nel.org, josh@...htriplett.org, rostedt@...dmis.org,
mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
joel@...lfernandes.org
Cc: rcu@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] rcu: fix a deadlock caused by not releasing rcu_node->lock
Hi Paul,
Should I merge this patch and the previous one into one? If needed,
please tell me and I will do it. :)
In addition, before these two patches the bug manifested as
"BUG: scheduling while atomic:", because the preempt_count stays
elevated after the tick irq when the rcu_node->lock is not released.
(A userspace sketch of the lock leak follows the trace below.)
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-11):
(detected by 3, t=6504 jiffies, g=34033, q=10745911)
rcu: All QSes seen, last rcu_preempt kthread activity 28 (4295088530-4295088502), jiffies_till_next_fqs=1, root ->qsmask 0x1
BUG: scheduling while atomic: msgstress04/90186/0x00000002
INFO: lockdep is turned off.
Modules linked in: sch_fq_codel
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffff80001004d57c>] copy_process+0x678/0x2790
softirqs last enabled at (0): [<ffff80001004d57c>] copy_process+0x678/0x2790
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffff800010402744>] find_and_remove_object+0x34/0xd0
CPU: 3 PID: 90186 Comm: msgstress04 Kdump: loaded Not tainted 5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
__schedule_bug+0x100/0x114
__schedule+0xe5c/0xfd4
schedule+0x70/0x16c
do_notify_resume+0xe4/0x19d0
work_pending+0xc/0x2a8
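
To make the failure mode concrete, here is a minimal userspace analogy
in plain C with pthreads (hypothetical names, not kernel code): a helper
that is expected to unlock on every return path leaks the lock on its
early return, so the caller's next acquisition spins forever, just as
rcu_dump_cpu_stacks() spins on the still-held rcu_node->lock. In the
kernel, the leaked raw spinlock additionally keeps the preempt_count
elevated, which is what produces the splat above.

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;

/* Contract: caller holds 'lock'; we must release it on every path. */
static int helper(int blocked_readers)
{
	if (!blocked_readers)
		return 0;		/* BUG: early return leaks 'lock'. */
	/* ... report the blocked readers ... */
	pthread_spin_unlock(&lock);
	return 1;
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	pthread_spin_lock(&lock);
	helper(0);			/* Takes the buggy early-return path. */

	printf("re-acquiring the lock...\n");
	pthread_spin_lock(&lock);	/* Spins forever: self-deadlock. */
	printf("never reached\n");
	return 0;
}

Built with "gcc -pthread", this hangs at the second pthread_spin_lock(),
mirroring the stall shown in the trace.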
Regards,
Yanfei
On 5/16/21 5:50 PM, yanfei.xu@...driver.com wrote:
> From: Yanfei Xu <yanfei.xu@...driver.com>
>
> rcu_node->lock isn't released in rcu_print_task_stall() if the rcu_node
> doesn't contain tasks blocking the GP. However, this rcu_node->lock
> will be acquired again in rcu_dump_cpu_stacks() soon afterwards when
> ndetected is non-zero. As a result, the CPU will hang in this deadlock.
>
> Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> Signed-off-by: Yanfei Xu <yanfei.xu@...driver.com>
> ---
> v1->v2:
> 1. Change the lock function to the unlock function.
> 2. Add a Fixes tag.
>
> kernel/rcu/tree_stall.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index b72311d24a9f..b09a7140ef77 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -267,8 +267,10 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
> struct task_struct *ts[8];
>
> lockdep_assert_irqs_disabled();
> - if (!rcu_preempt_blocked_readers_cgp(rnp))
> + if (!rcu_preempt_blocked_readers_cgp(rnp)) {
> + raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> return 0;
> + }
> pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
> rnp->level, rnp->grplo, rnp->grphi);
> t = list_entry(rnp->gp_tasks->prev,
>
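
For reference, the locking contract that the early return violated looks
roughly like this on the caller side (paraphrased and heavily elided from
kernel/rcu/tree_stall.h around v5.12; not a compilable excerpt):

static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
{
	struct rcu_node *rnp;
	unsigned long flags;
	int ndetected = 0;

	rcu_for_each_leaf_node(rnp) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		/* ... count stalled CPUs in rnp->qsmask ... */
		/* Expected to release rnp->lock on every path: */
		ndetected += rcu_print_task_stall(rnp, flags);
	}
	/* ... */
	if (ndetected)
		/* Re-acquires each rnp->lock; spins forever if one
		 * was leaked by the early return fixed above. */
		rcu_dump_cpu_stacks();
}

With the patch applied, the !rcu_preempt_blocked_readers_cgp() path drops
rnp->lock before returning, so the contract holds on every path.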