linux-kernel - Re: INFO: rcu detected stall in smp_call

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200825132411.GR35926@hirez.programming.kicks-ass.net>
Date:   Tue, 25 Aug 2020 15:24:11 +0200
From:   peterz@...radead.org
To:     syzbot <syzbot+cb3b69ae80afd6535b0e@...kaller.appspotmail.com>
Cc:     fweisbec@...il.com, linux-kernel@...r.kernel.org, mingo@...nel.org,
        syzkaller-bugs@...glegroups.com, tglx@...utronix.de,
        Paul McKenney <paulmck@...nel.org>
Subject: Re: INFO: rcu detected stall in smp_call_function


+Cc Paul, who was weirdly forgotten last time

And one additional question below, which made me remember this thing.

On Wed, Jul 29, 2020 at 02:58:11PM +0200, peterz@...radead.org wrote:

> > rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-1):
> > ------------[ cut here ]------------
> > IRQs not enabled as expected
> > WARNING: CPU: 0 PID: 32297 at kernel/sched/core.c:2701 try_invoke_on_locked_down_task+0x18b/0x320 kernel/sched/core.c:2701
> > Kernel panic - not syncing: panic_on_warn set ...
> > CPU: 0 PID: 32297 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > Call Trace:
> >  <IRQ>
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1f0/0x31e lib/dump_stack.c:118
> >  panic+0x264/0x7a0 kernel/panic.c:231
> >  __warn+0x227/0x250 kernel/panic.c:600
> >  report_bug+0x1b1/0x2e0 lib/bug.c:198
> >  handle_bug+0x42/0x80 arch/x86/kernel/traps.c:235
> >  exc_invalid_op+0x16/0x40 arch/x86/kernel/traps.c:255
> >  asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:540
> > RIP: 0010:try_invoke_on_locked_down_task+0x18b/0x320 kernel/sched/core.c:2701
> > Code: 48 89 df e8 f7 35 09 00 4c 89 f7 e8 df b5 cf 06 e9 b5 00 00 00 c6 05 34 82 38 08 01 48 c7 c7 8c d7 07 89 31 c0 e8 a5 a9 f5 ff <0f> 0b e9 15 ff ff ff 48 c7 c1 30 71 8d 89 80 e1 07 80 c1 03 38 c1
> > RSP: 0018:ffffc90000007c50 EFLAGS: 00010046
> > RAX: 1aaa08be6903c500 RBX: ffff888085d16ac8 RCX: ffff888085d16240
> > RDX: 0000000000010004 RSI: 0000000000010004 RDI: 0000000000000000
> > RBP: ffff888085d16b0c R08: ffffffff815dd389 R09: ffffed1015d041c3
> > R10: ffffed1015d041c3 R11: 0000000000000000 R12: 0000000000000000
> > R13: ffff8880a8bac140 R14: ffff8880a8bac4c0 R15: dffffc0000000000
> >  rcu_print_task_stall kernel/rcu/tree_stall.h:269 [inline]
> >  print_other_cpu_stall kernel/rcu/tree_stall.h:477 [inline]
> 
> Ha, that calls it with IRQs already disabled,
> 
> So I'm thinking we want something like so?
> 
> ---
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2142c6767682..3182caf14844 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2694,12 +2694,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   */
>  bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
>  {
> -	bool ret = false;
>  	struct rq_flags rf;
> +	bool ret = false;
>  	struct rq *rq;
>  
> -	lockdep_assert_irqs_enabled();
> -	raw_spin_lock_irq(&p->pi_lock);
> +	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>  	if (p->on_rq) {
>  		rq = __task_rq_lock(p, &rf);
>  		if (task_rq(p) == rq)
> @@ -2716,7 +2715,7 @@ bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct t
>  				ret = func(p, arg);
>  		}
>  	}
> -	raw_spin_unlock_irq(&p->pi_lock);
> +	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>  	return ret;
>  }

Paul, I wanted to use this function, but found it has very weird
semantics.

Why do you need it to (remotely) call @func when p is current? The user
in rcu_print_task_stall() explicitly bails in this case, and the other
in rcu_wait_for_one_reader() will attempt an IPI.

Would it be possible to change this function to:

 - blocked task: call @func with p->pi_lock held
 - queued, !running task: call @func with rq->lock held
 - running task: fail.

?