Message-ID: <ZK1983kQCrN+zCqs@lothringen>
Date:   Tue, 11 Jul 2023 18:06:11 +0200
From:   Frederic Weisbecker <frederic@...nel.org>
To:     kernel test robot <oliver.sang@...el.com>,
        "Paul E. McKenney" <paulmck@...nel.org>
Cc:     oe-lkp@...ts.linux.dev, lkp@...el.com,
        linux-kernel@...r.kernel.org, rcu@...r.kernel.org
Subject: Re: [linus:master] [rcu/nocb]  7625926086:
 WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass

On Tue, Jul 11, 2023 at 04:48:06PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass" on:
> 
> commit: 7625926086765123251f765d91fc3a70617d334d ("rcu/nocb: Fix shrinker race against callback enqueuer")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> [test failed on linus/master Linux 6.5-rc1]
> [test failed on linux-next/master fe57d0d86f03a8b2afe2869a95477d0ed1824c96]
> 
> in testcase: trinity
> version: trinity-i386-abe9de86-1_20230429
> with following parameters:
> 
> 	runtime: 300s
> 	group: group-00
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
> 
> 
> compiler: gcc-12
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> the WARNING only happens in a low rate (10 out of 204 runs as below).
> but since we never observed the issue on parent, and we also noticed
> there is a WARN_ON_ONCE in code change, so just report what we observed
> in our tests FYI.
> 
> 5c83cedbaaad6dfe 7625926086765123251f765d91f
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :202          5%          10:204   dmesg.RIP:rcu_nocb_bypass_lock
>            :202          5%          10:204   dmesg.RIP:rcu_nocb_try_bypass
>            :202          5%          10:204   dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_bypass_lock
>            :202          5%          10:204   dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@...el.com>
> | Closes: https://lore.kernel.org/oe-lkp/202307111535.4823c255-oliver.sang@intel.com
> 
> 
> 
> [ 315.569574][ T16] WARNING: CPU: 1 PID: 16 at kernel/rcu/tree_nocb.h:124
> rcu_nocb_try_bypass (kernel/rcu/tree_nocb.h:124 kernel/rcu/tree_nocb.h:513)

It's:

static void rcu_nocb_wait_contended(struct rcu_data *rdp)
{
	WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
	while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))) <<< HERE
		cpu_relax();
}

This is called at enqueue time by rcu_nocb_try_bypass().
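
For reference, here is a rough user-space model of how the counter checked
above gets raised. It is only a sketch: it assumes that
rcu_nocb_bypass_lock() bumps ->nocb_lock_contended when its trylock fails and
drops it again once the lock is acquired. All model_* names are hypothetical,
C11 atomics stand in for the kernel primitives, and the slow path is split in
two so that the window can be observed from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two rcu_data fields of interest. */
struct model_rdp {
	atomic_flag bypass_lock;	/* models ->nocb_bypass_lock */
	atomic_int lock_contended;	/* models ->nocb_lock_contended */
};

static void model_init(struct model_rdp *rdp)
{
	atomic_flag_clear(&rdp->bypass_lock);
	atomic_init(&rdp->lock_contended, 0);
}

static bool model_trylock(struct model_rdp *rdp)
{
	return !atomic_flag_test_and_set_explicit(&rdp->bypass_lock,
						  memory_order_acquire);
}

static void model_unlock(struct model_rdp *rdp)
{
	atomic_flag_clear_explicit(&rdp->bypass_lock, memory_order_release);
}

/* First half of the lock: fast path, else advertise contention. */
static bool model_lock_begin(struct model_rdp *rdp)
{
	if (model_trylock(rdp))
		return true;	/* lock taken, counter untouched */
	atomic_fetch_add(&rdp->lock_contended, 1);
	return false;		/* caller must finish the slow path */
}

/* Second half: spin for the lock, then withdraw the advertisement. */
static void model_lock_finish(struct model_rdp *rdp)
{
	while (!model_trylock(rdp))
		;
	atomic_fetch_sub(&rdp->lock_contended, 1);
}

/* The condition that rcu_nocb_wait_contended() warns on. */
static bool model_would_warn(struct model_rdp *rdp)
{
	return atomic_load(&rdp->lock_contended) != 0;
}
```

In this model, any locker that loses the trylock leaves the counter non-zero
until it finally gets the lock, and an enqueuer calling the equivalent of
rcu_nocb_wait_contended() inside that window would trip the warning.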

The places that can increment ->nocb_lock_contended are:

* rcu_nocb_try_bypass() itself, but it's not re-entrant

* rcu_nocb_rdp_deoffload(), but it executes on the local CPU with
  IRQs disabled, so rcu_nocb_try_bypass() can't race
  
* lazy_rcu_shrink_scan(): that's indeed the issue here

* rcu_barrier_entrain(), executed either locally with IRQs disabled
  or remotely when the target is dead, so can't race with rcu_nocb_try_bypass()
  
* rcutree_migrate_callbacks(): runs locally with IRQs disabled, can't race

So indeed lazy_rcu_shrink_scan() is the only possible culprit. But then
I wonder: what is the purpose of ->nocb_lock_contended if contention is
not actually an allowed behaviour?

Paul, any recollection of that?

Thanks.