lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 11 Jul 2023 09:41:58 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Frederic Weisbecker <frederic@...nel.org>
Cc:     kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
        lkp@...el.com, linux-kernel@...r.kernel.org, rcu@...r.kernel.org
Subject: Re: [linus:master] [rcu/nocb]  7625926086:
 WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass

On Tue, Jul 11, 2023 at 06:06:11PM +0200, Frederic Weisbecker wrote:
> On Tue, Jul 11, 2023 at 04:48:06PM +0800, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed "WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass" on:
> > 
> > commit: 7625926086765123251f765d91fc3a70617d334d ("rcu/nocb: Fix shrinker race against callback enqueuer")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > 
> > [test failed on linus/master Linux 6.5-rc1]
> > [test failed on linux-next/master fe57d0d86f03a8b2afe2869a95477d0ed1824c96]
> > 
> > in testcase: trinity
> > version: trinity-i386-abe9de86-1_20230429
> > with following parameters:
> > 
> > 	runtime: 300s
> > 	group: group-00
> > 
> > test-description: Trinity is a linux system call fuzz tester.
> > test-url: http://codemonkey.org.uk/projects/trinity/
> > 
> > 
> > compiler: gcc-12
> > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > 
> > (please refer to attached dmesg/kmsg for entire log/backtrace):
> > 
> > 
> > the WARNING only happens in a low rate (10 out of 204 runs as below).
> > but since we never observed the issue on parent, and we also noticed
> > there is a WARN_ON_ONCE in code change, so just report what we observed
> > in our tests FYI.
> > 
> > 5c83cedbaaad6dfe 7625926086765123251f765d91f
> > ---------------- ---------------------------
> >        fail:runs  %reproduction    fail:runs
> >            |             |             |
> >            :202          5%          10:204   dmesg.RIP:rcu_nocb_bypass_lock
> >            :202          5%          10:204   dmesg.RIP:rcu_nocb_try_bypass
> >            :202          5%          10:204   dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_bypass_lock
> >            :202          5%          10:204   dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@...el.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202307111535.4823c255-oliver.sang@intel.com
> > 
> > 
> > 
> > [ 315.569574][ T16] WARNING: CPU: 1 PID: 16 at kernel/rcu/tree_nocb.h:124
> > rcu_nocb_try_bypass (kernel/rcu/tree_nocb.h:124 kernel/rcu/tree_nocb.h:513)
> 
> It's:
> 
> static void rcu_nocb_wait_contended(struct rcu_data *rdp)
> {
> 	WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
> 	while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))) <<< HERE
> 		cpu_relax();
> }
> 
> This is called on enqueue time by rcu_nocb_try_bypass().
> 
> The places that can increment ->nocb_lock_contended are:
> 
> * rcu_nocb_try_bypass() itself, but it's not re-entrant
> 
> * rcu_nocb_rdp_deoffload(), but it executes on the local CPU with
>   IRQs disabled, so rcu_nocb_try_bypass() can't race
>   
> * lazy_rcu_shrink_scan(): that's indeed the issue here
> 
> * rcu_barrier_entrain(), executed either locally with IRQs disabled
>   or remotely when the target is dead, so can't race with rcu_nocb_try_bypass()
>   
> * rcutree_migrate_callback(): locally with IRQs disabled, can't race
> 
> So indeed lazy_rcu_shrink_scan() is the only possible culprit. But then
> I wonder, what is the purpose of this ->nocb_lock_contended if it's actually
> not an allowed behaviour?
> 
> Paul, any recollection of that?

Heh!

The purpose was to see if this lock was ever contended.  I guess we now
have an answer, which is "Yes, but rarely."

It looks like the victim commit increased the size of the ->nocb_lock
critical section, just enough to make this happen sometimes.

Removing the WARN_ON_ONCE() seems appropriate, especially given that
this only happens when shrinking.

Should we add something that monitors that lock's contention?  It is
often the case that lock contention rises over time as new features and
optimizations are added.

							Thanx, Paul

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ