Date:	Thu, 23 Oct 2014 13:05:07 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Yanko Kaneti <yaneti@...lera.com>
Cc:	Josh Boyer <jwboyer@...oraproject.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Cong Wang <cwang@...pensource.com>,
	Kevin Fenzi <kevin@...ye.com>, netdev <netdev@...r.kernel.org>,
	"Linux-Kernel@...r. Kernel. Org" <linux-kernel@...r.kernel.org>
Subject: Re: localed stuck in recent 3.18 git in copy_net_ns?

On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote:
> On Thu-10/23/14-2014 08:33, Paul E. McKenney wrote:
> > On Thu, Oct 23, 2014 at 05:27:50AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 23, 2014 at 09:09:26AM +0300, Yanko Kaneti wrote:
> > > > On Wed, 2014-10-22 at 16:24 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Oct 23, 2014 at 01:40:32AM +0300, Yanko Kaneti wrote:
> > > > > > On Wed-10/22/14-2014 15:33, Josh Boyer wrote:
> > > > > > > On Wed, Oct 22, 2014 at 2:55 PM, Paul E. McKenney
> > > > > > > <paulmck@...ux.vnet.ibm.com> wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > > Don't get me wrong -- the fact that this kthread appears to have
> > > > > > > > > blocked within rcu_barrier() for 120 seconds means that something is
> > > > > > > > > most definitely wrong here.  I am surprised that there are no RCU CPU
> > > > > > > > > stall warnings, but perhaps the blockage is in the callback execution
> > > > > > > > > rather than grace-period completion.  Or something is preventing this
> > > > > > > > > kthread from starting up after the wake-up callback executes.  Or...
> > > > > > > > 
> > > > > > > > Is this thing reproducible?
> > > > > > > 
> > > > > > > I've added Yanko on CC, who reported the backtrace above and can
> > > > > > > recreate it reliably.  Apparently reverting the RCU merge commit
> > > > > > > (d6dd50e) and rebuilding the latest after that does not show the
> > > > > > > > issue.  I'll let Yanko explain more and answer any questions you have.
> > > > > > 
> > > > > > - It is reproducible
> > > > > > > - I've done another build here to double check and it's definitely
> > > > > > >   the rcu merge that's causing it.
> > > > > > 
> > > > > > > Don't think I'll be able to dig deeper, but I can do testing if needed.
> > > > > 
> > > > > Please!  Does the following patch help?
> > > > 
> > > > > Nope, doesn't seem to make a difference to the modprobe ppp_generic test
> > > 
> > > Well, I was hoping.  I will take a closer look at the RCU merge commit
> > > and see what suggests itself.  I am likely to ask you to revert specific
> > > commits, if that works for you.
> > 
> > Well, rather than reverting commits, could you please try testing the
> > following commits?
> > 
> > 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks after spawning)
> > 
> > 73a860cd58a1 (rcu: Replace flush_signals() with WARN_ON(signal_pending()))
> > 
> > c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())
> > 
> > 	For whatever it is worth, I am guessing this one.
> 
> Indeed, c847f14217d5 it is.
> 
> Much to my embarrassment I just noticed that in addition to the
> rcu merge, triggering the bug "requires" my specific Fedora rawhide network
> setup. Booting in single mode and modprobe ppp_generic is fine. The bug
> appears when starting with my regular Fedora network setup, which in my case
> includes 3 ethernet adapters and a libvirt bridge+nat setup.
> 
> Hope that helps. 
> 
> I am attaching the config.

It does help a lot, thank you!!!

The following patch is a bit of a shot in the dark, and assumes that
commit 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled idle
code) introduced the problem.  Does this patch fix things up?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..4f3d25a58786 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
 	case CPU_DEAD_FROZEN:
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
-		for_each_rcu_flavor(rsp)
+		for_each_rcu_flavor(rsp) {
 			rcu_cleanup_dead_cpu(cpu, rsp);
+			do_nocb_deferred_wakeup(this_cpu_ptr(rsp->rda));
+		}
 		break;
 	default:
 		break;
