Message-Id: <20161201011950.GX3924@linux.vnet.ibm.com>
Date:   Wed, 30 Nov 2016 17:19:50 -0800
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Guenter Roeck <linux@...ck-us.net>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        sparclinux@...r.kernel.org, davem@...emloft.net
Subject: Re: next: Commit 'mm: Prevent __alloc_pages_nodemask() RCU CPU stall
 ...' causing hang on sparc32 qemu

On Wed, Nov 30, 2016 at 03:18:46PM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 01:01:52PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> > > On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > > > >>>>Hi Paul,
> > > > > >>>>
> > > > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > > > 
> > > > And I have dropped this patch.  Michal Hocko showed me the error of
> > > > my ways with this patch.
> > > > 
> > > 
> > > :-)
> > > 
> > > On another note, I still get RCU tracebacks in the s390 tests.
> > > 
> > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> > > 
> > > That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> > > if I recall correctly we had discussed that earlier.
> > 
> > Indeed, I had missed a dyntick counter update back on Nov 11, which meant
> > that some of the code was still looking at the low-order bit instead of
> > the next bit up.  This is now fixed.
> > 
> > So to get to the error message you call out above, I need to have improperly
> > left the system in bh state or left irqs disabled, while the system was
> > running normally without an oops.  I am having a hard time seeing how this
> > patch can do that.
> > 
> > I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
> > of cond_resched_rcu_qs()").
> > 
> > So you bisected or did a revert to work out which was the offending commit?
> > 
> 
> My most recent bisect was with the November 10 image, so that would have missed
> any later fix. Comparing the log messages, the current message is indeed
> different. Sorry, I mixed that up; I just assumed that the problem would be
> the same without really checking. My bad.
> 
> Bisect would be tricky, since the s390 image was broken for some time after
> November 10. The first time I have seen the above BUG: was with next-20161128
> (which is the first build after the crash was fixed). That version did not
> include f2a471ffc8a8, so that can not be the cause.
> 
> I'll try to set up a bisect tonight, working around the crash problem.
> I'll let you know how it goes.

Whew!  You had me going for a bit there.  ;-)

							Thanx, Paul