Message-ID: <20110503091320.GA4542@novell.com>
Date:	Tue, 3 May 2011 10:13:20 +0100
From:	Mel Gorman <mgorman@...ell.com>
To:	James Bottomley <James.Bottomley@...e.de>
Cc:	Mel Gorman <mgorman@...e.de>, Jan Kara <jack@...e.cz>,
	colin.king@...onical.com, Chris Mason <chris.mason@...cle.com>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: [BUG] fatal hang untarring 90GB file, possibly writeback related.

On Thu, Apr 28, 2011 at 05:43:48PM -0500, James Bottomley wrote:
> On Thu, 2011-04-28 at 16:12 -0500, James Bottomley wrote:
> > On Thu, 2011-04-28 at 14:59 -0500, James Bottomley wrote:
> > > Actually, talking to Chris, I think I can get the system up using
> > > init=/bin/bash without systemd, so I can try the no cgroup config.
> > 
> > OK, so a non-PREEMPT non-CGROUP kernel has survived three back to back
> > runs of untar without locking or getting kswapd pegged, so I'm pretty
> > certain this is cgroups related.  The next steps are to turn cgroups
> > back on but try disabling the memory and IO controllers.
> 
> I tried non-PREEMPT CGROUP but disabled GROUP_MEM_RES_CTLR.
> 
> The results are curious:  the tar does complete (I've done three back to
> back).  However, I did get one soft lockup in kswapd (below).  But the
> system recovers instead of halting I/O and hanging like it did
> previously.
> 
> The soft lockup is in shrink_slab, so perhaps it's a combination of slab
> shrinker and cgroup memory controller issues?
> 

So, kswapd is still looping in reclaim and spending a lot of time in
shrink_slab, but it cannot be the shrinker itself or that debug patch
would have triggered. It's curious that cgroups are involved with
systemd, considering that one would expect those groups to be fairly
small. I still don't have a new theory, but I will get hold of a Fedora
15 install CD and see if I can reproduce it locally.

One last thing, what is the value of /proc/sys/vm/zone_reclaim_mode? Two
of the reporting machines could be NUMA and if that proc file reads as
1, I'd be interested in hearing the results of a test with it set to 0.
Thanks.
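
For anyone checking this on their own machine, the following is a
minimal sketch of reading that proc file. The helper names
(read_zone_reclaim_mode, describe) are made up for illustration, and
treating a missing file as "kernel built without NUMA support" is an
assumption rather than something established in this thread.

```python
from pathlib import Path

# The proc file named above. It may be absent on non-NUMA kernels.
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the integer stored in the file, or None if it is absent."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def describe(mode):
    """Turn the raw value into a short human-readable status line."""
    if mode is None:
        # Assumption: no file usually means the kernel lacks NUMA support.
        return "zone_reclaim_mode not present"
    if mode == 0:
        return "zone reclaim is off"
    return f"zone reclaim is on (mode={mode}); retest with it set to 0"


if __name__ == "__main__":
    print(describe(read_zone_reclaim_mode()))
```

To run the requested test, the value can be set to 0 as root with
"echo 0 > /proc/sys/vm/zone_reclaim_mode"; the setting does not persist
across a reboot.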

