linux-ext4 - Re: [BUG] fatal hang untarring 90GB file, possibly writeback related.


Message-ID: <20110503091320.GA4542@novell.com>
Date:	Tue, 3 May 2011 10:13:20 +0100
From:	Mel Gorman <mgorman@...ell.com>
To:	James Bottomley <James.Bottomley@...e.de>
Cc:	Mel Gorman <mgorman@...e.de>, Jan Kara <jack@...e.cz>,
	colin.king@...onical.com, Chris Mason <chris.mason@...cle.com>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: [BUG] fatal hang untarring 90GB file, possibly writeback related.

On Thu, Apr 28, 2011 at 05:43:48PM -0500, James Bottomley wrote:
> On Thu, 2011-04-28 at 16:12 -0500, James Bottomley wrote:
> > On Thu, 2011-04-28 at 14:59 -0500, James Bottomley wrote:
> > > Actually, talking to Chris, I think I can get the system up using
> > > init=/bin/bash without systemd, so I can try the no cgroup config.
> > 
> > OK, so a non-PREEMPT non-CGROUP kernel has survived three back to back
> > runs of untar without locking or getting kswapd pegged, so I'm pretty
> > certain this is cgroups related.  The next steps are to turn cgroups
> > back on but try disabling the memory and IO controllers.
> 
> I tried non-PREEMPT CGROUP but disabled GROUP_MEM_RES_CTLR.
> 
> The results are curious:  the tar does complete (I've done three back to
> back).  However, I did get one soft lockup in kswapd (below).  But the
> system recovers instead of halting I/O and hanging like it did
> previously.
> 
> The soft lockup is in shrink_slab, so perhaps it's a combination of slab
> shrinker and cgroup memory controller issues?
> 

So, kswapd is still looping in reclaim and spending a lot of time in
shrink_slab but it must not be the shrinker itself or that debug patch
would have triggered. It's curious that cgroups are involved with
systemd considering that one would expect those groups to be fairly
small. I still don't have a new theory, but I will get hold of a Fedora 15
install CD and see whether I can reproduce it locally.

One last thing, what is the value of /proc/sys/vm/zone_reclaim_mode? Two
of the reporting machines could be NUMA and if that proc file reads as
1, I'd be interested in hearing the results of a test with it set to 0.
Thanks.
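
For anyone who wants to script the check above, here is a minimal sketch in
Python. It assumes only that /proc/sys/vm/zone_reclaim_mode exists and holds a
single integer. The helper names read_zone_reclaim_mode and
disable_zone_reclaim are hypothetical, made up for this illustration, and
writing to the real file needs root.

```python
from pathlib import Path

# The sysctl file named in the message above.
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current zone_reclaim_mode value as an integer."""
    return int(Path(path).read_text().strip())


def disable_zone_reclaim(path=ZONE_RECLAIM_PATH):
    """Write 0 to zone_reclaim_mode (hypothetical helper; needs root on a real system)."""
    Path(path).write_text("0\n")


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    print(f"zone_reclaim_mode = {mode}")
    if mode != 0:
        # Non-zero means zone reclaim is enabled; the request is to retest with 0.
        print("zone reclaim is enabled; set it to 0 and rerun the untar test")
```

The path is a parameter so the helpers can be tried against an ordinary file
before touching the live setting. A value written this way does not persist
across a reboot.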
