lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20121114132940.GA13196@offline.be>
Date:	Wed, 14 Nov 2012 14:29:40 +0100
From:	Marc Duponcheel <marc@...line.be>
To:	Mel Gorman <mgorman@...e.de>
Cc:	David Rientjes <rientjes@...gle.com>,
	Andy Lutomirski <luto@...capital.net>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Marc Duponcheel <marc@...line.be>
Subject: Re: [3.6 regression?] THP + migration/compaction livelock (I think)

 Hi all

 If someone can provide the patches (or learn me how to get them with
git (I apologise to not be git savy)) then, this weekend, I can apply
them to 3.6.6 and compare before/after to check if they fix #49361.

 Thanks

On 2012 Nov 14, Mel Gorman wrote:
> On Tue, Nov 13, 2012 at 03:41:02PM -0800, David Rientjes wrote:
> > On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> > 
> > > It just happened again.
> > > 
> > > $ grep -E "compact_|thp_" /proc/vmstat
> > > compact_blocks_moved 8332448774
> > > compact_pages_moved 21831286
> > > compact_pagemigrate_failed 211260
> > > compact_stall 13484
> > > compact_fail 6717
> > > compact_success 6755
> > > thp_fault_alloc 150665
> > > thp_fault_fallback 4270
> > > thp_collapse_alloc 19771
> > > thp_collapse_alloc_failed 2188
> > > thp_split 19600
> > > 
> > 
> > Two of the patches from the list provided at
> > http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
> > kernel:
> > 
> > 	mm: compaction: abort compaction loop if lock is contended or run too long
> > 	mm: compaction: acquire the zone->lock as late as possible
> > 
> > and all have not made it to the 3.6 stable kernel yet, so would it be 
> > possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
> > indicate that the entire series is a candidate to backport to 3.6.
> 
> Thanks David once again.
> 
> The full list of compaction-related patches I believe are necessary for
> this particular problem are
> 
> e64c5237cf6ff474cb2f3f832f48f2b441dd9979 mm: compaction: abort compaction loop if lock is contended or run too long
> 3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 mm: compaction: move fatal signal check out of compact_checklock_irqsave
> 661c4cb9b829110cb68c18ea05a56be39f75a4d2 mm: compaction: Update try_to_compact_pages()kerneldoc comment
> 2a1402aa044b55c2d30ab0ed9405693ef06fb07c mm: compaction: acquire the zone->lru_lock as late as possible
> f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e mm: compaction: acquire the zone->lock as late as possible
> 753341a4b85ff337487b9959c71c529f522004f4 revert "mm: have order > 0 compaction start off where it left"
> bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 mm: compaction: cache if a pageblock was scanned and no pages were isolated
> c89511ab2f8fe2b47585e60da8af7fd213ec877e mm: compaction: Restart compaction from near where it left off
> 62997027ca5b3d4618198ed8b1aba40b61b1137b mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> 0db63d7e25f96e2c6da925c002badf6f144ddf30 mm: compaction: correct the nr_strict va isolated check for CMA
> 
> If we can get confirmation that these fix the problem in 3.6 kernels then
> I can backport them to -stable. This fixing a problem where "many processes
> stall, all in an isolation-related function". This started happening after
> lumpy reclaim was removed because we depended on that to aggressively
> reclaim with less compaction. Now compaction is depended upon more.
> 
> The full 3.7-rc5 kernel has a different problem on top of this and it's
> important the problems do not get conflacted. It has these fixes *but*
> GFP_NO_KSWAPD has been removed and there is a patch that scales reclaim
> with THP failures that is causing problem. With them, kswapd can get
> stuck in a 100% loop where it is neither reclaiming nor reaching its exit
> conditions. The correct fix would be to identify why this happens but I
> have not got around to it yet. To test with 3.7-rc5 then apply either
> 
> 1) https://lkml.org/lkml/2012/11/5/308
> 2) https://lkml.org/lkml/2012/11/12/113
> 
> or
> 
> 1) https://lkml.org/lkml/2012/11/5/308
> 3) https://lkml.org/lkml/2012/11/12/151
> 
> on top of 3.7-rc5. So it's a lot of work but there are three tests I'm
> interested in hearing about. The results of each determine what happens
> in -stable or mainline
> 
> Test 1: 3.6 + the last of commits above	(should fix processes stick in isolate)
> Test 2: 3.7-rc5 + (1+2) above (should fix kswapd stuck at 100%)
> Test 3: 3.7-rc5 + (1+3) above (should fix kswapd stuck at 100% but better)
> 
> Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@...line.be
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ