lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 9 Nov 2014 09:27:46 +0100
From:	Pavel Machek <pavel@....cz>
To:	Vlastimil Babka <vbabka@...e.cz>
Cc:	"P. Christeas" <xrg@...ux.gr>, linux-mm@...ck.org,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	lkml <linux-kernel@...r.kernel.org>,
	David Rientjes <rientjes@...gle.com>,
	Norbert Preining <preining@...ic.at>,
	Markus Trippelsdorf <markus@...ppelsdorf.de>
Subject: Re: Early test: hangs in mm/compact.c w. Linus's 12d7aacab56e9ef185c

Hi!

> >> Oh and did I ask in this thread for /proc/zoneinfo yet? :)
> > 
> > Using that same kernel[1], got again into a race, gathered a few more data.
> > 
> > This time, I had 1x "urpmq" process [2] hung at 100% CPU , when "kwin" got 
> > apparently blocked (100% CPU, too) trying to resize a GUI window. I suppose 
> > the resizing operation would mean heavy memory alloc/free.
> > 
> > The rest of the system was responsive, I could easily get a console, login, 
> > gather the files.. Then, I have *killed* -9 the "urpmq" process, which solved 
> > the race and my system is still alive! "kwin" is still running, returned to 
> > regular CPU load.
> > 
> > Attached is traces from SysRq+l (pressed a few times, wanted to "snapshot" the 
> > stack) and /proc/zoneinfo + /proc/vmstat
> > 
> > Bisection is not yet meaningful, IMHO, because I cannot be sure that "good" 
> > points are really free from this issue. I'd estimate that each test would take 
> > +3days, unless I really find a deterministic way to reproduce the issue .
> 
> Hi,
> 
> I think I finally found the cause by staring into the code... CCing
> people from all 4 separate threads I know about this issue.
> The problem with finding the cause was that the first report I got from
> Markus was about isolate_freepages_block() overhead, and later Norbert
> reported that reverting a patch for isolate_freepages* helped. But the
> problem seems to be that although the loop in isolate_migratepages exits
> because the scanners almost meet (they are within same pageblock), they
> don't truly meet, therefore compact_finished() decides to continue, but
> isolate_migratepages() exits immediately... boom! But indeed e14c720efdd7
> made this situation possible, as free scaner pfn can now point to a
> middle of pageblock.

Ok, it seems it happened second time now, again shortly after
resume. I guess I should apply your patch after all.

(Or... instead it should go to Linus ASAP -- it fixes known problem
that is affected people, and we want it in soon in case it is not
complete fix.)

Dmesg is in the attachment, perhaps it helps.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Download attachment "delme.gz" of type "application/gzip" (18436 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ