linux-kernel - Re: [Bug #14141] order 2 page allocation failures in iwlagn

Message-Id: <200910262206.13146.elendil@planet.nl>
Date:	Mon, 26 Oct 2009 22:06:09 +0100
From:	Frans Pop <elendil@...net.nl>
To:	Mel Gorman <mel@....ul.ie>
Cc:	Chris Mason <chris.mason@...cle.com>,
	David Rientjes <rientjes@...gle.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Reinette Chatre <reinette.chatre@...el.com>,
	Bartlomiej Zolnierkiewicz <bzolnier@...il.com>,
	Karol Lewandowski <karol.k.lewandowski@...il.com>,
	Mohamed Abbas <mohamed.abbas@...el.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	"John W. Linville" <linville@...driver.com>, linux-mm@...ck.org
Subject: Re: [Bug #14141] order 2 page allocation failures in iwlagn

On Tuesday 20 October 2009, Mel Gorman wrote:
> I've attached a patch below that should allow us to cheat. When it's
> applied, it outputs who called congestion_wait(), how long the timeout
> was and how long it waited for. By comparing before and after sleep
> times, we should be able to see which of the callers has significantly
> changed and if it's something easily addressable.

The results from this look fairly interesting (although I may be a bad 
judge as I don't really know what I'm looking at ;-).

I've tested with two kernels:
1) 2.6.31.1: 1 test run
2) 2.6.31.1 + congestion_wait() reverts: 2 test runs

The 1st kernel had the expected "freeze" while reading commits in gitk;
reading commits with the 2nd kernel was smoother.
I did 2 runs with the 2nd kernel because the first run had a fairly long
music skip and more SKB errors than expected. The second run was fairly
normal, with no music skips at all even though it had a few SKB errors.

Data for the tests:
                          1st kernel   2nd kernel, run 1   2nd kernel, run 2
end reading commits       1:15         1:00                0:55
  "freeze"                yes          no                  no
branch data shown         1:55         1:15                1:10
system quiet              2:25         1:50                1:45
# SKB allocation errors   10           53                  5

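Note that the test finishes substantially faster with the 2nd kernel
(system quiet at 1:50 and 1:45 versus 2:25) and that the SKB errors don't
really affect the duration of the test: the two runs with the 2nd kernel
had 53 and 5 errors, yet nearly identical timings.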
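
To put a rough number on "substantially faster", here is a minimal Python
sketch that compares the "system quiet" times from the table above. Reading
the values as minutes:seconds is an assumption (the unit is not stated), and
the labels and function names are made up for illustration.

```python
# "system quiet" values copied from the table above.
# Assumption: they are minutes:seconds; the unit is not stated.
SYSTEM_QUIET = {
    "1st kernel": "2:25",
    "2nd kernel, run 1": "1:50",
    "2nd kernel, run 2": "1:45",
}


def to_seconds(stamp):
    """Convert an 'M:SS' string into a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def reduction_vs_baseline(times, baseline="1st kernel"):
    """Return the percentage by which each run is shorter than the baseline."""
    base = to_seconds(times[baseline])
    return {
        name: 100.0 * (base - to_seconds(stamp)) / base
        for name, stamp in times.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, pct in reduction_vs_baseline(SYSTEM_QUIET).items():
        print(f"{name}: {pct:.1f}% shorter than the 1st kernel")
```

The same helper can be pointed at the other timing rows of the table by
swapping in their values.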

Attached is a tarball with the kernel logs: both the full logs and a
stripped version with only the lines generated during the actual test.
Something like this will extract the debug data from the logs:
$ grep "delay " <file> | sed "s/^.*\] //"
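
For anyone who would rather post-process the logs in a script, the pipeline
above can be sketched in Python roughly as follows. It mirrors only the grep
and sed steps (keep lines containing "delay ", then strip everything up to
and including the last "] "); the function name is hypothetical and nothing
here assumes anything further about the format of the debug lines.

```python
import re
import sys

# Same pattern as the sed expression: greedy match up to the last "] ".
PREFIX = re.compile(r"^.*\] ")


def extract_debug_lines(lines):
    """Yield lines containing 'delay ' with the leading '...] ' prefix removed."""
    for line in lines:
        if "delay " in line:  # equivalent of: grep "delay "
            # equivalent of: sed "s/^.*\] //"
            yield PREFIX.sub("", line.rstrip("\n"), count=1)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract.py <file>
    with open(sys.argv[1], errors="replace") as log:
        for entry in extract_debug_lines(log):
            print(entry)
```

Lines that contain "delay " but no "] " are passed through unchanged, which
matches what sed does when the substitution does not apply.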

Also attached is an ODF spreadsheet with a summary of the data for all 3 tests.
I've dropped the congestion_wait and sync/rw= columns as they were always 
the same (rw=1 for 1st kernel and sync=0 for 2nd kernel).
I've added a column "weighed delay" and totals for that column and the 
count column.

My layman's observations are:
- without the revert 'background_writeout' is called a lot less frequently,
  but when it's called it gets long delays
- without the revert you have 'wb_kupdate', which is relatively expensive
- with the revert 'shrink_list' is relatively expensive, although not
  really in absolute terms

You people may want to look at exactly what happens directly around the SKB 
allocation errors. I've only looked at totals.

Cheers,
FJP

Download attachment "logs.tgz" of type "application/x-tgz" (151463 bytes)

Download attachment "results.ods" of type "application/vnd.oasis.opendocument.spreadsheet" (20051 bytes)