Message-ID: <4B2B85C7.80502@linux.vnet.ibm.com>
Date: Fri, 18 Dec 2009 14:38:15 +0100
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To: Mel Gorman <mel@....ul.ie>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is
set"
Christian Ehrhardt wrote:
> Mel Gorman wrote:
> [...]
>>>
>>
>> One way of finding out if cache hotness was the problem would be to
>> profile for cache misses and see if there are massive differences
>> with and without the patch. Is that an option?
>>
> It is an option to verify things, but as mentioned above I would
> expect an increased amount of consumed cycles per kb, which I don't
> see. I'll track caches anyway to be sure.
>
> My current assumption is that either some time is lost between the
> read syscall and the A event that blocktrace tracks, or my assumption
> about the user processes is wrong and iozone simply runs "slower".
>

After verifying that the time is lost above the block device layer, I
ran further tests to pin down where the additional latency introduced
by commit e084b occurs.
With "strace -frT -e trace=open,close,read,write" I could isolate the
continuous stream of read calls done by the iozone child process.
In those data two things were revealed:
a) The time is lost between the read syscall and block device entry
   (so not e.g. due to slow userspace execution or scheduling effects).
   The time spent in userspace until the next read call is almost the
   same, but the average time spent "inside" the read call increases
   from 788µs to 3278µs.
b) The time is lost in a few big peaks.
   The majority of read calls take ~500-750µs in both cases, but
   approximately 8% of the calls have shifted and now show durations
   between 15000 and 60000µs.
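
For illustration only (this is just a sketch of my own, not one of the
attached patches): those per-call durations can be aggregated from the
strace output with a trivial helper along these lines, assuming the
usual -T format with the duration in seconds at the end of each
completed call:

/*
 * Throwaway helper sketch: reads "strace -frT" output on stdin, picks
 * out the completed read() lines (each ends in a duration like
 * "<0.000788>", in seconds) and prints the average read() duration
 * plus the share of calls above 15ms.
 */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];
	double sum = 0.0;
	long calls = 0, slow = 0;

	while (fgets(line, sizeof(line), stdin)) {
		char *dur = strrchr(line, '<');	/* -T duration comes last */

		if (!strstr(line, "read(") || !dur ||
		    !isdigit((unsigned char)dur[1]))
			continue;	/* skip unfinished/resumed lines */

		double secs = strtod(dur + 1, NULL);

		sum += secs;
		calls++;
		if (secs > 0.015)	/* the 15000-60000µs peaks */
			slow++;
	}

	if (calls)
		printf("reads: %ld  avg: %.0f us  >15ms: %ld (%.1f%%)\n",
		       calls, sum * 1e6 / calls, slow,
		       100.0 * slow / calls);
	return 0;
}
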
As mentioned before there is no significant increase in consumed cpu
cycles - therefore I looked at how the read call could end up sleeping
(adding latency without consuming cycles).
I found that the allocator can run into "congestion_wait" when
__alloc_pages_direct_reclaim() was able to free some pages with
try_to_free_pages() but still doesn't get a page in the subsequent
get_page_from_freelist().
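
To spell out the pattern I mean, here is a tiny stand-alone model of
that branch (my own sketch - reclaim_some_pages() and
grab_page_from_freelist() are made-up stand-ins, not the real mm/
functions): reclaim reports progress, the immediate freelist retry
still comes back empty, and the task sleeps before trying again.

/*
 * Stand-alone model (not kernel code) of the branch described above:
 * direct reclaim makes progress, but the immediate retry of the
 * freelist still fails, so the caller sleeps (congestion_wait in the
 * real code) and tries again.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long sleeps;	/* would be congestion_wait() calls */

/* made-up stand-ins for try_to_free_pages()/get_page_from_freelist() */
static unsigned long reclaim_some_pages(void)	 { return 32; }
static bool grab_page_from_freelist(int attempt) { return attempt >= 2; }

static bool direct_reclaim_then_alloc(int attempt)
{
	if (reclaim_some_pages() && grab_page_from_freelist(attempt))
		return true;	/* reclaim helped and we got a page */

	/* the "semi-failed" case: progress reported, but still no page */
	return false;
}

int main(void)
{
	for (int attempt = 0; !direct_reclaim_then_alloc(attempt); attempt++)
		sleeps++;	/* here the real code waits for congestion */

	printf("slept %lu times before getting a page\n", sleeps);
	return 0;
}
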
To get details about that I patched in some simple counters and
ktime_get based timings. The results are striking and point exactly
towards the congestion_wait theory.
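
The actual patches are attached at the end; just to illustrate the kind
of instrumentation they add (again only a sketch of my own - the names
and the msleep() stand-in are made up), the pattern is a plain counter
plus a ktime_get() based accumulation around the call of interest:

/*
 * Sketch of the instrumentation pattern only (the real patches are
 * attached): a global counter plus ktime_get() based timing around a
 * call of interest. Here an msleep() stands in for the timed call.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/delay.h>

static unsigned long perf_count_calls;	/* how often we got here */
static u64 perf_time_ns;		/* total time spent in the call */

static void timed_call(void)
{
	ktime_t start = ktime_get();

	msleep(10);	/* stand-in for e.g. the f_op->aio_read call */

	perf_time_ns += ktime_to_ns(ktime_sub(ktime_get(), start));
	perf_count_calls++;
}

static int __init perf_sketch_init(void)
{
	timed_call();
	printk(KERN_INFO "perf sketch: %lu calls, %llu ns\n",
	       perf_count_calls, (unsigned long long)perf_time_ns);
	return 0;
}

static void __exit perf_sketch_exit(void)
{
}

module_init(perf_sketch_init);
module_exit(perf_sketch_exit);
MODULE_LICENSE("GPL");
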
The following data was again taken with iozone running a single thread
of random reads in a very low memory environment; the counters were
zeroed before starting iozone.
Bad = the kernel with just commit e084b applied, good = the kernel one
commit before e084b. While taking this data, the throughput of the good
case was approximately 168% of that of the bad case.

                                          GOOD    BAD
perf_count_congestion_wait                 245   1388
perf_count_pages_direct_reclaim          10993   5365
perf_count_failed_pages_direct_reclaim       0   1090

Note - the data also shows that the average time spent in the
f_op->aio_read call issued by do_sync_read increases by 68.5%. That is
exactly the inverse of the throughput loss: 1/1.685 ≈ 0.59, i.e. the
bad case reaches only ~59% of the good case's throughput, matching the
~168% ratio quoted above.
This lets me conclude that patch e084b causes the system to run into
congestion_wait after such a semi-failed call to
__alloc_pages_direct_reclaim.
Unfortunately I don't understand memory management well enough to see
the relation between e084b changing the order and position of freed
pages in the LRU list and __alloc_pages_direct_reclaim no longer
getting pages as effectively as before.
Ideas and comments welcome!
Kind regards,
Christian
P.S.: attached are the three patches I used to get these new numbers.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Linux Performance on System z
View attachment "perf-count-and-time-do-sync-read.diff" of type "text/x-patch" (3683 bytes)
View attachment "perf-count-congestion-wait.diff" of type "text/x-patch" (2750 bytes)
View attachment "perf-count-failed-direct-recalims.diff" of type "text/x-patch" (2065 bytes)