Message-ID: <4B225B9E.2020702@linux.vnet.ibm.com>
Date:	Fri, 11 Dec 2009 15:47:58 +0100
From:	Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To:	Mel Gorman <mel@....ul.ie>
CC:	Narayanan Gopalakrishnan <narayanan.g@...sung.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	epasch@...ibm.com, SCHILLIG@...ibm.com,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	christof.schmitt@...ibm.com, thoss@...ibm.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
 due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is
 set"

Mel Gorman wrote:
> On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
>   
>> Keeping the old discussion in the mail tail, adding the new information  
>> up here where everyone finds it :-)
>>
>> Things I was able to confirm so far summarized:
>> - The controller doesn't care about pfn ordering in any way (proved by  
>> HW statistics)
>> - regression appears in sequential AND random workloads -> also without  
>> readahead
>> - oprofile & co are not an option atm.
>>  The effective consumed cpu cycles per transferred kb are almost the  
>> same, so I would not expect sampling to give us huge insights.
>>  Therefore I expect that it is more a matter of lost time (latency) than 
>> of more expensive tasks (cpu consumption).
>>     
>
> But earlier, you said that latency was lower - "latency statistics clearly
> state that your patch is working as intended - the latency from entering
> the controller until the interrupt to linux device driver is ~30% lower!."
>   
That's right, but the pure hardware time is only lower because there is 
less I/O in flight.
With less concurrency and contention, a single I/O is faster in hardware.
But in hardware it is so fast in both cases that, as verified in the 
Linux layers, it doesn't matter.
Both cases take more or less the same time from an I/O entering the 
block device layer until completion.
> Also, if the controller is doing no merging of IO requests, why is the
> interrupt rate lower?
>   
I was wondering about that when I started to work on this too, but the 
answer is simply that fewer requests come in per second - and that 
implies a lower interrupt rate too.
>>  I don't want to preclude it completely, but sampling has to wait as  
>> long as we have better leads to follow.
>>
>> So the question is where time is lost in Linux. I used blktrace to  
>> create latency summaries.
>> I only list the random case for discussion as the effects are clearer  
>> in that data.
>> Abbreviations are (like the blkparse man page explains) - sorted in  
>> order it would appear per request:
>>       A -- remap For stacked devices, incoming i/o is remapped to device 
>> below it in the i/o stack. The remap action details what exactly is being 
>> remapped to what.
>>       G -- get request To send any type of request to a block device, a  
>> struct request container must be allocated first.
>>       I -- inserted A request is being sent to the i/o scheduler for  
>> addition to the internal queue and later service by the driver. The  
>> request is fully formed at this time.
>>       D -- issued A request that previously resided on the block layer  
>> queue or in the i/o scheduler has been sent to the driver.
>>       C -- complete A previously issued request has been completed.  The 
>> output will detail the sector and size of that request, as well as the 
>> success or failure of it.
>>
>> The following table shows the average latencies from A to G, G to I and  
>> so on.
>> C2A is special and tries to summarize how long it takes after completing  
>> an I/O until the next one arrives in the block device layer.
>>
>>                     avg-A2G    avg-G2I    avg-I2D   avg-D2C    avg-C2A-in-avg+-stddev    %C2A-in-avg+-stddev
>> deviation good->bad    -3.48%    -0.56%    -1.57%    -1.31%         128.69%                 97.26%
>>
>> It clearly shows that all latencies are almost equal once the block  
>> device layer and the device driver are involved. Remember that the  
>> throughput of the good vs. the bad case differs by more than 3x.
>> But we can also see that the value of C2A increases by a huge amount.  
>> That huge C2A increase leads me to assume that the time is actually  
>> lost "above" the block device layer.
>>
>>     
>
> To be clear. As C is "completion" and "A" is remapping new IO, it
> implies that time is being lost between when one IO completes and
> another starts, right?
>
>   
Absolutely correct
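
For reference, a minimal sketch of one way to boil blkparse output down
to such a C2A summary (illustrative only; the field indices assume the
default blkparse output format, where the 4th whitespace-separated
column is the timestamp and the 6th the event character):

#!/usr/bin/env python
# Minimal sketch: average C -> next A ("C2A") gap from blkparse text
# output.  Field indices assume the default blkparse line format, i.e.
# the 4th whitespace-separated column is the timestamp in seconds and
# the 6th is the event character (A, G, I, D, C, ...); adjust them if a
# custom -f format string is used.
import sys

last_complete = None   # timestamp of the most recent C (complete) event
gaps = []              # completion -> next arrival gaps, in seconds

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 6:
        continue
    try:
        ts = float(fields[3])
    except ValueError:
        continue               # skip summary lines at the end of the trace
    event = fields[5]
    if event == 'C':
        last_complete = ts
    elif event == 'A' and last_complete is not None:
        gaps.append(ts - last_complete)
        last_complete = None   # pair each completion with the next arrival only

if gaps:
    print("C2A samples: %d  avg: %.6fs  max: %.6fs"
          % (len(gaps), sum(gaps) / len(gaps), max(gaps)))

Something like "blkparse -i <trace base> | python c2a.py" (script name
made up) then gives one number per run to compare the good and the bad
kernel.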
>> I don't expect the execution speed of iozone as user process itself is  
>> affected by commit e084b,
>>     
> Not by this much anyway. Lets say cache hotness is a problem, I would
> expect some performance loss but not this much.
>   
I agree; even if it were cache hotness, it wouldn't account for that much.
Cold caches would also appear as "longer" instructions, because they 
would need more cycles due to e.g. dcache misses.
But as mentioned before, the cycles per transferred amount of data are 
the same, so I don't expect the cause to be cache hot/cold.

>> so the question is where the time is lost  
>> between the "read" issued by iozone and entering the block device layer.
>> Actually I expect it somewhere in the area of getting a page cache page  
>> for the I/O. On one hand, page handling is what commit e084b changes,  
>> and on the other hand, pages are under pressure (sysstat vm effectiveness  
>> ~100%, >40% scanned directly in both cases).
>>
>> I'll continue hunting down the lost time - maybe with ftrace if it is  
>> not concealing the effect by its invasiveness -, any further  
>> ideas/comments welcome.
>>
>>     
>
> One way of finding out if cache hottness was the problem would be to profile
> for cache misses and see if there are massive differences with and without
> the patch. Is that an option?
>   
It is an option to verify things, but as mentioned above I would expect 
an increased amount of consumed cycles per kb, which I don't see.
I'll track caches anyway to be sure.

My personal current assumption is that either there is some time lost 
between the read syscall and the A event that blktrace tracks, or I'm 
wrong in my assumption about user processes and iozone itself runs "slower".
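
As a side note for readers without the commit at hand: as far as I
understand it, e084b boils down to the insertion order into the per-cpu
free list when __GFP_COLD is set (head vs. tail insertion in
rmqueue_bulk, from memory - this is not the kernel code). A toy
illustration of why that ordering matters:

# Toy model, not kernel code: a batch of pages leaves the buddy
# allocator in ascending PFN order.  Inserting each one at the head of
# the destination list (list_add behaviour) hands the batch back in
# reverse PFN order; inserting at the tail (list_add_tail, which e084b
# uses for __GFP_COLD allocations, as far as I recall) preserves the
# ascending order that the commit expects some controllers to benefit from.
batch = [1000, 1001, 1002, 1003]          # PFNs as produced by the allocator

head_inserted = []
tail_inserted = []
for pfn in batch:
    head_inserted.insert(0, pfn)          # like list_add
    tail_inserted.append(pfn)             # like list_add_tail

print("list_add order:      %s" % head_inserted)   # [1003, 1002, 1001, 1000]
print("list_add_tail order: %s" % tail_inserted)   # [1000, 1001, 1002, 1003]

Whether that ordering actually helps the controller here is exactly what
is in question in this thread - the HW statistics above say it doesn't
care.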

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 

