linux-kernel - Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B6C3E6E.6050303@linux.vnet.ibm.com>
Date:	Fri, 05 Feb 2010 16:51:10 +0100
From:	Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To:	Mel Gorman <mel@....ul.ie>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	epasch@...ibm.com, SCHILLIG@...ibm.com,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de
Subject: Re: Performance regression in scsi sequential throughput (iozone)
 due to "e084b - page-allocator: preserve PFN ordering when	__GFP_COLD is
 set"

I'll keep the old thread below as reference.

After taking a round of ensuring reproducibility and a pile of new 
measurements I now can come back with several new insights.

FYI - I'm now running iozone triplets (4, then 8, then 16 parallel 
threads) with sequential read load and all that 4 times to find 
potential noise. But since I changed to that load instead of random read 
wit hone thread and ensuring the most memory is cleared (sync + echo 3 > 
/proc/sys/vm/drop_caches + a few sleeps) . The noise is now down <2%. 
For detailed questions about the setup feel free to ask me directly as I 
won't flood this thread too much with such details.

So in the past I identified git id e084b for bringing a huge 
degradation, that is still true, but I need to revert my old statement 
that unapplying e084b on v2.6.32 helps - it simply doesn't.

Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 which 
came in later. Both patches unapplied individually don't improve 
anything. But both patches reverted at the same time on git v2.6.32 
bring us back to our old values (remember that is more than 2Gb/s 
improvement in throughput)!

Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this 
can cause so much trouble.

In the past I identified congestion_wait as the point where the "time is 
lost" when comparing good and bad case. Fortunately at least this is 
still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
So I upgraded my old counter patches a bit and can now report the following:

BAD CASE
                                                                4 THREAD 
READ               8 THREAD READ               16 THREAD READ       
16THR % portions
perf_count_congestion_wait                                        
305                        1970                        8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority      
0                           0                           0
perf_count_call_congestion_wait_from_alloc_pages_slowpath         
305                        1970                        
8979                100.00%
perf_count_pages_direct_reclaim                                  
1153                        6818                        3221
perf_count_failed_pages_direct_reclaim                            
305                        1556                        8979
perf_count_failed_pages_direct_reclaim_but_progress               
305                        1478                        
8979                27.87%

GOOD CASE WITH REVERTS                                          4 THREAD 
READ               8 THREAD READ               16 THREAD READ       
16THR % portions
perf_count_congestion_wait                                         
25                          76                         1114
perf_count_call_congestion_wait_from_alloc_pages_high_priority      
0                           0                            0
perf_count_call_congestion_wait_from_alloc_pages_slowpath          
25                          76                         
1114               99.98%
perf_count_pages_direct_reclaim                                  
1054                        9150                        62297
perf_count_failed_pages_direct_reclaim                             
25                          64                         1114
perf_count_failed_pages_direct_reclaim_but_progress                
25                          57                         
1114                1.79%


I hope the format is kept, it should be good with every monospace viewer.
You can all clearly see that the patches somehow affect the ratio at 
which __alloc_pages_direct_reclaim runs into a race between try_to_free 
pages that could actually free something, but a few lines below can't 
get a page from the free list.
Outside in the function alloc_pages_slowpath this leads to a call to 
congestion_wait which is absolutely futile as there are absolutely no 
writes in flight to wait for.

Now this kills effectively up to 80% of our throughput - Any idea of 
better understanding the link between the patches and the effect is 
welcome and might lead to a solution.

FYI - I tried all patches you suggested - none of them affects this.

Updated perf counter patches are attached.
As for some questions in the old thread:

>> Back to business :-)
>> First - yes it is reproducible in 2.6.32 final and fixable by unapplying  
>> e084b.
>> But I don't think un-applying it is really a fix as no one really  
>> understands how e084b cause this regression (it might just work around  
>> whatever goes on in the background and someone still hits the hidden 
>> issue).
>>
>>     
>
> Agreed on reverting is not an option. While it is not understood why it
> causes your regression, it's known to fix a regression for hardware that
> does do merging.
>   

I'm not taking this point but fairly compared it is +15% vs -80% - so it 
is not "only" good :-)

>> Lets better look what we know summarized:
>> - the patch e084a causes this regression
>> - it causes it only in very rare if not theoretically high memory  
>> constraints
>> - time is lost between read system call and the block device layer enter
>> - via debugging and discussion we found that time lost is spent in  
>> congestion_wait
>>
>>     
>
> Thanks for the summary.
>
> There is an outside chance it's due to a difference in free page availablity
> and the associated counters. Particularly if some subsystem is optimistally
> trying high-order allocations depending on availability. If there are many
> allocation failures, the counters gets skewed due to a bug, watermarks are
> not met and direct reclaim is used more. A patch is on its way upstream to
> fix this is http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
> try it please?
>
> There is a report on swap-based workloads being impacted on 2.6.32 and a fix
> exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
> though, this patch should be applied http://lkml.org/lkml/2009/12/21/245
>   
both are not affecting my scenario at all
>   
>> I did not yet check if it is possible, but we might have a spot tho  
>> might fix the issue.
>> Congestion_wait states it would "Waits for up to @timeout jiffies for a  
>> backing_dev (any backing_dev) to exit write congestion.  If no  
>> backing_devs are congested then just wait for the next write to be  
>> completed.".
>> Now in some case (and also my test) there is absolutely no write to do  
>> or in flight. So maybe it would be possible to either detect this before  
>> calling congestion_wait from page_alloc.c or let congestion_wait return  
>> an error code or something similar.
>>
>>     
>
> While I think that idea has merit, it's fixing the symptons and not the
> underlying problem. There is a growing opinion that congestion_wait() being
> used in the VM at all is a mistake. If the VM really needs to wait on pages
> being written out, it should have been on a waitqueue for a number of pages
> rather than a time-based method.
>
> In other words, I'm more interesting in figuring out *why* we enter
> congestion_wait() at all with the patch and it's avoided without rather than
> what congestion_wait() does when it gets called.
>
> Also, are you sure about thw "waiting for the next write to complete"? In
> 2.6.31 at least, it's waiting on the async write queue. If it's a case that
> it waits the full timeout if there is no async traffic then that's obviously
> bad as well but still doesn't explain why the problem patch makes a difference.
>   
I agree that this part of the discussion is about symptoms and for now 
we should try to focus on the cause that avtually brings up the race in 
__alloc_pages_direct_reclaim.
>> I mean as far as I can see the kernel currently does a loop like that  
>> (pseudo code and simplified):
>> 1. get a page <- fail
>> 2. try_to_free page <- returns >=1 pages freed
>> 3. get_page_from_freelist <- fails
>> 4. should_alloc_retry <- true
>> 5. congestion_wait <- FUTILE - just loosing time, nothing will be freed  
>> by writes
>> 6. goto 1
>> -> So maybe we should detect the futility of that congestion_wait call  
>> and not do it at all repeating instantly, go to oom, .., whatever (thats  
>> out for discussion).
>>
>> If we can't fix it that way I would propose we keep the code as is,  
>>     
>
> It's not my preferred way to fix this although I'm willing to explore it.
> I'd much prefer an explanation as to why your testcase is sensitive to the
> PFN ordering of the pages it gets back.
>
> One possibility is that your testcase requires a number of
> high-order allocations and the patch is preventing them being
> merged. If that was the case, the following patch would also help
> http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.
>
>   
again not affecting my workload at all

>> hoping that it is a theoretical case that never hits a customer system.  
>> But in that case we should definitely add a userspace readable counter  
>> to congestion_wait similar to my debug patches. This would allow  
>> everyone that ever assumes an issue might be related to this to verify  
>> it by checking the congestion_wait counter.
>>     
>
> Would the accounting features available with
> CONFIG_TASK_DELAY_ACCT be sufficient?  There is a description in
> Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
> amount of time spent on IO might explain the problem.
>
>   
Eventually it will appear as a huge increase in I/O-wait for no obvious 
reason - DELAY_ACCT is fine, actually even plain iostat is be enough.
The congestion_wait counter would just help to differ this issue, from 
others also increasing I/O-wait - but that is not too important.
I think we can all live without a counter and better fix the cause :-)
>> This should not be too hard in a technical way, but also not performance  
>> critical as congestion_wait is going to wait anyway. On the other hand  
>> everyone hates introducing new kernel interfaces that need to be kept  
>> compatible until nirvana - especially for bug tracking its not the best  
>> idea :-) So it should be our very last resort.
>>
>>     
>
> Agreed.
>
>   


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization 


View attachment "perf-count-congestion-wait.diff" of type "text/x-patch" (2630 bytes)

View attachment "perf-count-congestion-wait-calls.diff" of type "text/x-patch" (3604 bytes)

View attachment "perf-count-failed-direct-recalims.diff" of type "text/x-patch" (2462 bytes)