[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B6C3E6E.6050303@linux.vnet.ibm.com>
Date: Fri, 05 Feb 2010 16:51:10 +0100
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To: Mel Gorman <mel@....ul.ie>
CC: Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is
set"
I'll keep the old thread below as reference.
After taking a round of ensuring reproducibility and a pile of new
measurements I now can come back with several new insights.
FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
threads) with sequential read load and all that 4 times to find
potential noise. But since I changed to that load instead of random read
wit hone thread and ensuring the most memory is cleared (sync + echo 3 >
/proc/sys/vm/drop_caches + a few sleeps) . The noise is now down <2%.
For detailed questions about the setup feel free to ask me directly as I
won't flood this thread too much with such details.
So in the past I identified git id e084b for bringing a huge
degradation, that is still true, but I need to revert my old statement
that unapplying e084b on v2.6.32 helps - it simply doesn't.
Another bisect (keepign e084b reverted) brought up git id 5f8dcc21 which
came in later. Both patches unapplied individually don't improve
anything. But both patches reverted at the same time on git v2.6.32
bring us back to our old values (remember that is more than 2Gb/s
improvement in throughput)!
Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this
can cause so much trouble.
In the past I identified congestion_wait as the point where the "time is
lost" when comparing good and bad case. Fortunately at least this is
still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
So I upgraded my old counter patches a bit and can now report the following:
BAD CASE
4 THREAD
READ 8 THREAD READ 16 THREAD READ
16THR % portions
perf_count_congestion_wait
305 1970 8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority
0 0 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath
305 1970
8979 100.00%
perf_count_pages_direct_reclaim
1153 6818 3221
perf_count_failed_pages_direct_reclaim
305 1556 8979
perf_count_failed_pages_direct_reclaim_but_progress
305 1478
8979 27.87%
GOOD CASE WITH REVERTS 4 THREAD
READ 8 THREAD READ 16 THREAD READ
16THR % portions
perf_count_congestion_wait
25 76 1114
perf_count_call_congestion_wait_from_alloc_pages_high_priority
0 0 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath
25 76
1114 99.98%
perf_count_pages_direct_reclaim
1054 9150 62297
perf_count_failed_pages_direct_reclaim
25 64 1114
perf_count_failed_pages_direct_reclaim_but_progress
25 57
1114 1.79%
I hope the format is kept, it should be good with every monospace viewer.
You can all clearly see that the patches somehow affect the ratio at
which __alloc_pages_direct_reclaim runs into a race between try_to_free
pages that could actually free something, but a few lines below can't
get a page from the free list.
Outside in the function alloc_pages_slowpath this leads to a call to
congestion_wait which is absolutely futile as there are absolutely no
writes in flight to wait for.
Now this kills effectively up to 80% of our throughput - Any idea of
better understanding the link between the patches and the effect is
welcome and might lead to a solution.
FYI - I tried all patches you suggested - none of them affects this.
Updated perf counter patches are attached.
As for some questions in the old thread:
>> Back to business :-)
>> First - yes it is reproducible in 2.6.32 final and fixable by unapplying
>> e084b.
>> But I don't think un-applying it is really a fix as no one really
>> understands how e084b cause this regression (it might just work around
>> whatever goes on in the background and someone still hits the hidden
>> issue).
>>
>>
>
> Agreed on reverting is not an option. While it is not understood why it
> causes your regression, it's known to fix a regression for hardware that
> does do merging.
>
I'm not taking this point but fairly compared it is +15% vs -80% - so it
is not "only" good :-)
>> Lets better look what we know summarized:
>> - the patch e084a causes this regression
>> - it causes it only in very rare if not theoretically high memory
>> constraints
>> - time is lost between read system call and the block device layer enter
>> - via debugging and discussion we found that time lost is spent in
>> congestion_wait
>>
>>
>
> Thanks for the summary.
>
> There is an outside chance it's due to a difference in free page availablity
> and the associated counters. Particularly if some subsystem is optimistally
> trying high-order allocations depending on availability. If there are many
> allocation failures, the counters gets skewed due to a bug, watermarks are
> not met and direct reclaim is used more. A patch is on its way upstream to
> fix this is http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
> try it please?
>
> There is a report on swap-based workloads being impacted on 2.6.32 and a fix
> exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
> though, this patch should be applied http://lkml.org/lkml/2009/12/21/245
>
both are not affecting my scenario at all
>
>> I did not yet check if it is possible, but we might have a spot tho
>> might fix the issue.
>> Congestion_wait states it would "Waits for up to @timeout jiffies for a
>> backing_dev (any backing_dev) to exit write congestion. If no
>> backing_devs are congested then just wait for the next write to be
>> completed.".
>> Now in some case (and also my test) there is absolutely no write to do
>> or in flight. So maybe it would be possible to either detect this before
>> calling congestion_wait from page_alloc.c or let congestion_wait return
>> an error code or something similar.
>>
>>
>
> While I think that idea has merit, it's fixing the symptons and not the
> underlying problem. There is a growing opinion that congestion_wait() being
> used in the VM at all is a mistake. If the VM really needs to wait on pages
> being written out, it should have been on a waitqueue for a number of pages
> rather than a time-based method.
>
> In other words, I'm more interesting in figuring out *why* we enter
> congestion_wait() at all with the patch and it's avoided without rather than
> what congestion_wait() does when it gets called.
>
> Also, are you sure about thw "waiting for the next write to complete"? In
> 2.6.31 at least, it's waiting on the async write queue. If it's a case that
> it waits the full timeout if there is no async traffic then that's obviously
> bad as well but still doesn't explain why the problem patch makes a difference.
>
I agree that this part of the discussion is about symptoms and for now
we should try to focus on the cause that avtually brings up the race in
__alloc_pages_direct_reclaim.
>> I mean as far as I can see the kernel currently does a loop like that
>> (pseudo code and simplified):
>> 1. get a page <- fail
>> 2. try_to_free page <- returns >=1 pages freed
>> 3. get_page_from_freelist <- fails
>> 4. should_alloc_retry <- true
>> 5. congestion_wait <- FUTILE - just loosing time, nothing will be freed
>> by writes
>> 6. goto 1
>> -> So maybe we should detect the futility of that congestion_wait call
>> and not do it at all repeating instantly, go to oom, .., whatever (thats
>> out for discussion).
>>
>> If we can't fix it that way I would propose we keep the code as is,
>>
>
> It's not my preferred way to fix this although I'm willing to explore it.
> I'd much prefer an explanation as to why your testcase is sensitive to the
> PFN ordering of the pages it gets back.
>
> One possibility is that your testcase requires a number of
> high-order allocations and the patch is preventing them being
> merged. If that was the case, the following patch would also help
> http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.
>
>
again not affecting my workload at all
>> hoping that it is a theoretical case that never hits a customer system.
>> But in that case we should definitely add a userspace readable counter
>> to congestion_wait similar to my debug patches. This would allow
>> everyone that ever assumes an issue might be related to this to verify
>> it by checking the congestion_wait counter.
>>
>
> Would the accounting features available with
> CONFIG_TASK_DELAY_ACCT be sufficient? There is a description in
> Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
> amount of time spent on IO might explain the problem.
>
>
Eventually it will appear as a huge increase in I/O-wait for no obvious
reason - DELAY_ACCT is fine, actually even plain iostat is be enough.
The congestion_wait counter would just help to differ this issue, from
others also increasing I/O-wait - but that is not too important.
I think we can all live without a counter and better fix the cause :-)
>> This should not be too hard in a technical way, but also not performance
>> critical as congestion_wait is going to wait anyway. On the other hand
>> everyone hates introducing new kernel interfaces that need to be kept
>> compatible until nirvana - especially for bug tracking its not the best
>> idea :-) So it should be our very last resort.
>>
>>
>
> Agreed.
>
>
--
GrĂ¼sse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
View attachment "perf-count-congestion-wait.diff" of type "text/x-patch" (2630 bytes)
View attachment "perf-count-congestion-wait-calls.diff" of type "text/x-patch" (3604 bytes)
View attachment "perf-count-failed-direct-recalims.diff" of type "text/x-patch" (2462 bytes)
Powered by blists - more mailing lists