[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B4F0E60.1020601@linux.vnet.ibm.com>
Date: Thu, 14 Jan 2010 13:30:24 +0100
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To: Mel Gorman <mel@....ul.ie>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is
set"
Mel Gorman wrote:
> On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
>
>> Christian Ehrhardt wrote:
>>
>> [...]
> You mention that the "GOOD" kernel is one commit before e084b but can
> you test with just that patch reverted please? As the patch is only
> about list manipulation, I'm failing to see how it can affect
> congestion_wait().
>
>
>> Unfortunately I don't understand memory management enough to find the
>> relation between e084b actually changing the order and position of freed
>> pages in the LRU list to __alloc_pages_direct_reclaim not getting pages
>> as effectively as before.
>>
>
> Off-hand, neither can I. I regret that I'm going off-line for the
> Christmas very soon and I'll be in the middle of nowhere with no
> internet connection or test machines in the interim. It's going to be
> the new year before I get to look at this properly.
>
> Even if this patch is not the real problem, your instrumentation patches
> will help pin down the real problem. Thanks and sorry I'll be delayed in
> getting back to you properly.
>
Back to business :-)
First - yes it is reproducible in 2.6.32 final and fixable by unapplying
e084b.
But I don't think un-applying it is really a fix as no one really
understands how e084b cause this regression (it might just work around
whatever goes on in the background and someone still hits the hidden issue).
Lets better look what we know summarized:
- the patch e084a causes this regression
- it causes it only in very rare if not theoretically high memory
constraints
- time is lost between read system call and the block device layer enter
- via debugging and discussion we found that time lost is spent in
congestion_wait
I did not yet check if it is possible, but we might have a spot tho
might fix the issue.
Congestion_wait states it would "Waits for up to @timeout jiffies for a
backing_dev (any backing_dev) to exit write congestion. If no
backing_devs are congested then just wait for the next write to be
completed.".
Now in some case (and also my test) there is absolutely no write to do
or in flight. So maybe it would be possible to either detect this before
calling congestion_wait from page_alloc.c or let congestion_wait return
an error code or something similar.
I mean as far as I can see the kernel currently does a loop like that
(pseudo code and simplified):
1. get a page <- fail
2. try_to_free page <- returns >=1 pages freed
3. get_page_from_freelist <- fails
4. should_alloc_retry <- true
5. congestion_wait <- FUTILE - just loosing time, nothing will be freed
by writes
6. goto 1
-> So maybe we should detect the futility of that congestion_wait call
and not do it at all repeating instantly, go to oom, .., whatever (thats
out for discussion).
If we can't fix it that way I would propose we keep the code as is,
hoping that it is a theoretical case that never hits a customer system.
But in that case we should definitely add a userspace readable counter
to congestion_wait similar to my debug patches. This would allow
everyone that ever assumes an issue might be related to this to verify
it by checking the congestion_wait counter.
This should not be too hard in a technical way, but also not performance
critical as congestion_wait is going to wait anyway. On the other hand
everyone hates introducing new kernel interfaces that need to be kept
compatible until nirvana - especially for bug tracking its not the best
idea :-) So it should be our very last resort.
Comments/Opinions to those ideas ?
--
GrĂ¼sse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists