[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B796C6D.80800@linux.vnet.ibm.com>
Date: Mon, 15 Feb 2010 16:46:53 +0100
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To: Nick Piggin <npiggin@...e.de>
CC: Mel Gorman <mel@....ul.ie>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de,
gregkh@...ell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is
set"
Nick Piggin wrote:
> On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
>>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
>>> number of parameters being passed is making some unexpected large
>>> difference
>> BINGO - this definitely hit something.
>> This experimental patch does two things - on one hand it closes the race we had:
>>
>> 4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
>> perf_count_congestion_wait 13 27 52
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
>> perf_count_pages_direct_reclaim 30867 56811 131552
>> perf_count_failed_pages_direct_reclaim 14 24 52
>> perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
>>
>> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
>>
>> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
>
> But the order parameter was always passed as constant 0 by the caller?
Uh - I didn't see that in the first look - you're right.
So the x4 in direct_reclaim calls are not caused by e.g. missing order information. Thanks for spotting this.
>
>> This leaves me with two ideas what the real issue could be:
>> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
>
> It must be something to do with code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.
after checking asm I can tell that rmqueue_bulk is inlined.
Therefore the test to inline it explicitly as requested in another mail can be considered negligible.
It actually boils down to something different in Mels patch - see below.
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
>
> What's the assembly look like?
>
Line 214763
This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which is inlined into
get_page_from_freelist.
!ORDER CHANGED FOR READABILITY!
FAST SLOW
lgr %r2,%r11 #parm 1 from r11 lgr %r2, %r11 #parm 1 from r11 - same
lghi %r3,0 #parm 2 = 0 const lghi %r3,0 #parm 2 = 0 - same
lghi %r4,-1 #parm 3 = -1 const lcr %r4,%r12 #parm 3 as complement of r12?
lgfr %r4,%r4 #parm 3 - clear upper 32 bit?
brasl #call brasl #call
=> This is most probably this part of Mel's patch:
23 @@ -965,7 +965,7 @@
24 set_page_private(page, migratetype);
25 list = &page->lru;
26 }
27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
29 spin_unlock(&zone->lock);
30 return i;
31 }
-> doesn't look too important, but to be sure we can build an executable with
just this change, but still passing order to rmqueue_bulk - see below.
Line 214800
Differend end of the function. In Slow path there is a check to %r4 and
dependent two different jump targets inside get_page_from_freelist, while in
the fast version there is only an unconditional jump (both after a
_raw_spin_lock_wait).
FAST SLOW
brasl # call _raw_spin_lock_wait brasl # _raw_spin_lock_wait
j 1e6294 get_page_from_freelist lg %r4,288(%r15) # from stack
ltgr %r4,%r4 # compare
jne 1e62a2 get_page_from_freelist
lhi %r12,0 # new call gets r12 @ 0 (GOT?)
j 1e6340 get_page_from_freelist
The rest is the same. So what is left is that the slow variant has a new branch
back in to get_page_from_freelist with r12 set to zero. This jump wents directly
after the also new lcr call seen in the first difference and might therfore be
related to that change.
Now I first thought it might not be rmqueue_bulk that misses optimization, but
__mod_zone_page_state - but that one looks the same in both cases.
Therefore I took a shot for 2.6.32 with just that snippet above applied (__mod_zone_page_state call without order which should be fine as we know it is 0 in all cases).
And it is interesting to see that this snippet alone is enough to fix throughput and the direct_reclaim -> progress&!page ratio.
The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
23 @@ -965,7 +965,7 @@
24 set_page_private(page, migratetype);
25 list = &page->lru;
26 }
27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
29 spin_unlock(&zone->lock);
30 return i;
31 }
Now that one has still the issue of the very huge throughput loss and the bad ratio of driect_reclaims leading into congestion_wait.
Now as far as I can see that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
atomic_long_add(x, &zone->vm_stat[item]);
atomic_long_add(x, &vm_stat[item]);
}
I guess the intention there is to update the zone with the number of pages freed - and I also guess that -1 as constant might be wrong there.
That would also explain some weird output I saw like this after boot:
h37lp01:~ # cat /proc/meminfo
MemTotal: 251216 kB
MemFree: 483460 kB bigger than total
BUT, and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)" even being wrong - fixed the issue in terms of throughput and the congestion_waits as good as reverting e084b+5f8dcc21!
--
GrĂ¼sse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists