Message-ID: <Pine.LNX.4.64.0703182134450.17822@skynet.skynet.ie>
Date: Sun, 18 Mar 2007 22:21:01 +0000 (GMT)
From: Mel Gorman <mel@....ul.ie>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Mariusz Kozlowski <m.kozlowski@...land.pl>,
Andy Whitcroft <apw@...dowen.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] Bias the location of pages freed for min_free_kbytes in
the same MAX_ORDER_NR_PAGES blocks
On Sun, 18 Mar 2007, Andrew Morton wrote:
> On Sun, 18 Mar 2007 20:08:49 +0000 (GMT) Mel Gorman <mel@....ul.ie> wrote:
>
>> On Sun, 18 Mar 2007, Andrew Morton wrote:
>>
>>> On Sun, 18 Mar 2007 19:05:41 +0000 (GMT) Mel Gorman <mel@....ul.ie> wrote:
>>>
>>>>> How much additional memory consumption are we expecting here?
>>>>>
>>>>
>>>> Short answer, about 1.5KB on a 1GB system of which 1.3KB is statically
>>>> defined in the 3 struct zones on a 1 node x86 system.
>>>>
>>>> Longer answer, which I hope contains no mistakes - there is the zone
>>>> overhead, which is statically sized, and a runtime overhead, which
>>>> depends on the amount of memory in the system. The additional zone
>>>> overhead is for the additional freelists (a larger struct free_area)
>>>> and is as follows:
>>>>
>>>> (MIGRATE_TYPES-1) * sizeof(struct list_head) * (MAX_ORDER-1)
>>>>
>>>> so, on 32 bit in general, that's
>>>>
>>>> 4 * 8 * 10 = 320 bytes per zone (would be 240 bytes if MIGRATE_RESERVE
>>>> is sufficient for higher order allocations instead of MIGRATE_HIGHALLOC)
>>>>
>>>> on x86 with DMA, Normal and HighMem, that's 1280 bytes. On a NUMA system,
>>>> it's 1280 bytes per node. On 64 bit, it would be double because of the
>>>> larger pointer size. At worst, I guess you are looking at 3KB per node.
>>>
>>> That's a very modest overhead - not worth the config option, IMO.
>>>
>>> The runtime overhead might be a concern - is it possible to quantify
>>> it?
>>>
>>
>> Do you mean performance wise or memory wise?
>
> CPU load. From your earlier email I'd decided memory consumption was a
> non-issue ;)
>
I figured that was the case but thought I would try to pin it down more and
offer to store the overhead in a counter just in case there is a situation
where it's a problem.
>> Memory-wise, something like
>>
>> === FLATMEM Case
>> 	bits = 0;
>> 	for_each_zone(zone) {
>> 		bits += (zone->spanned_pages >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
>> 	}
>> 	bytes_consumed = bits / 8;
>>
>> === SPARSEMEM Case, a rough approximation is
>> 	((vm_total_pages * PAGE_SIZE) >> SECTION_SIZE_BITS) * 8
>>
>> The consumption could be stored in a zone variable similar to
>> zone->present_pages and visible through /proc/zoneinfo. Would that be
>> useful?
>>
>> Performance-wise it is harder to quantify. There are three places where
>> issues can show up. The first is with allocation fallbacks where
>> __rmqueue_fallback() is called. Fallbacks are expensive but rare, except
>> when the zone is too small, which is why I should probably catch that
>> case explicitly. I used to have a counters patch for fallbacks. I could
>> bring it up to date to use __count_vm_events() to quantify fallbacks if
>> you think that would be useful.
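>>
>> For illustration, the counting would just be a one-liner in
>> __rmqueue_fallback(), something like this (PGFALLBACK is a made-up
>> event name for the sketch, not an event that exists in the tree):
>>
>> 	/* Account the fallback at the point the allocator gives up
>> 	 * on the preferred free lists */
>> 	__count_vm_events(PGFALLBACK, 1);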
>>
>> The second hotspot is where the per-cpu lists are searched for a page of
>> a suitable migrate type. An instruction-level profile when I looked at
>> this on x86 showed about 2-4% of the time spent in
>> get_page_from_freelist() was searching the per-cpu lists for a page of a
>> suitable type. IIRC, something like 85% of the time there was spent
>> clearing pages, although I'd need to double check this to be 100% sure.
>>
>> The last potential performance hotspot is where the pageblock flags are
>> read on every free in get_pageblock_flags_group(). There is probably room
>> for optimisation there. I don't have an exact quantification available at
>> the moment, but I remember it being far down the list of functions where
>> time was spent when I last looked at this.
>
> hm, well. It'd be good to drill down, quantify and, where needed, fix
> these things. Because the existence of that config option is quite
> undesirable.
After my last mail, I turned on my main desktop (Intel(R) Pentium(R) D on
32 bit) and set it going with 2.6.21-rc3-mm2 with the fix for Mariusz
applied. I built a kernel while music was playing and I went off watching
someone else make my dinner - a scientific test to be sure.
4.2% of the time in get_page_from_freelist() is spent in this
list_for_each_entry:

	/* Find a page of the appropriate migrate type */
	list_for_each_entry(page, &pcp->list, lru) {
Something like 3% is spent on one instruction:

	84768 0.0334 :c014c535: lea 0x0(%esi),%esi
Maybe I can avoid some of this by optimistically checking whether the first
entry is suitable before entering the loop, prefetching data and the like.
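Roughly what I have in mind for the optimistic check is the following (a
sketch only, not a tested patch; it assumes the free path caches the
migrate type of each free page in page->private, as the anti-fragmentation
patches do, and 'found' is a made-up label for where the allocation
continues):

	/* Optimistically check the head of the per-cpu list first */
	page = list_entry(pcp->list.next, struct page, lru);
	if (likely(page_private(page) == migratetype))
		goto found;

	/* Slow path: search the whole list for a suitable migrate type */
	list_for_each_entry(page, &pcp->list, lru)
		if (page_private(page) == migratetype)
			goto found;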
To put the loop into perspective though, 82% of the time was spent on one
instruction within __constant_c_and_count_memset() called from
prep_new_page() here:

	2059075 0.8117 :c014c6d8: rep stos %eax,%es:(%edi)
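That rep stos is the page zeroing done for __GFP_ZERO allocations. From
memory, the relevant path is roughly this (paraphrased from
prep_new_page(), so treat it as a sketch rather than a quote of the
source):

	if (gfp_flags & __GFP_ZERO)
		prep_zero_page(page, order, gfp_flags);	/* memset each page */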
On architectures with a cheaper prep_new_page(), the list search may be
more noticeable. I'll see if I can check what this looks like on ppc64
during the week, because I believe ppc64 is able to zero pages faster.
__rmqueue_fallback() didn't appear in readprofile or oprofile at all, even
though it must have been executed during boot. I guess it just wasn't
sampled enough. The overhead should be visible on ia64 machines with low
memory because MAX_ORDER_NR_PAGES is so ridiculously large there.
I have access to an IA64 machine with 1GB so I'll experiment with forcing
allocflags_to_migratetype() to return MIGRATE_UNMOVABLE when
vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES).
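Something like the following is what I have in mind for the experiment (a
sketch simplified to two migrate types; the real helper in the
anti-fragmentation patches distinguishes more of them):

	static inline int allocflags_to_migratetype(gfp_t gfp_flags)
	{
		/*
		 * If the machine is too small for each migrate type to
		 * have its own MAX_ORDER_NR_PAGES blocks, grouping pages
		 * by mobility is pointless, so treat everything as
		 * unmovable.
		 */
		if (vm_total_pages < MAX_ORDER_NR_PAGES * MIGRATE_TYPES)
			return MIGRATE_UNMOVABLE;

		return (gfp_flags & __GFP_MOVABLE) ?
				MIGRATE_MOVABLE : MIGRATE_UNMOVABLE;
	}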
get_pageblock_flags_group() was the 40th most executed function according
to oprofile. 70% of the time in that function was spent here:

	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
		if (test_bit(bitidx + start_bitidx, bitmap))
			flags |= value;
This is reading the 3 bits that store values for each MAX_ORDER_NR_PAGES
block, on every free. That code was structured to be as generic as possible
so that additional bits could be added with a minimum of fuss. Maybe, based
on the profile, I should make it specific instead. If someone wants to add
another bit, they can use the get_pageblock_bitmap() and pfn_to_bitidx()
helpers to make their own specific accessors for their new bit.
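For example, if the per-block field was padded out to a power of two so it
never straddles a word, a specific accessor could fetch all the bits in a
single read instead of looping (a sketch built on the same helpers, with a
made-up name, untested):

	static int get_pageblock_bits_fast(struct zone *zone, unsigned long pfn)
	{
		unsigned long *bitmap = get_pageblock_bitmap(zone, pfn);
		int bitidx = pfn_to_bitidx(zone, pfn);
		unsigned long word = bitmap[bitidx / BITS_PER_LONG];

		/* Extract the field; bit ordering matches test_bit() above */
		return (word >> (bitidx % BITS_PER_LONG)) &
				((1UL << NR_PAGEBLOCK_BITS) - 1);
	}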
The small zone case is the one I'm worried about. Once I verify whether
it's a problem, and fix it if necessary, I'll be happy to get rid of that
configuration item altogether.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab