Date:	Mon, 4 Jan 2010 16:05:37 +0200
From:	Pekka Enberg <penberg@...helsinki.fi>
To:	Mel Gorman <mel@....ul.ie>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Corrado Zoccolo <czoccolo@...il.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Christoph Lameter <cl@...ux-foundation.org>
Subject: Re: [PATCH] page allocator: Reduce fragmentation in buddy allocator 
	by adding buddies that are merging to the tail of the free lists

On Mon, Jan 4, 2010 at 3:55 PM, Mel Gorman <mel@....ul.ie> wrote:
> From: Corrado Zoccolo <czoccolo@...il.com>
>
> In order to reduce fragmentation, this patch classifies freed pages into
> two groups according to their probability of being part of a high order
> merge. Pages belonging to a compound whose next-highest buddy is free are
> more likely to be part of a high order merge in the near future, so they
> will be added at the tail of the freelist. The remaining pages are put at
> the front of the freelist.
>
> In this way, the pages that are more likely to cause a big merge are kept
> free longer. Consequently there is a tendency to aggregate the long-living
> allocations on a subset of the compounds, reducing the fragmentation.
>
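
For readers without mm/page_alloc.c in their head: the "next-highest
buddy" test described above is plain index bit-twiddling. Below is a
minimal userspace sketch with simplified stand-ins for
__page_find_buddy() and __find_combined_index(); it illustrates the
idea, not the kernel code itself:

#include <stdio.h>

/* Buddy of the block at 'idx' with size 2^order: flip bit 'order'. */
static unsigned long buddy_of(unsigned long idx, unsigned int order)
{
        return idx ^ (1UL << order);
}

/* Index of the order+1 block the pair merges into: clear bit 'order'. */
static unsigned long combined_of(unsigned long idx, unsigned int order)
{
        return idx & ~(1UL << order);
}

int main(void)
{
        /* A block at index 4 is freed and merging stopped at order 1,
         * i.e. pages 4-5 are free but the buddy at 6-7 is not. */
        unsigned long idx = 4;
        unsigned int order = 1;

        /* The order-2 block that 4-5 would merge into is 4-7 ... */
        unsigned long combined = combined_of(idx, order);

        /* ... and its own order-2 buddy is 0-3. If 0-3 is free, a merge
         * up to order 3 becomes likely once 6-7 is freed, so 4-5 goes to
         * the tail of the order-1 free list instead of the head. */
        printf("higher-order buddy index: %lu\n",
               buddy_of(combined, order + 1));
        return 0;
}
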
> This heuristic was tested on three machines, x86, x86-64 and ppc64, each
> with 3GB of RAM. The tests were kernbench, netperf, sysbench and
> STREAM for performance and a high-order stress test for huge page allocations.
>
> KernBench X86
>              kernbench-vanilla     pgalloc-delay
> Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
> User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
> System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
> CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
>
> KernBench X86-64
>              kernbench-vanilla     pgalloc-delay
> Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
> User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
> System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
> CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
>
> KernBench PPC64
>              kernbench-vanilla     pgalloc-delay
> Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
> User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
> System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
> CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
>
> Nothing notable for kernbench.
>
> NetPerf UDP X86
>    size    netperf-udp-vanilla     pgalloc-delay
>      64    42.68 ( 0.00%)     42.77 ( 0.21%)
>     128    85.62 ( 0.00%)     85.32 (-0.35%)
>     256   170.01 ( 0.00%)    168.76 (-0.74%)
>    1024   655.68 ( 0.00%)    652.33 (-0.51%)
>    2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
>    3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
>    4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
>    8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
>   16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
>
> NetPerf UDP X86-64
>    size    netperf-udp-vanilla     pgalloc-delay
>      64   148.82 ( 0.00%)    154.92 ( 3.94%)
>     128   298.96 ( 0.00%)    312.95 ( 4.47%)
>     256   583.67 ( 0.00%)    626.39 ( 6.82%)
>    1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
>    2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
>    3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
>    4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
>    8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
>   16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
>             1.64%             2.73%
>
> NetPerf UDP PPC64
>    size    netperf-udp-vanilla     pgalloc-delay
>      64    49.98 ( 0.00%)     50.25 ( 0.54%)
>     128    98.66 ( 0.00%)    100.95 ( 2.27%)
>     256   197.33 ( 0.00%)    191.03 (-3.30%)
>    1024   761.98 ( 0.00%)    785.07 ( 2.94%)
>    2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
>    3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
>    4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
>    8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
>   16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
>
> The tests are run until the confidence limits are within 1%. Results marked
> with a * did not reach that confidence level, although in these cases they
> were only outside it by small amounts. Even allowing for the results that
> were not confident, the netperf UDP results were generally positive.
>
> NetPerf TCP X86
>    size    netperf-tcp-vanilla     pgalloc-delay
>      64   652.25 ( 0.00%)*   648.12 (-0.64%)*
>            23.80%            22.82%
>     128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
>            21.03%            18.90%
>     256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
>             1.00%            16.46%
>    1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
>            13.37%            11.39%
>    2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
>             9.76%            12.48%
>    3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
>             6.49%             8.75%
>    4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
>             9.85%             8.50%
>    8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
>             9.13%            13.04%
>   16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
>             7.73%             8.68%
>
> NetPerf TCP X86-64
>    size    netperf-tcp-vanilla     pgalloc-delay
>      64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
>             5.79%             4.78%
>     128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
>             3.68%             6.06%
>     256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
>             3.02%             2.10%
>    1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
>             5.89%             6.55%
>    2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
>             7.08%             7.44%
>    3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
>             6.87%             7.33%
>    4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
>             6.86%             8.18%
>    8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
>             7.49%             5.55%
>   16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
>             7.36%             6.49%
>
> NetPerf TCP PPC64
>    size    netperf-tcp-vanilla     pgalloc-delay
>      64   594.17 ( 0.00%)    596.04 ( 0.31%)*
>             1.00%             2.29%
>     128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
>             1.30%             1.40%
>     256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
>             1.25%             1.00%
>    1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
>             1.02%             1.00%
>    2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
>             1.15%             1.04%
>    3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
>    4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
>    8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
>   16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
>
> There was a distinct lack of confidence in the X86* figures, so I included
> the deviation where the results were not confident. Many of the results,
> whether gains or losses, were within the standard deviation, so no solid
> conclusion can be reached on the performance impact. Looking at the figures,
> only the X86-64 ones look suspicious, with a few losses that were outside
> the noise. However, the results were so unstable that, without knowing why
> they vary so much, a solid conclusion cannot be reached.
>
> SYSBENCH X86
>              sysbench-vanilla     pgalloc-delay
>           1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
>           2 14901.11 ( 0.00%) 13683.44 (-8.90%)
>           3 15171.71 ( 0.00%) 14888.25 (-1.90%)
>           4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
>           5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
>           6 14870.33 ( 0.00%) 14845.57 (-0.17%)
>           7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
>           8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
>
> SYSBENCH X86-64
>              sysbench-vanilla     pgalloc-delay
>           1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
>           2 34276.39 ( 0.00%) 34251.00 (-0.07%)
>           3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
>           4 66667.10 ( 0.00%) 66174.69 (-0.74%)
>           5 66003.91 ( 0.00%) 65685.25 (-0.49%)
>           6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
>           7 64933.16 ( 0.00%) 64379.23 (-0.86%)
>           8 63353.30 ( 0.00%) 63281.22 (-0.11%)
>           9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
>          10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
>          11 62092.81 ( 0.00%) 61787.75 (-0.49%)
>          12 61330.11 ( 0.00%) 61036.34 (-0.48%)
>          13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
>          14 62304.48 ( 0.00%) 62064.90 (-0.39%)
>          15 63296.48 ( 0.00%) 62875.16 (-0.67%)
>          16 63951.76 ( 0.00%) 63769.09 (-0.29%)
>
> SYSBENCH PPC64
>              sysbench-vanilla     pgalloc-delay
>           1  7645.08 ( 0.00%)  7467.43 (-2.38%)
>           2 14856.67 ( 0.00%) 14558.73 (-2.05%)
>           3 21952.31 ( 0.00%) 21683.64 (-1.24%)
>           4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
>           5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
>           6 27477.10 ( 0.00%) 27337.45 (-0.51%)
>           7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
>           8 26642.91 ( 0.00%) 25274.33 (-5.41%)
>           9 25137.27 ( 0.00%) 24810.06 (-1.32%)
>          10 24451.99 ( 0.00%) 24275.85 (-0.73%)
>          11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
>          12 24234.81 ( 0.00%) 23640.89 (-2.51%)
>          13 24577.75 ( 0.00%) 24433.50 (-0.59%)
>          14 25640.19 ( 0.00%) 25116.52 (-2.08%)
>          15 26188.84 ( 0.00%) 26181.36 (-0.03%)
>          16 26782.37 ( 0.00%) 26255.99 (-2.00%)
>
> Again, there is little to conclude here. While there are a few losses,
> the results vary by +/- 8% in some cases. These are the results of most
> concern, as there are some large losses, but they are also within the
> variance typically seen between kernel releases.
>
> The STREAM results varied so little and are so verbose that I didn't
> include them here.
>
> The final test stressed how many huge pages can be allocated. The
> absolute number of huge pages allocated is the same with or without the
> patch. However, the "unusable free space index", which is a measure of
> external fragmentation, was slightly lower (lower is better) throughout
> the lifetime of the system. I also measured the latency of how long it
> took to successfully allocate a huge page. The latency was slightly
> lower, and on X86 and PPC64 more huge pages were allocated almost
> immediately from the free lists. The improvement is slight but there.
>
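
As an aside, for anyone wanting to reproduce the measurement: my
understanding of the "unusable free space index" is that it is the
fraction of free memory sitting in blocks too small to satisfy an
allocation of the given order. A rough sketch under that assumption,
where nr_free[i] counts free blocks of order i as in the buddy
freelists:

#include <stdio.h>

/*
 * Fraction of free pages in blocks of order < 'order', i.e. unusable
 * for an order-'order' allocation without further merging. 0.0 means
 * no fragmentation at this order; 1.0 means all free memory is
 * unusable. (Sketch of the assumed definition, not kernel code.)
 */
static double unusable_free_index(const unsigned long *nr_free,
                                  unsigned int max_order,
                                  unsigned int order)
{
        unsigned long total = 0, usable = 0;
        unsigned int i;

        for (i = 0; i < max_order; i++)
                total += nr_free[i] << i;       /* 2^i pages per block */
        for (i = order; i < max_order; i++)
                usable += nr_free[i] << i;      /* big enough for 'order' */

        return total ? (double)(total - usable) / total : 0.0;
}

int main(void)
{
        /* Hypothetical freelist: 16 order-0, 4 order-1, 1 order-3 block. */
        unsigned long nr_free[11] = { 16, 4, 0, 1, 0 };

        /* Free pages = 16 + 8 + 8 = 32; only the order-3 block (8 pages)
         * can satisfy an order-3 request, so 24/32 = 0.75 is unusable. */
        printf("unusable index for order 3: %.2f\n",
               unusable_free_index(nr_free, 11, 3));
        return 0;
}
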
> [mel@....ul.ie: Tested, reworked for fewer branches]
> Signed-off-by: Mel Gorman <mel@....ul.ie>
> ---
>  mm/page_alloc.c |   27 ++++++++++++++++++++++++---
>  1 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2bc2ac6..fe7017e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -451,6 +451,7 @@ static inline void __free_one_page(struct page *page,
>                int migratetype)
>  {
>        unsigned long page_idx;
> +       unsigned long combined_idx;
>
>        if (unlikely(PageCompound(page)))
>                if (unlikely(destroy_compound_page(page, order)))
> @@ -464,7 +465,6 @@ static inline void __free_one_page(struct page *page,
>        VM_BUG_ON(bad_range(zone, page));
>
>        while (order < MAX_ORDER-1) {
> -               unsigned long combined_idx;
>                struct page *buddy;
>
>                buddy = __page_find_buddy(page, page_idx, order);
> @@ -481,8 +481,29 @@ static inline void __free_one_page(struct page *page,
>                order++;
>        }
>        set_page_order(page, order);
> -       list_add(&page->lru,
> -               &zone->free_area[order].free_list[migratetype]);
> +
> +       /*
> +        * If this is not the largest possible page, check if the buddy
> +        * of the next-highest order is free. If it is, it's possible
> +        * that pages are being freed that will coalesce soon. In case
> +        * that is happening, add the free page to the tail of the list
> +        * so it's less likely to be used soon and more likely to be merged
> +        * as a higher order page.
> +        */
> +       if (order < MAX_ORDER-1) {
> +               struct page *higher_page, *higher_buddy;
> +               combined_idx = __find_combined_index(page_idx, order);
> +               higher_page = page + combined_idx - page_idx;
> +               higher_buddy = __page_find_buddy(higher_page, combined_idx, order + 1);
> +               if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> +                       list_add_tail(&page->lru,
> +                               &zone->free_area[order].free_list[migratetype]);
> +                       goto out;
> +               }
> +       }
> +
> +       list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> +out:
>        zone->free_area[order].nr_free++;
>  }
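
The tail placement works because the allocation fast path pops pages
from the head of the free list, so a tail insertion keeps a
likely-to-merge page out of circulation for as long as possible. A toy
illustration with a minimal stand-in for the <linux/list.h> helpers
(not the kernel code):

#include <stdio.h>

/* Minimal circular doubly linked list, like the kernel's list_head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h)
{
        h->prev = h->next = h;
}

/* Insert 'n' right after 'h' (head insertion, like list_add()). */
static void list_add_head(struct node *n, struct node *h)
{
        n->next = h->next;
        n->prev = h;
        h->next->prev = n;
        h->next = n;
}

/* Insert 'n' just before 'h' (tail insertion, like list_add_tail()). */
static void list_add_tail_(struct node *n, struct node *h)
{
        list_add_head(n, h->prev);
}

int main(void)
{
        struct node head, a, b;
        list_init(&head);

        list_add_head(&a, &head);   /* 'a' at the head: handed out first */
        list_add_tail_(&b, &head);  /* 'b' at the tail: handed out last  */

        /* The allocator pops head.next, so 'a' is reused before 'b':
         * tail placement buys 'b' time for its buddies to be freed. */
        printf("allocated first: %s\n", head.next == &a ? "a" : "b");
        return 0;
}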

FWIW,

Reviewed-by: Pekka Enberg <penberg@...helsinki.fi>
