linux-kernel - Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

Message-ID: <20191021102714.GH9379@dhcp22.suse.cz>
Date:   Mon, 21 Oct 2019 12:27:14 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Thomas Gleixner <tglx@...utronix.de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Borislav Petkov <bp@...en8.de>, Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits
 after init completes

On Mon 21-10-19 10:48:06, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
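For reference, the percentages in the kernbench table follow from a plain relative-change calculation against the vanilla baseline. A minimal sketch in userspace C; the helper name pct_gain is hypothetical and is not part of mmtests:

```c
/* Illustrative only: pct_gain() is a hypothetical helper, not mmtests code. */

/*
 * Relative improvement of `patched` over `baseline`, in percent.
 * Lower times are better, so a positive result means the patched
 * kernel used less time and a negative result means it used more.
 */
double pct_gain(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

With the Amean values quoted above, pct_gain(14760.30, 4448.39) comes out at roughly 69.86 (system time), pct_gain(162.42, 119.13) at roughly 26.65 (elapsed time), and pct_gain(13249.50, 16401.31) at roughly -23.79 (user time went up).
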
> Before, this was the breakdown of batch and high values over all zones:
> 
>     256               batch: 1
>     256               batch: 63
>     512               batch: 7
>     256               high:  0
>     256               high:  378
>     512               high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch:
> 
>     256               batch: 1
>     768               batch: 63
>     256               high:  0
>     768               high:  378
> 
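The values above fit the way the batch is derived from zone->managed_pages. The following is a userspace sketch written from memory of the v5.4 logic in mm/page_alloc.c (zone_batchsize() and pageset_set_batch()), so treat it as an approximation rather than the kernel source. The names sketch_batch, sketch_high and sketch_effective_batch are hypothetical.

```c
/*
 * Userspace approximation of how v5.4 appears to size pcpu lists.
 * All sketch_* names are hypothetical; this is not kernel code.
 */

/* Largest power of two that is <= n, for n >= 1. */
static unsigned long rounddown_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Raw batch value computed from the zone's managed page count. */
unsigned long sketch_batch(unsigned long managed_pages, unsigned long page_size)
{
	unsigned long batch = managed_pages / 1024;

	/* Cap the batch at 1 MiB worth of pages. */
	if (batch * page_size > 1024UL * 1024UL)
		batch = (1024UL * 1024UL) / page_size;

	batch /= 4;
	if (batch < 1)
		batch = 1;

	/* Land on 2^n - 1 rather than a power of two. */
	return rounddown_pow2(batch + batch / 2) - 1;
}

/* High watermark: six times the raw batch. */
unsigned long sketch_high(unsigned long raw_batch)
{
	return 6 * raw_batch;
}

/* The batch actually stored is never below 1. */
unsigned long sketch_effective_batch(unsigned long raw_batch)
{
	return raw_batch > 1 ? raw_batch : 1;
}
```

Under these assumptions and 4 KiB pages, a zone that appears to hold only 32768 managed pages (an illustrative figure, not one from the report) gets batch 7 and high 42, while a zone of 1 GiB or more hits the cap of batch 63 and high 378. A raw batch of 0 gives the "batch: 1, high: 0" rows. That is consistent with the 512 pagesets moving from 7/42 to 63/378 once managed_pages is up to date.
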
> Cc: stable@...r.kernel.org # v4.1+
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>

Acked-by: Michal Hocko <mhocko@...e.com>

> ---
>  mm/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits need to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages.  Permanently disable
>  	 * on-demand struct page initialization.
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs