linux-kernel - Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191018130127.GP5017@dhcp22.suse.cz>
Date:   Fri, 18 Oct 2019 15:01:27 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Thomas Gleixner <tglx@...utronix.de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Borislav Petkov <bp@...en8.de>, Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits
 after init completes

On Fri 18-10-19 11:56:05, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla         resetpcpu-v1r1
> Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
> Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
> Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
> Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
> Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
> Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)
> 
>                    5.4.0-rc3   5.4.0-rc3
>                      vanillaresetpcpu-v1r1
> Duration User       39766.24    47802.92
> Duration System     44298.10    13671.93
> Duration Elapsed      519.11      387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.

The fix makes sense. It would be nice to see the difference in the batch
sizes from the initial setup compared to the one after the deferred
intialization is done

> Cc: stable@...r.kernel.org # v4.15+

Hmm, are you sure about 4.15? Doesn't this go all the way down to
deferred initialization? I do not see any recent changes on when
setup_per_cpu_pageset is called.

> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>

Acked-by: Michal Hocko <mhocko@...e.com>

> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cafe568d36f6..0a0dd74edc83 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn)
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits needs to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	zone_pcp_update(zone);
> +
>  zone_empty:
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> @@ -8516,7 +8524,6 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
>  	WARN(count != 0, "%d pages are still in use!\n", count);
>  }
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG
>  /*
>   * The zone indicated has a new number of managed_pages; batch sizes and percpu
>   * page high values need to be recalulated.
> @@ -8527,7 +8534,6 @@ void __meminit zone_pcp_update(struct zone *zone)
>  	__zone_pcp_update(zone);
>  	mutex_unlock(&pcp_batch_high_lock);
>  }
> -#endif
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs