linux-kernel - Re: [PATCH v3 1/2] mm/page_alloc: use ac->high_zoneidx for classzone

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200320103000.GB3039@MiWiFi-R3L-srv>
Date:   Fri, 20 Mar 2020 18:30:00 +0800
From:   Baoquan He <bhe@...hat.com>
To:     js1304@...il.com
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Minchan Kim <minchan@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>, kernel-team@....com,
        Ye Xiaolong <xiaolong.ye@...el.com>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v3 1/2] mm/page_alloc: use ac->high_zoneidx for
 classzone_idx


On 03/20/20 at 05:32pm, js1304@...il.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@....com>
> 
> Currently, we use the zone index of preferred_zone which represents
> the best matching zone for allocation, as classzone_idx. It has
> a problem on NUMA systems when the lowmem reserve protection exists
> for some zones on a node that do not exist on other nodes.
> 
> In NUMA system, it can be possible that each node has different populated
> zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> node 1 could have only NORMAL zone. In this setup, allocation request
> initiated on node 0 and the one on node 1 would have different
> classzone_idx, 3 and 2, respectively, since their preferred_zones are
> different. If the allocation is local, there is no problem. However,
> if it is handled by the remote node due to memory shortage, the problem
> would happen.

Hi Joonsoo,

Not sure if adding one sentence into above paragraph would be make it
easier to understand. Assume you are only talking about the high_zoneidx
is MOVABLE_ZONE with calculation of gfp_zone(gfp_mask), since any other
case doesn't have this problem. Please correct me if I am wrong.

In NUMA system, it can be possible that each node has different populated
zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
node 1 could have only NORMAL zone. In this setup, if we get high_zoneidx
as 3 (namely MOVABLE zone), with gfp_zone(gfp_mask), allocation request
initiated on node 0 and the one on node 1 would have different
classzone_idx, 3 and 2, respectively, since their preferred_zones are
different. If the allocation is local, there is no problem. However,
if it is handled by the remote node due to memory shortage, the problem
would happen.

Thanks
Baoquan

> 
> In the following setup, allocation initiated on node 1 will have some
> precedence than allocation initiated on node 0 when former allocation is
> processed on node 0 due to not enough memory on node 1. They will have
> different lowmem reserve due to their different classzone_idx thus
> an watermark bars are also different.
> 
> root@...ntu:/sys/devices/system/memory# cat /proc/zoneinfo
> Node 0, zone      DMA
>   per-node stats
> ...
>   pages free     3965
>         min      5
>         low      8
>         high     11
>         spanned  4095
>         present  3998
>         managed  3977
>         protection: (0, 2961, 4928, 5440)
> ...
> Node 0, zone    DMA32
>   pages free     757955
>         min      1129
>         low      1887
>         high     2645
>         spanned  1044480
>         present  782303
>         managed  758116
>         protection: (0, 0, 1967, 2479)
> ...
> Node 0, zone   Normal
>   pages free     459806
>         min      750
>         low      1253
>         high     1756
>         spanned  524288
>         present  524288
>         managed  503620
>         protection: (0, 0, 0, 4096)
> ...
> Node 0, zone  Movable
>   pages free     130759
>         min      195
>         low      326
>         high     457
>         spanned  1966079
>         present  131072
>         managed  131072
>         protection: (0, 0, 0, 0)
> ...
> Node 1, zone      DMA
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 1006, 1006)
> Node 1, zone    DMA32
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 1006, 1006)
> Node 1, zone   Normal
>   per-node stats
> ...
>   pages free     233277
>         min      383
>         low      640
>         high     897
>         spanned  262144
>         present  262144
>         managed  257744
>         protection: (0, 0, 0, 0)
> ...
> Node 1, zone  Movable
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  262144
>         present  0
>         managed  0
>         protection: (0, 0, 0, 0)
> 
> min watermark for NORMAL zone on node 0
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 0 = 750
> 
> This watermark difference could cause too many numa_miss allocation
> in some situation and then performance could be downgraded.
> 
> Recently, there was a regression report about this problem on CMA patches
> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> that problem is disappeared with this fix that uses high_zoneidx
> for classzone_idx.
> 
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> 
> Using high_zoneidx for classzone_idx is more consistent way than previous
> approach because system's memory layout doesn't affect anything to it.
> With this patch, both classzone_idx on above example will be 3 so will
> have the same min watermark.
> 
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 4096 = 4846
> 
> One could wonder if there is a side effect that allocation initiated on
> node 1 will use higher bar when allocation is handled on local since
> classzone_idx could be higher than before. It will not happen because
> the zone without managed page doesn't contributes lowmem_reserve at all.
> 
> Reported-by: Ye Xiaolong <xiaolong.ye@...el.com>
> Tested-by: Ye Xiaolong <xiaolong.ye@...el.com>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
> ---
>  mm/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c39c895..aebaa33 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -119,7 +119,7 @@ struct alloc_context {
>  	bool spread_dirty_pages;
>  };
>  
> -#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
> +#define ac_classzone_idx(ac) (ac->high_zoneidx)
>  
>  /*
>   * Locate the struct page for both the matching buddy in our
> -- 
> 2.7.4
> 
>