lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200323035843.GH2987@MiWiFi-R3L-srv>
Date:   Mon, 23 Mar 2020 11:58:43 +0800
From:   Baoquan He <bhe@...hat.com>
To:     Joonsoo Kim <js1304@...il.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Linux Memory Management List <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Minchan Kim <minchan@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mel Gorman <mgorman@...hsingularity.net>, kernel-team@....com,
        Ye Xiaolong <xiaolong.ye@...el.com>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v3 1/2] mm/page_alloc: use ac->high_zoneidx for
 classzone_idx

On 03/23/20 at 12:50pm, Joonsoo Kim wrote:
> Hello, Baoquan.
> 
> 2020년 3월 20일 (금) 오후 7:30, Baoquan He <bhe@...hat.com>님이 작성:
> >
> >
> > On 03/20/20 at 05:32pm, js1304@...il.com wrote:
> > > From: Joonsoo Kim <iamjoonsoo.kim@....com>
> > >
> > > Currently, we use the zone index of preferred_zone which represents
> > > the best matching zone for allocation, as classzone_idx. It has
> > > a problem on NUMA systems when the lowmem reserve protection exists
> > > for some zones on a node that do not exist on other nodes.
> > >
> > > In NUMA system, it can be possible that each node has different populated
> > > zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> > > node 1 could have only NORMAL zone. In this setup, allocation request
> > > initiated on node 0 and the one on node 1 would have different
> > > classzone_idx, 3 and 2, respectively, since their preferred_zones are
> > > different. If the allocation is local, there is no problem. However,
> > > if it is handled by the remote node due to memory shortage, the problem
> > > would happen.
> >
> > Hi Joonsoo,
> >
> > Not sure if adding one sentence into above paragraph would be make it
> > easier to understand. Assume you are only talking about the high_zoneidx
> > is MOVABLE_ZONE with calculation of gfp_zone(gfp_mask), since any other
> > case doesn't have this problem. Please correct me if I am wrong.
> 
> You're right. This example is for the allocation request with
> gfp_zone(gfp_mask),
> MOVABLE_ZONE.
> 
> > In NUMA system, it can be possible that each node has different populated
> > zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> > node 1 could have only NORMAL zone. In this setup, if we get high_zoneidx
> > as 3 (namely MOVABLE zone), with gfp_zone(gfp_mask), allocation request
> > initiated on node 0 and the one on node 1 would have different
> > classzone_idx, 3 and 2, respectively, since their preferred_zones are
> > different. If the allocation is local, there is no problem. However,
> > if it is handled by the remote node due to memory shortage, the problem
> > would happen.
> 
> I'm okay with your change but I try again to be better. Please check the
> following rewritten commit message and please let me know if it is better
> than before.

Yeah, this new one looks very detailed, I believe anyone interested can
get what's going on. Thanks for doing this.

> 
> Thanks.
> 
> ------------------------>8-------------------------------
> From 3842d99d54c02c25cc6f6077fd0e97acabc10917 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@....com>
> Date: Fri, 4 May 2018 09:55:21 +0900
> Subject: [PATCH] mm/page_alloc: use ac->high_zoneidx for classzone_idx
> 
> Currently, we use classzone_idx to calculate lowmem reserve proetection
> for an allocation request. This classzone_idx causes a problem
> on NUMA systems when the lowmem reserve protection exists for some zones
> on a node that do not exist on other nodes.
> 
> Before further explanation, I should first clarify how to compute
> the classzone_idx and the high_zoneidx.
> 
> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> represents the index of the highest zone the allocation can use
> - classzone_idx was supposed to be the index of the highest zone on
> the local node that the allocation can use, that is actually available
> in the system
> 
> Think about following example. Node 0 has 4 populated zone,
> DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some zones,
> such as MOVABLE, doesn't exist on node 1 and this makes following
> difference.
> 
> Assume that there is an allocation request whose gfp_zone(gfp_mask) is
> the zone, MOVABLE. Then, it's high_zoneidx is 3. If this allocation is
> initiated on node 0, it's classzone_idx is 3 since actually
> available/usable zone on local (node 0) is MOVABLE. If this allocation
> is initiated on node 1, it's classzone_idx is 2 since actually
> available/usable zone on local (node 1) is NORMAL.
> 
> You can see that classzone_idx of the allocation request are different
> according to their starting node, even if their high_zoneidx is the same.
> 
> Think more about these two allocation requests. If they are processed
> on local, there is no problem. However, if allocation is initiated
> on node 1 are processed on remote, in this example, at the NORMAL zone
> on node 0, due to memory shortage, problem occurs. Their different
> classzone_idx leads to different lowmem reserve and then different
> min watermark. See the following example.
> 
> root@...ntu:/sys/devices/system/memory# cat /proc/zoneinfo
> Node 0, zone      DMA
>   per-node stats
> ...
>   pages free     3965
>         min      5
>         low      8
>         high     11
>         spanned  4095
>         present  3998
>         managed  3977
>         protection: (0, 2961, 4928, 5440)
> ...
> Node 0, zone    DMA32
>   pages free     757955
>         min      1129
>         low      1887
>         high     2645
>         spanned  1044480
>         present  782303
>         managed  758116
>         protection: (0, 0, 1967, 2479)
> ...
> Node 0, zone   Normal
>   pages free     459806
>         min      750
>         low      1253
>         high     1756
>         spanned  524288
>         present  524288
>         managed  503620
>         protection: (0, 0, 0, 4096)
> ...
> Node 0, zone  Movable
>   pages free     130759
>         min      195
>         low      326
>         high     457
>         spanned  1966079
>         present  131072
>         managed  131072
>         protection: (0, 0, 0, 0)
> ...
> Node 1, zone      DMA
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 1006, 1006)
> Node 1, zone    DMA32
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 1006, 1006)
> Node 1, zone   Normal
>   per-node stats
> ...
>   pages free     233277
>         min      383
>         low      640
>         high     897
>         spanned  262144
>         present  262144
>         managed  257744
>         protection: (0, 0, 0, 0)
> ...
> Node 1, zone  Movable
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  262144
>         present  0
>         managed  0
>         protection: (0, 0, 0, 0)
> 
> - static min watermark for the NORMAL zone on node 0 is 750.
> - lowmem reserve for the request with classzone idx 3 at the NORMAL
> on node 0 is 4096.
> - lowmem reserve for the request with classzone idx 2 at the NORMAL
> on node 0 is 0.
> 
> So, overall min watermark is:
> allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
> allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
> 
> allocation initiated on node 1 will have some precedence than allocation
> initiated on node 0 because min watermark of the former allocation is
> lower than the other. So, allocation initiated on node 1 could succeed
> on node 0 when allocation initiated on node 0 could not, and, this could
> cause too many numa_miss allocation. Then, performance could be
> downgraded.
> 
> Recently, there was a regression report about this problem on CMA patches
> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> that problem is disappeared with this fix that uses high_zoneidx
> for classzone_idx.
> 
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> 
> Using high_zoneidx for classzone_idx is more consistent way than previous
> approach because system's memory layout doesn't affect anything to it.
> With this patch, both classzone_idx on above example will be 3 so will
> have the same min watermark.
> 
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 4096 = 4846
> 
> One could wonder if there is a side effect that allocation initiated on
> node 1 will use higher bar when allocation is handled on local since
> classzone_idx could be higher than before. It will not happen because
> the zone without managed page doesn't contributes lowmem_reserve at all.
> 
> Reported-by: Ye Xiaolong <xiaolong.ye@...el.com>
> Tested-by: Ye Xiaolong <xiaolong.ye@...el.com>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
> ---
>  mm/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c39c895..aebaa33 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -119,7 +119,7 @@ struct alloc_context {
>         bool spread_dirty_pages;
>  };
> 
> -#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
> +#define ac_classzone_idx(ac) (ac->high_zoneidx)
> 
>  /*
>   * Locate the struct page for both the matching buddy in our
> -- 
> 2.7.4
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ