Message-ID: <CAAmzW4PZr6QiO=6VcM_Nbf4079awHBLULAm+_A_-2mCxrzOO2g@mail.gmail.com>
Date: Thu, 19 Mar 2020 17:57:58 +0900
From: Joonsoo Kim <js1304@...il.com>
To: David Rientjes <rientjes@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linux Memory Management List <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Minchan Kim <minchan@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>, kernel-team@....com,
Ye Xiaolong <xiaolong.ye@...el.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v2 1/2] mm/page_alloc: use ac->high_zoneidx for classzone_idx
On Thu, Mar 19, 2020 at 6:29 AM David Rientjes <rientjes@...gle.com> wrote:
>
> On Wed, 18 Mar 2020, js1304@...il.com wrote:
>
> > From: Joonsoo Kim <iamjoonsoo.kim@....com>
> >
> > Currently, we use the zone index of preferred_zone, which represents the
> > best matching zone for allocation, as classzone_idx. This has a problem
> > on NUMA systems with ZONE_MOVABLE.
> >
>
> Hi Joonsoo,
Hello, David.
> More specifically, it has a problem on NUMA systems when the lowmem
> reserve protection exists for some zones on a node that do not exist on
> other nodes, right?
Right.
> In other words, to make sure I understand correctly, if your node 1 had a
> ZONE_MOVABLE then this would not have happened. If that's true, it might
> be helpful to call out that ZONE_MOVABLE itself is not necessarily a
> problem, but a system where one node has ZONE_NORMAL and ZONE_MOVABLE and
> another only has ZONE_NORMAL is the problem.
Okay. I will try to re-write the commit message as you suggested.
> > On a NUMA system, each node can have different populated zones. For
> > example, node 0 could have the DMA/DMA32/NORMAL/MOVABLE zones and
> > node 1 could have only the NORMAL zone. In this setup, an allocation
> > request initiated on node 0 and one initiated on node 1 would have different
> > classzone_idx, 3 and 2, respectively, since their preferred_zones are
> > different. If they are handled by only their own node, there is no problem.
>
> I'd say "If the allocation is local" rather than "If they are handled by
> only their own node".
I will replace it with yours. Thanks for correcting.
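To make the difference concrete for other readers, the derivation before and
after the patch is roughly the following (simplified; ac_classzone_idx() is
the existing helper in mm/internal.h, and the "after" line is conceptual
rather than the literal diff):

	/* Before: classzone_idx follows the best matching zone on the
	 * local node, so it is 3 (MOVABLE) on node 0 but 2 (NORMAL) on
	 * node 1 in the example below. */
	#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)

	/* After: classzone_idx follows the request itself, so it is 3
	 * regardless of which node initiated the allocation. */
	classzone_idx = ac->high_zoneidx;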
> > However, if they are sometimes handled by the remote node due to memory
> > shortage, the problem would happen.
> >
> > In the following setup, an allocation initiated on node 1 will take
> > precedence over an allocation initiated on node 0 when the former is
> > processed on node 0 due to insufficient memory on node 1. They will
> > have different lowmem reserves due to their different classzone_idx,
> > so their watermark bars are also different.
> >
> > root@...ntu:/sys/devices/system/memory# cat /proc/zoneinfo
> > Node 0, zone DMA
> > per-node stats
> > ...
> > pages free 3965
> > min 5
> > low 8
> > high 11
> > spanned 4095
> > present 3998
> > managed 3977
> > protection: (0, 2961, 4928, 5440)
> > ...
> > Node 0, zone DMA32
> > pages free 757955
> > min 1129
> > low 1887
> > high 2645
> > spanned 1044480
> > present 782303
> > managed 758116
> > protection: (0, 0, 1967, 2479)
> > ...
> > Node 0, zone Normal
> > pages free 459806
> > min 750
> > low 1253
> > high 1756
> > spanned 524288
> > present 524288
> > managed 503620
> > protection: (0, 0, 0, 4096)
> > ...
> > Node 0, zone Movable
> > pages free 130759
> > min 195
> > low 326
> > high 457
> > spanned 1966079
> > present 131072
> > managed 131072
> > protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone DMA
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 0
> > present 0
> > managed 0
> > protection: (0, 0, 1006, 1006)
> > Node 1, zone DMA32
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 0
> > present 0
> > managed 0
> > protection: (0, 0, 1006, 1006)
> > Node 1, zone Normal
> > per-node stats
> > ...
> > pages free 233277
> > min 383
> > low 640
> > high 897
> > spanned 262144
> > present 262144
> > managed 257744
> > protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone Movable
> > pages free 0
> > min 0
> > low 0
> > high 0
> > spanned 262144
> > present 0
> > managed 0
> > protection: (0, 0, 0, 0)
> >
> > min watermark for NORMAL zone on node 0
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 0 = 750
> >
> > This watermark difference could cause too many numa_miss allocations
> > in some situations, and performance could then be degraded.
> >
> > Recently, there was a regression report about this problem against the
> > CMA patches, since CMA memory is placed in ZONE_MOVABLE by those patches.
> > I checked that the problem disappears with this fix that uses
> > high_zoneidx for classzone_idx.
> >
> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> >
> > Using high_zoneidx for classzone_idx is a more consistent approach than
> > the previous one because the system's memory layout doesn't affect it.
> > With this patch, both classzone_idx values in the above example will be
> > 3, so both allocations will have the same min watermark.
> >
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 4096 = 4846
> >
>
> Alternatively, I assume that this could also be fixed by changing the
> value of the lowmem protection on the node without managed pages in the
> upper zone to be the max protection from the lowest zones? In your
> example, node 1 ZONE_NORMAL would then be (0, 0, 0, 4096).
No, I don't think it can be fixed by that alternative. The watermark check
reads the lowmem_reserve of the zone actually being tried, so node 1's
protection values are never consulted once the allocation falls back to
node 0. The issue would only be gone if the lowmem_reserve of node 0's
ZONE_NORMAL were (0, 0, 4096, 4096): then the min watermark for an
allocation initiated on node 1 (classzone_idx 2) and tried against node 0's
ZONE_NORMAL would become 750 + 4096.
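To illustrate why node 1's values never enter the picture, here is a minimal
userspace model of the watermark check (my own sketch, not kernel code; the
real logic lives in __zone_watermark_ok() in mm/page_alloc.c):

	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_NR_ZONES 4

	struct zone_model {
		long min;				/* min watermark */
		long lowmem_reserve[MAX_NR_ZONES];	/* protection array */
	};

	/* The reserve is read from the zone being tried (node 0's
	 * ZONE_NORMAL on fallback) but indexed by the *requester's*
	 * classzone_idx. */
	static bool watermark_ok(const struct zone_model *z,
				 int classzone_idx, long free_pages)
	{
		return free_pages > z->min + z->lowmem_reserve[classzone_idx];
	}

	int main(void)
	{
		/* Node 0's ZONE_NORMAL, taken from the zoneinfo dump above. */
		const struct zone_model node0_normal = {
			.min = 750,
			.lowmem_reserve = { 0, 0, 0, 4096 },
		};

		/* classzone_idx 3: request from node 0; 2: from node 1. */
		printf("bar for node 0 request: %ld\n",
		       node0_normal.min + node0_normal.lowmem_reserve[3]); /* 4846 */
		printf("bar for node 1 request: %ld\n",
		       node0_normal.min + node0_normal.lowmem_reserve[2]); /* 750 */
		printf("node 1 request ok at 2000 free pages? %d\n",
		       watermark_ok(&node0_normal, 2, 2000));		   /* 1 */
		return 0;
	}

Note that node 1's own protection array never appears in the check.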
> > One could wonder if there is a side effect where an allocation initiated
> > on node 1 will face a higher bar when it is handled on node 1, since its
> > classzone_idx could be higher than before. That will not happen, because
> > a zone without managed pages doesn't contribute to lowmem_reserve at all.
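To spell out that last point for reviewers, here is a small userspace model
of how the protection values are built (my own sketch following the logic
of setup_per_zone_lowmem_reserve(); the ratios are assumed defaults, not
values read from this system):

	#include <stdio.h>

	#define MAX_NR_ZONES 4

	int main(void)
	{
		/* Node 1 from the dump: DMA, DMA32, NORMAL, MOVABLE. */
		unsigned long managed[MAX_NR_ZONES] = { 0, 0, 257744, 0 };
		unsigned long ratio[MAX_NR_ZONES]   = { 256, 256, 32, 32 };

		/* lowmem_reserve[j] of a zone is the managed pages of all
		 * zones above it, up to zone j, divided by its ratio. */
		for (int lower = 0; lower < MAX_NR_ZONES; lower++) {
			for (int j = lower + 1; j < MAX_NR_ZONES; j++) {
				unsigned long pages = 0;

				for (int k = lower + 1; k <= j; k++)
					pages += managed[k];
				printf("zone %d: lowmem_reserve[%d] = %lu\n",
				       lower, j, pages / ratio[lower]);
			}
		}
		return 0;
	}

This reproduces node 1's protection rows above (e.g. DMA gets
(0, 0, 1006, 1006)), and ZONE_MOVABLE's zero managed pages add nothing, so
raising classzone_idx to 3 doesn't raise the bar for allocations on node 1.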
> >
> > Reported-by: Ye Xiaolong <xiaolong.ye@...el.com>
> > Tested-by: Ye Xiaolong <xiaolong.ye@...el.com>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
>
> Curious: is this only an issue when vm.numa_zonelist_order is set to Node?
Do you mean "/proc/sys/vm/numa_zonelist_order"? It looks like it's gone now.
Thanks.