[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAmzW4PVguD0jEEAt0ZUPqMykswuq2an21jRpP2ZuwRL6PYwCw@mail.gmail.com>
Date: Mon, 23 Mar 2020 12:50:38 +0900
From: Joonsoo Kim <js1304@...il.com>
To: Baoquan He <bhe@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linux Memory Management List <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Minchan Kim <minchan@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>, kernel-team@....com,
Ye Xiaolong <xiaolong.ye@...el.com>,
David Rientjes <rientjes@...gle.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v3 1/2] mm/page_alloc: use ac->high_zoneidx for classzone_idx
Hello, Baoquan.
2020년 3월 20일 (금) 오후 7:30, Baoquan He <bhe@...hat.com>님이 작성:
>
>
> On 03/20/20 at 05:32pm, js1304@...il.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@....com>
> >
> > Currently, we use the zone index of preferred_zone which represents
> > the best matching zone for allocation, as classzone_idx. It has
> > a problem on NUMA systems when the lowmem reserve protection exists
> > for some zones on a node that do not exist on other nodes.
> >
> > In NUMA system, it can be possible that each node has different populated
> > zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> > node 1 could have only NORMAL zone. In this setup, allocation request
> > initiated on node 0 and the one on node 1 would have different
> > classzone_idx, 3 and 2, respectively, since their preferred_zones are
> > different. If the allocation is local, there is no problem. However,
> > if it is handled by the remote node due to memory shortage, the problem
> > would happen.
>
> Hi Joonsoo,
>
> Not sure if adding one sentence into above paragraph would be make it
> easier to understand. Assume you are only talking about the high_zoneidx
> is MOVABLE_ZONE with calculation of gfp_zone(gfp_mask), since any other
> case doesn't have this problem. Please correct me if I am wrong.
You're right. This example is for the allocation request with
gfp_zone(gfp_mask),
MOVABLE_ZONE.
> In NUMA system, it can be possible that each node has different populated
> zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> node 1 could have only NORMAL zone. In this setup, if we get high_zoneidx
> as 3 (namely MOVABLE zone), with gfp_zone(gfp_mask), allocation request
> initiated on node 0 and the one on node 1 would have different
> classzone_idx, 3 and 2, respectively, since their preferred_zones are
> different. If the allocation is local, there is no problem. However,
> if it is handled by the remote node due to memory shortage, the problem
> would happen.
I'm okay with your change but I try again to be better. Please check the
following rewritten commit message and please let me know if it is better
than before.
Thanks.
------------------------>8-------------------------------
>From 3842d99d54c02c25cc6f6077fd0e97acabc10917 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@....com>
Date: Fri, 4 May 2018 09:55:21 +0900
Subject: [PATCH] mm/page_alloc: use ac->high_zoneidx for classzone_idx
Currently, we use classzone_idx to calculate lowmem reserve proetection
for an allocation request. This classzone_idx causes a problem
on NUMA systems when the lowmem reserve protection exists for some zones
on a node that do not exist on other nodes.
Before further explanation, I should first clarify how to compute
the classzone_idx and the high_zoneidx.
- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
represents the index of the highest zone the allocation can use
- classzone_idx was supposed to be the index of the highest zone on
the local node that the allocation can use, that is actually available
in the system
Think about following example. Node 0 has 4 populated zone,
DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some zones,
such as MOVABLE, doesn't exist on node 1 and this makes following
difference.
Assume that there is an allocation request whose gfp_zone(gfp_mask) is
the zone, MOVABLE. Then, it's high_zoneidx is 3. If this allocation is
initiated on node 0, it's classzone_idx is 3 since actually
available/usable zone on local (node 0) is MOVABLE. If this allocation
is initiated on node 1, it's classzone_idx is 2 since actually
available/usable zone on local (node 1) is NORMAL.
You can see that classzone_idx of the allocation request are different
according to their starting node, even if their high_zoneidx is the same.
Think more about these two allocation requests. If they are processed
on local, there is no problem. However, if allocation is initiated
on node 1 are processed on remote, in this example, at the NORMAL zone
on node 0, due to memory shortage, problem occurs. Their different
classzone_idx leads to different lowmem reserve and then different
min watermark. See the following example.
root@...ntu:/sys/devices/system/memory# cat /proc/zoneinfo
Node 0, zone DMA
per-node stats
...
pages free 3965
min 5
low 8
high 11
spanned 4095
present 3998
managed 3977
protection: (0, 2961, 4928, 5440)
...
Node 0, zone DMA32
pages free 757955
min 1129
low 1887
high 2645
spanned 1044480
present 782303
managed 758116
protection: (0, 0, 1967, 2479)
...
Node 0, zone Normal
pages free 459806
min 750
low 1253
high 1756
spanned 524288
present 524288
managed 503620
protection: (0, 0, 0, 4096)
...
Node 0, zone Movable
pages free 130759
min 195
low 326
high 457
spanned 1966079
present 131072
managed 131072
protection: (0, 0, 0, 0)
...
Node 1, zone DMA
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 1006, 1006)
Node 1, zone DMA32
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 1006, 1006)
Node 1, zone Normal
per-node stats
...
pages free 233277
min 383
low 640
high 897
spanned 262144
present 262144
managed 257744
protection: (0, 0, 0, 0)
...
Node 1, zone Movable
pages free 0
min 0
low 0
high 0
spanned 262144
present 0
managed 0
protection: (0, 0, 0, 0)
- static min watermark for the NORMAL zone on node 0 is 750.
- lowmem reserve for the request with classzone idx 3 at the NORMAL
on node 0 is 4096.
- lowmem reserve for the request with classzone idx 2 at the NORMAL
on node 0 is 0.
So, overall min watermark is:
allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
allocation initiated on node 1 will have some precedence than allocation
initiated on node 0 because min watermark of the former allocation is
lower than the other. So, allocation initiated on node 1 could succeed
on node 0 when allocation initiated on node 0 could not, and, this could
cause too many numa_miss allocation. Then, performance could be
downgraded.
Recently, there was a regression report about this problem on CMA patches
since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
that problem is disappeared with this fix that uses high_zoneidx
for classzone_idx.
http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
Using high_zoneidx for classzone_idx is more consistent way than previous
approach because system's memory layout doesn't affect anything to it.
With this patch, both classzone_idx on above example will be 3 so will
have the same min watermark.
allocation initiated on node 0: 750 + 4096 = 4846
allocation initiated on node 1: 750 + 4096 = 4846
One could wonder if there is a side effect that allocation initiated on
node 1 will use higher bar when allocation is handled on local since
classzone_idx could be higher than before. It will not happen because
the zone without managed page doesn't contributes lowmem_reserve at all.
Reported-by: Ye Xiaolong <xiaolong.ye@...el.com>
Tested-by: Ye Xiaolong <xiaolong.ye@...el.com>
Acked-by: Vlastimil Babka <vbabka@...e.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
---
mm/internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/internal.h b/mm/internal.h
index c39c895..aebaa33 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -119,7 +119,7 @@ struct alloc_context {
bool spread_dirty_pages;
};
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+#define ac_classzone_idx(ac) (ac->high_zoneidx)
/*
* Locate the struct page for both the matching buddy in our
--
2.7.4
Powered by blists - more mailing lists