Message-ID: <87bldud6nj.fsf@yhuang-dev.intel.com>
Date: Tue, 12 Jan 2021 14:13:36 +0800
From: "Huang\, Ying" <ying.huang@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Mel Gorman <mgorman@...e.de>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"Ingo Molnar" <mingo@...hat.com>, Rik van Riel <riel@...riel.com>,
Johannes Weiner <hannes@...xchg.org>,
"Matthew Wilcox \(Oracle\)" <willy@...radead.org>,
"Dave Hansen" <dave.hansen@...el.com>,
Andi Kleen <ak@...ux.intel.com>,
"Michal Hocko" <mhocko@...e.com>,
David Rientjes <rientjes@...gle.com>,
<linux-api@...r.kernel.org>
Subject: Re: [PATCH -V8 1/3] numa balancing: Migrate on fault among multiple bound nodes
Hi, Peter,
Huang Ying <ying.huang@...el.com> writes:
> Currently, NUMA balancing can optimize the page placement among the
> NUMA nodes only if the default memory policy is used, because an
> explicitly specified memory policy should take precedence. But this
> seems too strict in some situations. For example, on a system with 4
> NUMA nodes, if the memory of an application is bound to nodes 0 and 1,
> NUMA balancing can potentially migrate the pages between nodes 0 and 1
> to reduce cross-node accesses without breaking the explicit memory
> binding policy.
>
> So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
> set_mempolicy() for use when the mode is MPOL_BIND. With the flag
> specified, NUMA balancing will be enabled within the thread to
> optimize the page placement within the constraints of the specified
> memory binding policy. With the newly added flag, the NUMA balancing
> control mechanism becomes:
>
> - The sysctl knob numa_balancing can enable/disable NUMA balancing
>   globally.
>
> - Even if sysctl numa_balancing is enabled, NUMA balancing is
>   disabled by default for memory areas or applications with an
>   explicit memory policy.
>
> - MPOL_F_NUMA_BALANCING can be used to enable NUMA balancing for
>   applications when specifying an explicit memory policy (MPOL_BIND).
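
The three rules above can be read as one predicate. The following is only a
minimal sketch of that reading, not code from the patch: the struct
task_numa_cfg and the function numa_balancing_applies() are hypothetical names
introduced for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the three-level control described above.
 * None of these names come from the kernel source.
 */
struct task_numa_cfg {
	bool sysctl_numa_balancing; /* global sysctl knob is enabled */
	bool explicit_policy;       /* an explicit memory policy is set */
	bool is_bind;               /* the explicit policy mode is MPOL_BIND */
	bool f_numa_balancing;      /* MPOL_F_NUMA_BALANCING was passed */
};

/* Returns true if NUMA balancing would act on this task's memory. */
static bool numa_balancing_applies(const struct task_numa_cfg *c)
{
	if (!c->sysctl_numa_balancing)
		return false;            /* rule 1: globally disabled */
	if (!c->explicit_policy)
		return true;             /* default policy: balanced as usual */
	/* rules 2 and 3: explicit policy opts in only via MPOL_BIND + flag */
	return c->is_bind && c->f_numa_balancing;
}
```

For example, a task with the sysctl enabled and a plain MPOL_BIND policy is
left alone by this model, while the same task with the flag set is balanced.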
>
> Various page placement optimizations based on NUMA balancing can be
> done with these flags. As the first step, in this patch, if the
> memory of the application is bound to multiple nodes (MPOL_BIND), and
> in the hint page fault handler the accessing node is in the policy
> nodemask, an attempt will be made to migrate the page to the accessing
> node to reduce cross-node accesses.
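
The check described above can be sketched as follows. This is a simplified
model under stated assumptions, not the patch itself: the nodemask is assumed
to fit in a 64-bit bitmap (bit n stands for node n), and node_in_mask() and
hint_fault_target() are hypothetical helpers. The real fault path applies
further checks before a migration actually happens.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: test whether 'node' is set in a 64-bit nodemask. */
static bool node_in_mask(uint64_t nodemask, int node)
{
	return node >= 0 && node < 64 && ((nodemask >> node) & 1u);
}

/*
 * Hypothetical model of the hint-fault decision for an MPOL_BIND policy
 * with MPOL_F_NUMA_BALANCING set. Returns the node to try migrating the
 * page to, or -1 if the page should stay where it is.
 */
static int hint_fault_target(uint64_t bind_mask, int page_node,
			     int accessing_node)
{
	if (page_node == accessing_node)
		return -1;               /* already local */
	if (!node_in_mask(bind_mask, accessing_node))
		return -1;               /* would violate the binding */
	return accessing_node;           /* migration may be attempted */
}
```

With a binding to nodes 1 and 3 (mask 0xA), a fault from node 3 on a page
residing on node 1 yields 3, while a fault from node 2 yields -1.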
>
> If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
> application on an old kernel version without its support,
> set_mempolicy() will return -1 and errno will be set to EINVAL. The
> application can use this behavior to run on both old and new kernel
> versions.
>
> And if the MPOL_F_NUMA_BALANCING flag is specified for a mode other
> than MPOL_BIND, set_mempolicy() will return -1 and errno will be set
> to EINVAL as before, because we don't support optimization based on
> NUMA balancing for these modes.
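
The two EINVAL cases above (old kernel, or a mode other than MPOL_BIND) can be
modelled as below. This is only an illustrative sketch: the MODEL_* constants
and model_check_mode() are hypothetical and deliberately do not reuse the
kernel's names or values.

```c
#include <errno.h>

/* Hypothetical stand-ins for policy modes; values are illustrative only. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE
};

/* Hypothetical stand-in for the new mode flag. */
#define MODEL_F_NUMA_BALANCING 0x1000

/*
 * Returns 0 if the mode/flag combination is accepted, -EINVAL otherwise.
 * 'kernel_supports_flag' models running on a kernel with or without
 * support for the new flag.
 */
static int model_check_mode(int mode_and_flags, int kernel_supports_flag)
{
	int flag = mode_and_flags & MODEL_F_NUMA_BALANCING;
	int mode = mode_and_flags & ~MODEL_F_NUMA_BALANCING;

	if (!flag)
		return 0;                /* nothing new to validate */
	if (!kernel_supports_flag)
		return -EINVAL;          /* old kernel: unknown flag */
	if (mode != MODEL_BIND)
		return -EINVAL;          /* flag only valid with bind mode */
	return 0;
}
```

An application that sees the error on the flagged call can retry without the
flag, which is the fallback the test program in this message relies on.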
>
> In the previous version of the patch, we tried to reuse MPOL_MF_LAZY
> for mbind(). But that flag is tied to MPOL_MF_MOVE.*, so it does not
> seem to be a good API/ABI for the purpose of the patch.
>
> And because it's not clear whether it's necessary to enable NUMA
> balancing for a specific memory area inside an application, we only
> add the flag at the thread level (set_mempolicy()) instead of the
> memory area level (mbind()). We can do that when it becomes necessary.
>
> To test the patch, we run a test case as follows on a 4-node machine
> with 192 GB memory (48 GB per node).
>
> 1. Change the pmbench memory accessing benchmark to call
>    set_mempolicy() to bind its memory to nodes 1 and 3 and enable NUMA
>    balancing. Some related code snippets are as follows:
>
>     #include <errno.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <numaif.h>
>     #include <numa.h>
>
>     struct bitmask *bmp;
>     int ret;
>
>     bmp = numa_parse_nodestring("1,3");
>     ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
>                         bmp->maskp, bmp->size + 1);
>     /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
>     if (ret < 0 && errno == EINVAL)
>             ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
>     if (ret < 0) {
>             perror("Failed to call set_mempolicy");
>             exit(-1);
>     }
>
> 2. Run a memory eater on node 3 to use 40 GB memory before running pmbench.
>
> 3. Run pmbench with 64 processes. The working-set size of each process
>    is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB. The
>    CPU and the memory (as in step 1) of all pmbench processes are bound
>    to nodes 1 and 3. So, after CPU usage is balanced, some pmbench
>    processes running on the CPUs of node 3 will access the memory of
>    node 1.
>
> 4. After the pmbench processes run for 100 seconds, kill the memory
>    eater. Now it's possible for some pmbench processes to migrate
>    their pages from node 1 to node 3 to reduce cross-node accessing.
>
> Test results show that, with the patch, the pages can be migrated from
> node 1 to node 3 after killing the memory eater, and the pmbench score
> can increase by about 17.5%.
>
> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
> Acked-by: Mel Gorman <mgorman@...e.de>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Rik van Riel <riel@...riel.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: "Matthew Wilcox (Oracle)" <willy@...radead.org>
> Cc: Dave Hansen <dave.hansen@...el.com>
> Cc: Andi Kleen <ak@...ux.intel.com>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: linux-api@...r.kernel.org
It seems that Andrew has no objection to this patch. Is it possible for
you to merge it through your tree?
Best Regards,
Huang, Ying