[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87h77ebn6j.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 31 Mar 2022 15:23:16 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Jagdish Gediya <jvgediya@...ux.ibm.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, baolin.wang@...ux.alibaba.com,
dave.hansen@...ux.intel.com, Fan Du <fan.du@...el.com>
Subject: Re: [PATCH] mm: migrate: set demotion targets differently
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com> writes:
> "Huang, Ying" <ying.huang@...el.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com> writes:
>>
>>> "Huang, Ying" <ying.huang@...el.com> writes:
>>>
>>>> Hi, Jagdish,
>>>>
>>>> Jagdish Gediya <jvgediya@...ux.ibm.com> writes:
>>>>
>>>
>>> ...
>>>
>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>
>>>>> available: 5 nodes (0-4)
>>>>> node 0 cpus: 0 1
>>>>> node 0 size: n MB
>>>>> node 0 free: n MB
>>>>> node 1 cpus: 2 3
>>>>> node 1 size: n MB
>>>>> node 1 free: n MB
>>>>> node 2 cpus:
>>>>> node 2 size: n MB
>>>>> node 2 free: n MB
>>>>> node 3 cpus:
>>>>> node 3 size: n MB
>>>>> node 3 free: n MB
>>>>> node 4 cpus:
>>>>> node 4 size: n MB
>>>>> node 4 free: n MB
>>>>> node distances:
>>>>> node 0 1 2 3 4
>>>>> 0: 10 20 40 40 80
>>>>> 1: 20 10 40 40 80
>>>>> 2: 40 40 10 40 80
>>>>> 3: 40 40 40 10 80
>>>>> 4: 80 80 80 80 10
>>>>>
>>>>> The existing implementation gives below demotion targets,
>>>>>
>>>>> node demotion_target
>>>>> 0 3, 2
>>>>> 1 4
>>>>> 2 X
>>>>> 3 X
>>>>> 4 X
>>>>>
>>>>> With this patch applied, below are the demotion targets,
>>>>>
>>>>> node demotion_target
>>>>> 0 3, 2
>>>>> 1 3, 2
>>>>> 2 3
>>>>> 3 4
>>>>> 4 X
>>>>
>>>> For such machine, I think the perfect demotion order is,
>>>>
>>>> node demotion_target
>>>> 0 2, 3
>>>> 1 2, 3
>>>> 2 4
>>>> 3 4
>>>> 4 X
>>>
>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>> differentiate between them, firmware can present a single NUMA node
>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>> that there is some difference between these two memory devices. This is
>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>
>> Do you have more information about this?
>
> Not sure I follow the question there. I was checking shouldn't firmware
> do a single NUMA node if two memory devices are of the same type? How will
> optane present such a config? Both the DIMMs will have the same
> proximity domain value and hence dax kmem will add them to the same NUMA
> node?
Sorry for confusing. I just wanted to check whether you have more
information about the machine configuration above. The machines in my
hand have no complex NUMA topology as in the patch description.
> If you are suggesting that firmware doesn't do that, then I agree with you
> that a demotion target like the below is good.
>
> node demotion_target
> 0 2, 3
> 1 2, 3
> 2 4
> 3 4
> 4 X
>
> We can also achieve that with a smiple change as below.
Glad to see the demotion order can be implemented in a simple way.
My concern is that is it necessary to do this? If there are real
machines with the NUMA topology, then I think it's good to add the
support. But if not, why do we make the code complex unnecessarily?
I don't have these kind of machines, do you have and will have?
> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void)
> {
> nodemask_t next_pass = NODE_MASK_NONE;
> nodemask_t this_pass = NODE_MASK_NONE;
> - nodemask_t used_targets = NODE_MASK_NONE;
> + nodemask_t this_pass_used_targets = NODE_MASK_NONE;
> int node, best_distance;
>
> /*
> @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void)
> /*
> * To avoid cycles in the migration "graph", ensure
> * that migration sources are not future targets by
> - * setting them in 'used_targets'. Do this only
> + * setting them in 'this_pass_used_targets'. Do this only
> * once per pass so that multiple source nodes can
> * share a target node.
> *
> - * 'used_targets' will become unavailable in future
> + * 'this_pass_used_targets' will become unavailable in future
> * passes. This limits some opportunities for
> * multiple source nodes to share a destination.
> */
> - nodes_or(used_targets, used_targets, this_pass);
> + nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass);
>
> for_each_node_mask(node, this_pass) {
> +
> + nodemask_t used_targets = this_pass_used_targets;
> +
> best_distance = -1;
>
> /*
Best Regards,
Huang, Ying
Powered by blists - more mailing lists