[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <543B7C03.1080002@redhat.com>
Date: Mon, 13 Oct 2014 03:15:15 -0400
From: Rik van Riel <riel@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: linux-kernel@...r.kernel.org, mgorman@...e.de, chegu_vinod@...com,
mingo@...nel.org, efault@....de, vincent.guittot@...aro.org
Subject: Re: [PATCH RFC 4/5] sched,numa: calculate node scores in complex
NUMA topologies
On 10/12/2014 10:53 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:29PM -0400, riel@...hat.com wrote:
>> From: Rik van Riel <riel@...hat.com>
>>
>> In order to do task placement on systems with complex NUMA topologies,
>> it is necessary to count the faults on nodes nearby the node that is
>> being examined for a potential move.
>>
>> In case of a system with a backplane interconnect, we are dealing with
>> groups of NUMA nodes; each of the nodes within a group is the same number
>> of hops away from nodes in other groups in the system. Optimal placement
>> on this topology is achieved by counting all nearby nodes equally. When
>> comparing nodes A and B at distance N, nearby nodes are those at distances
>> smaller than N from nodes A or B.
>>
>> Placement strategy on a system with a glueless mesh NUMA topology needs
>> to be different, because there are no natural groups of nodes determined
>> by the hardware. Instead, when dealing with two nodes A and B at distance
>> N, N >= 2, there will be intermediate nodes at distance < N from both nodes
>> A and B. Good placement can be achieved by right shifting the faults on
>> nearby nodes by the number of hops from the node being scored. In this
>> context, a nearby node is any node less than the maximum distance in the
>> system away from the node. Those nodes are skipped for efficiency reasons,
>> there is no real policy reason to do so.
>
>
>> +/* Handle placement on systems where not all nodes are directly connected. */
>> +static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
>> + int hoplimit, bool task)
>> +{
>> + unsigned long score = 0;
>> + int node;
>> +
>> + /*
>> + * All nodes are directly connected, and the same distance
>> + * from each other. No need for fancy placement algorithms.
>> + */
>> + if (sched_numa_topology_type == NUMA_DIRECT)
>> + return 0;
>> +
>> + for_each_online_node(node) {
>
>> + }
>> +
>> + return score;
>> +}
>
>> @@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
>> return 0;
>>
>> faults = task_faults(p, nid);
>> + faults += score_nearby_nodes(p, nid, hops, true);
>> +
>> return 1000 * faults / total_faults;
>> }
>
>> @@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
>> return 0;
>>
>> faults = group_faults(p, nid);
>> + faults += score_nearby_nodes(p, nid, hops, false);
>> +
>> return 1000 * faults / total_faults;
>> }
>
> So this makes {task,group}_weight() O(nr_nodes), and we call these
> function from O(nr_nodes) loops, giving a total of O(nr_nodes^2)
> computational complexity, right?
>
> Seems important to mention; I realize this is only for !DIRECT, but
> still, I bet the real large people (those same 512 nodes we had
> previous) would not really appreciate this.
>
If desired, we can set up a nodemask for each node that
covers only the nodes that are less than the maximum
distance away from each other.
Even on the very large systems, that is likely to be
less than a dozen nodes.
I am not sure whether to add that complexity now, or
whether that should be considered premature optimization :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists