Message-ID: <1ca0d77f-7cf3-57d8-af23-169975b63b32@hisilicon.com>
Date: Mon, 16 Nov 2020 15:59:38 +0800
From: Shaokun Zhang <zhangshaokun@...ilicon.com>
To: Dave Hansen <dave.hansen@...el.com>,
<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>
CC: Yuqi Jin <jinyuqi@...wei.com>,
Rusty Russell <rusty@...tcorp.com.au>,
Andrew Morton <akpm@...ux-foundation.org>,
Juergen Gross <jgross@...e.com>,
Paul Burton <paul.burton@...s.com>,
Michal Hocko <mhocko@...e.com>,
"Michael Ellerman" <mpe@...erman.id.au>,
Mike Rapoport <rppt@...ux.ibm.com>,
"Anshuman Khandual" <anshuman.khandual@....com>
Subject: Re: [PATCH v6] lib: optimize cpumask_local_spread()
Hi Dave,
On 2020/11/14 0:02, Dave Hansen wrote:
> On 11/12/20 6:06 PM, Shaokun Zhang wrote:
>>>> On the Huawei Kunpeng 920 server, there are 4 NUMA nodes (0-3) in the
>>>> 2-CPU system (0-1). The topology of this server is as follows:
>>>
>>> This is with a feature enabled that Intel calls sub-NUMA-clustering
>>> (SNC), right? Explaining *that* feature would also be great context for
>>
>> Correct,
>>
>>> why this gets triggered on your system and not normally on others and
>>> why nobody noticed this until now.
>>
>> This is on the Intel 6248 platform:
>
> I have no idea what a "6248 platform" is.
>
My apologies; it's Cascade Lake [1].
>>>> +static void calc_node_distance(int *node_dist, int node)
>>>> +{
>>>> + int i;
>>>> +
>>>> + for (i = 0; i < nr_node_ids; i++)
>>>> + node_dist[i] = node_distance(node, i);
>>>> +}
>>>
>>> This appears to be the only place node_dist[] is written. That means it
>>> always contains a one-dimensional slice of the two-dimensional data
>>> represented by node_distance().
>>>
>>> Why is a copy of this data needed?
>>
>> It is used to store the distances from @node for later use; apologies, I
>> can't quite follow your question.
>
> Right, the data that you store is useful. *But*, it's also a verbatim
> copy of the data from node_distance(). Why not just use node_distance()
> directly in your code rather than creating a partial copy of it in the
> local node_dist[] array?

Ok, I will remove this redundant function in the next version.
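For illustration, a minimal sketch of the helper with the copy removed,
querying node_distance() directly; the signature differs from the posted
patch (it takes the local node instead of the cached array), so treat this
as an assumption about the next revision rather than the patch itself:

#include <linux/limits.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/*
 * Return the unused node closest to @node, or NUMA_NO_NODE once
 * every node has been consumed. node_distance() is queried
 * directly instead of being cached in a node_dist[] copy.
 */
static int find_nearest_node(int node, const bool *used)
{
	int i, min_dist = INT_MAX, best = NUMA_NO_NODE;

	for (i = 0; i < nr_node_ids; i++) {
		/* Skip nodes already consumed in earlier rounds. */
		if (used[i])
			continue;
		if (node_distance(node, i) < min_dist) {
			min_dist = node_distance(node, i);
			best = i;
		}
	}
	return best;
}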
>
>
>>>> unsigned int cpumask_local_spread(unsigned int i, int node)
>>>> {
>>>> - int cpu, hk_flags;
>>>> + static DEFINE_SPINLOCK(spread_lock);
>>>> + static int node_dist[MAX_NUMNODES];
>>>> + static bool used[MAX_NUMNODES];
>>>
>>> Not to be *too* picky, but there is a reason we declare nodemask_t as a
>>> bitmap and not an array of bools. Isn't this just wasteful?
>>>
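For reference, a minimal sketch of the bitmap alternative being hinted at,
tracking visited nodes in a nodemask_t (one bit per node) via the accessors
from <linux/nodemask.h>; the helper names are illustrative, not from the
patch:

#include <linux/nodemask.h>

static nodemask_t used_nodes;

/* Reset the mask at the start of each spread pass. */
static void reset_used_nodes(void)
{
	nodes_clear(used_nodes);
}

/* Mark a node as consumed by an earlier round. */
static void mark_node_used(int id)
{
	node_set(id, used_nodes);
}

/* Test whether a node was already consumed. */
static bool node_already_used(int id)
{
	return node_isset(id, used_nodes);
}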
>>>> + unsigned long flags;
>>>> + int cpu, hk_flags, j, id;
>>>> const struct cpumask *mask;
>>>>
>>>> hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>> @@ -220,20 +256,28 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>> return cpu;
>>>> }
>>>> } else {
>>>> - /* NUMA first. */
>>>> - for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>> - if (i-- == 0)
>>>> - return cpu;
>>>> - }
>>>> + spin_lock_irqsave(&spread_lock, flags);
>>>> + memset(used, 0, nr_node_ids * sizeof(bool));
>>>> + calc_node_distance(node_dist, node);
>>>> + /* Local node first then the nearest node is used */
>>>
>>> Is this comment really correct? This makes it sound like there is only
>>
>> I think it is correct; that's what we want: to choose the nearest node.
>>
>>> fallback to a single node. Doesn't the _code_ fall back basically
>>> without limit?
>>
>> If I follow your question correctly: without this patch, if the local
>> node is used up, one random node will be chosen, right? Now we first
>> choose the nearest node by distance; if all nodes have been chosen, it
>> falls back to the initial solution.
>
> The comment makes it sound like the code does:
> 1. Do the local node
> 2. Do the next nearest node
> 3. Stop
>
That's clearer; I will update the comment in the new patch.
> In reality, I *think* it's more of a loop where it searches
> ever-increasing distances away from the local node.
>
> I just think the comment needs to be made more precise.
Got it.
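To make that concrete, a hedged sketch of the walk with the comment made
precise; find_nearest_node() keeps the patch's signature but is not shown
in this excerpt, and the spread_lock handling is omitted for brevity:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static int find_nearest_node(int *node_dist, bool *used);

/*
 * Visit nodes in order of increasing distance from the local
 * node: the local node first, then the next-nearest remaining
 * node, and so on, until the i-th CPU is found or all nodes
 * are exhausted.
 */
static int spread_to_nearest(unsigned int i, int *node_dist, bool *used,
			     const struct cpumask *mask)
{
	int j, id, cpu;

	for (j = 0; j < nr_node_ids; j++) {
		/* Closest node not yet consumed, or a negative value. */
		id = find_nearest_node(node_dist, used);
		if (id < 0)
			break;

		for_each_cpu_and(cpu, cpumask_of_node(id), mask)
			if (i-- == 0)
				return cpu;
		used[id] = true;
	}

	return -EINVAL;	/* caller falls back to the old path */
}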
>
>>>> + for (j = 0; j < nr_node_ids; j++) {
>>>> + id = find_nearest_node(node_dist, used);
>>>> + if (id < 0)
>>>> + break;
>>>>
>>>> - for_each_cpu(cpu, mask) {
>>>> - /* Skip NUMA nodes, done above. */
>>>> - if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>>> - continue;
>>>> + for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>>> + if (i-- == 0) {
>>>> + spin_unlock_irqrestore(&spread_lock,
>>>> + flags);
>>>> + return cpu;
>>>> + }
>>>> + used[id] = 1;
>>>> + }
>>>> + spin_unlock_irqrestore(&spread_lock, flags);
>>>
>>> The existing code was pretty sparsely commented. This looks to me to
>>> make it more complicated and *less* commented. Not the best combo.
>>
>> Apologies for the bad comments; hopefully the explanation above
>> describes it clearly.
>
> Do you want to take another pass at submitting this patch?
'Another pass'? Sorry for my poor understanding; I don't quite follow what you mean.
Thanks,
Shaokun
[1] https://en.wikichip.org/wiki/intel/xeon_gold/6248