Date:   Mon, 30 Nov 2020 09:08:38 -0800
From:   Dave Hansen <dave.hansen@...el.com>
To:     Shaokun Zhang <zhangshaokun@...ilicon.com>,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Cc:     Yuqi Jin <jinyuqi@...wei.com>,
        Rusty Russell <rusty@...tcorp.com.au>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Juergen Gross <jgross@...e.com>,
        Paul Burton <paul.burton@...s.com>,
        Michal Hocko <mhocko@...e.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v7] lib: optimize cpumask_local_spread()

>>>  {
>>> -	int cpu, hk_flags;
>>> +	static DEFINE_SPINLOCK(spread_lock);
>>> +	static bool used[MAX_NUMNODES];
>>
>> I thought I mentioned this last time.  How large is this array?  How
>> large would it be if it were a nodemask_t?  Would this be less code if
> 
> Apologies that I forgot to do it.
> 
>> you just dynamically allocated and freed the node mask instead of having
>> a spinlock and a memset?
> 
> Ok, but I think the spinlock is also needed; am I missing something?

There was no spinlock there before your patch.  You just need it to
protect the structures you declared static.  If you didn't have static
structures, you wouldn't need a lock.
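Completely untested sketch of what I mean (find_nearest_node() would need
to learn to skip nodes set in the nodemask, and the function name and
error returns here are made up purely for illustration):

	#include <linux/cpumask.h>
	#include <linux/nodemask.h>
	#include <linux/slab.h>

	/*
	 * Per-call allocation of the "nodes already visited" state instead
	 * of a static used[] array, so no lock is needed.  GFP_KERNEL is
	 * assumed to be fine since this runs at device init time.
	 */
	static int spread_pick_cpu(unsigned int i, int node,
				   const struct cpumask *mask)
	{
		nodemask_t *used;
		int cpu, j, id;

		used = kzalloc(sizeof(*used), GFP_KERNEL); /* nothing visited yet */
		if (!used)
			return -ENOMEM;

		for (j = 0; j < nr_node_ids; j++) {
			id = find_nearest_node(node, used); /* nodemask variant */
			if (id < 0)
				break;
			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
				if (i-- == 0) {
					kfree(used);
					return cpu;
				}
			node_set(id, *used); /* mark this node as visited */
		}

		kfree(used);
		return -EINVAL;
	}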

>>> +	unsigned long flags;
>>> +	int cpu, hk_flags, j, id;
>>>  	const struct cpumask *mask;
>>>  
>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>  				return cpu;
>>>  		}
>>>  	} else {
>>> -		/* NUMA first. */
>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>> -			if (i-- == 0)
>>> -				return cpu;
>>> +		spin_lock_irqsave(&spread_lock, flags);
>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>> +		/* select node according to the distance from local node */
>>> +		for (j = 0; j < nr_node_ids; j++) {
>>> +			id = find_nearest_node(node, used);
>>> +			if (id < 0)
>>> +				break;
>>
>> There's presumably an outer loop in a driver which is trying to bind a
>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>> of dozens of these interrupts.
>>
>> 	for_each_interrupt() // in the driver
>> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
>> 			// find_nearest_node():
>> 			for (i = 0; i < nr_node_ids; i++) {
>> 			for (i = 0; i < nr_node_ids; i++) {
>>
>> Does this worry anybody else?  I thought our upper limit on the number
>> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
>> worst case is hundreds of millions of loops?
> 
> If a real system has 1024 NUMA nodes, it is all the more worthwhile to
> find the nearest node rather than choose a random one.  And this is only
> called during I/O device initialization; comments documenting that are
> also added to the interface.

This doesn't really make me feel better.  An end user booting this on a
big system with a bunch of cards could see a minutes-long delay.  I can
also see funky stuff happening like if we have a ton of NUMA nodes and
few CPUs.

>> I don't want to prematurely optimize this, but that seems like something
>> that might just fall over on bigger systems.
>>
>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>  Each of those will be found via find_nearest_node(), but then this loop:
> 
> Got it; all the effort goes into choosing the nearest node for
> performance.  If we don't do it, I think someone will end up having to
> debug this in the future.

If we're going to kick the can down the road for some poor sod to debug,
can we at least help them out with a warning?

Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.
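Something like this inside your node-selection loop, for instance (just a
sketch, threshold picked arbitrarily):

	for (j = 0; j < nr_node_ids; j++) {
		id = find_nearest_node(node, used);
		if (id < 0)
			break;
		/*
		 * Complain once if we had to fall back past a couple of
		 * remote nodes before finding a CPU, so whoever hits the
		 * pathological case gets a hint in dmesg.
		 */
		WARN_ONCE(j > 2,
			  "cpumask_local_spread: no CPU within %d nodes of node %d\n",
			  j, node);
		/* ... rest of the loop as in your patch ... */
	}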

But, I still don't think you've addressed my main concern: This is
horrifically inefficient searching for CPUs inside nodes that are known
to have no CPUs.

>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>> +				if (i-- == 0) {
>>> +					spin_unlock_irqrestore(&spread_lock,
>>> +							       flags);
>>> +					return cpu;
>>> +				}
>>> +			used[id] = true;
>>>  		}
>>
>> Will just exit immediately because cpumask_of_node() is empty.
> 
> Yes, and used[id] for that node becomes true.
> 
>>
>> 'used', for instance, should start by setting 'true' for all nodes which
>> are not in N_CPUS.
> 
> No, because I used 'nr_node_ids', which covers all possible node IDs,
> for the check.

I'm saying that it's wasteful to loop over and search in all the nodes.
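What I have in mind is roughly this, reusing your 'used' array (helper
name invented, untested):

	#include <linux/nodemask.h>

	/*
	 * Mark every node without CPUs as already "used" up front, so
	 * find_nearest_node() never hands back a node whose
	 * cpumask_of_node() is empty.
	 */
	static void mark_cpuless_nodes(bool *used)
	{
		int nid;

		for (nid = 0; nid < nr_node_ids; nid++)
			if (!node_state(nid, N_CPU)) /* memory-only node */
				used[nid] = true;
	}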
