Message-ID: <b3122c82-e0fc-5bb8-82ec-43ae785f381f@intel.com>
Date:   Mon, 30 Nov 2020 09:08:38 -0800
From:   Dave Hansen <dave.hansen@...el.com>
To:     Shaokun Zhang <zhangshaokun@...ilicon.com>,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Cc:     Yuqi Jin <jinyuqi@...wei.com>,
        Rusty Russell <rusty@...tcorp.com.au>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Juergen Gross <jgross@...e.com>,
        Paul Burton <paul.burton@...s.com>,
        Michal Hocko <mhocko@...e.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v7] lib: optimize cpumask_local_spread()

>>>  {
>>> -	int cpu, hk_flags;
>>> +	static DEFINE_SPINLOCK(spread_lock);
>>> +	static bool used[MAX_NUMNODES];
>>
>> I thought I mentioned this last time.  How large is this array?  How
>> large would it be if it were a nodemask_t?  Would this be less code if
> 
> Apologies that I forgot to do it.
> 
>> you just dynamically allocated and freed the node mask instead of having
>> a spinlock and a memset?
> 
> Ok, but I think the spinlock is also needed; am I missing something?

There was no spinlock there before your patch.  You just need it to
protect the structures you declared static.  If you didn't have static
structures, you wouldn't need a lock.
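To illustrate the suggestion: a per-call allocation gives every caller its
own mask, so nothing is shared and no lock is needed.  A user-space sketch
(the `nodemask` type, `nodemask_alloc()`, and `MAX_NUMNODES` here are
stand-ins for the kernel's nodemask_t machinery, not its actual API):

```c
#include <stdlib.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's MAX_NUMNODES. */
#define MAX_NUMNODES 1024

/* A bitmap "nodemask": one bit per node instead of one bool per node.
 * 1024 bools cost 1024 bytes; this bitmap costs 128 bytes. */
typedef struct {
	unsigned long bits[MAX_NUMNODES / (sizeof(unsigned long) * CHAR_BIT)];
} nodemask;

static nodemask *nodemask_alloc(void)
{
	/* Allocated (zeroed) per call: each caller gets a private mask,
	 * so no static data and therefore no spinlock is required. */
	return calloc(1, sizeof(nodemask));
}

static void node_set(int node, nodemask *m)
{
	m->bits[node / (sizeof(unsigned long) * CHAR_BIT)] |=
		1UL << (node % (sizeof(unsigned long) * CHAR_BIT));
}

static int node_isset(int node, const nodemask *m)
{
	return !!(m->bits[node / (sizeof(unsigned long) * CHAR_BIT)] &
		  (1UL << (node % (sizeof(unsigned long) * CHAR_BIT))));
}
```

The trade is a small allocation per call against the static array, the
memset of nr_node_ids bools, and the lock round-trip on every invocation.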

>>> +	unsigned long flags;
>>> +	int cpu, hk_flags, j, id;
>>>  	const struct cpumask *mask;
>>>  
>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>  				return cpu;
>>>  		}
>>>  	} else {
>>> -		/* NUMA first. */
>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>> -			if (i-- == 0)
>>> -				return cpu;
>>> +		spin_lock_irqsave(&spread_lock, flags);
>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>> +		/* select node according to the distance from local node */
>>> +		for (j = 0; j < nr_node_ids; j++) {
>>> +			id = find_nearest_node(node, used);
>>> +			if (id < 0)
>>> +				break;
>>
>> There's presumably an outer loop in a driver which is trying to bind a
>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>> of dozens of these interrupts.
>>
>> 	for_each_interrupt() // in the driver
>> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
>> 			// find_nearest_node():
>> 			for (i = 0; i < nr_node_ids; i++) {
>> 			for (i = 0; i < nr_node_ids; i++) {
>>
>> Does this worry anybody else?  I thought our upper limit on the number
>> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3), where the
>> worst case is hundreds of millions of iterations?
> 
> If the number of NUMA nodes is 1024 in a real system, it is all the more
> worthwhile to find the nearest node rather than choose a random one, and
> this is only called during I/O device initialization.  Comments to that
> effect are given on this interface.

This doesn't really make me feel better.  An end user booting this on a
big system with a bunch of cards could see a minutes-long delay.  I can
also see funky behavior in odd configurations, for instance a ton of NUMA
nodes and very few CPUs.

>> I don't want to prematurely optimize this, but that seems like something
>> that might just fall over on bigger systems.
>>
>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>  Each of those will be found via find_nearest_node(), but then this loop:
> 
> Got it; all of that effort is spent choosing the nearest node for
> performance.  If we don't do it, I think someone will end up having to
> debug this in the future.

If we're going to kick the can down the road for some poor sod to debug,
can we at least help them out with a warning?

Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.

But, I still don't think you've addressed my main concern: This is
horrifically inefficient searching for CPUs inside nodes that are known
to have no CPUs.
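A sketch of what that pre-filtering could look like, in user-space C with a
toy topology (the `node_online_map`/`node_has_cpus` arrays and `init_used()`
are hypothetical stand-ins; in the kernel this would consult the N_CPU
nodemask in node_states[]):

```c
#include <stdbool.h>

#define MAX_NUMNODES 8	/* small toy topology for illustration */

/* Toy stand-ins for kernel node state: which nodes are online, and
 * which of those actually have CPUs. Nodes 1 and 3 are memory-only. */
static const bool node_online_map[MAX_NUMNODES] = {1, 1, 1, 1, 0, 0, 0, 0};
static const bool node_has_cpus[MAX_NUMNODES]  = {1, 0, 1, 0, 0, 0, 0, 0};

/* Pre-mark every node without CPUs as "used", so the nearest-node
 * search never spends a pass discovering an empty cpumask.
 * Returns the number of online, memory-only nodes excluded up front. */
static int init_used(bool *used, int nr_node_ids)
{
	int id, skipped = 0;

	for (id = 0; id < nr_node_ids; id++) {
		used[id] = !node_online_map[id] || !node_has_cpus[id];
		if (node_online_map[id] && !node_has_cpus[id])
			skipped++;
	}
	return skipped;
}
```

With memory-only nodes excluded before the loop, the search only ever
visits nodes that can actually satisfy the request, which also gives a
natural place to hang the WARN_ONCE() suggested above.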

>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>> +				if (i-- == 0) {
>>> +					spin_unlock_irqrestore(&spread_lock,
>>> +							       flags);
>>> +					return cpu;
>>> +				}
>>> +			used[id] = true;
>>>  		}
>>
>> Will just exit immediately because cpumask_of_node() is empty.
> 
> Yes, and used[id] for that node becomes true.
> 
>>
>> 'used', for instance, should start by setting 'true' for all nodes which
>> are not in N_CPUS.
> 
> No, because I used 'nr_node_ids', which is the number of possible node
> ids, as the bound to check.

I'm saying that it's wasteful to loop over and search in all the nodes.
