Message-ID: <20250515044942.GA5681@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Date: Wed, 14 May 2025 21:49:42 -0700
From: Shradha Gupta <shradhagupta@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: Yury Norov <yury.norov@...il.com>, Dexuan Cui <decui@...rosoft.com>,
Wei Liu <wei.liu@...nel.org>,
Haiyang Zhang <haiyangz@...rosoft.com>,
"K. Y. Srinivasan" <kys@...rosoft.com>,
Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Konstantin Taranov <kotaranov@...rosoft.com>,
Simon Horman <horms@...nel.org>, Leon Romanovsky <leon@...nel.org>,
Maxim Levitsky <mlevitsk@...hat.com>,
Erni Sri Satya Vennela <ernis@...ux.microsoft.com>,
Peter Zijlstra <peterz@...radead.org>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Nipun Gupta <nipun.gupta@....com>, Jason Gunthorpe <jgg@...pe.ca>,
Jonathan Cameron <Jonathan.Cameron@...ei.com>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Kevin Tian <kevin.tian@...el.com>, Long Li <longli@...rosoft.com>,
Thomas Gleixner <tglx@...utronix.de>,
Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring <robh@...nel.org>,
Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>,
Krzysztof Wilczyński <kw@...ux.com>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
Paul Rosswurm <paulros@...rosoft.com>,
Shradha Gupta <shradhagupta@...rosoft.com>
Subject: Re: [PATCH v3 3/4] net: mana: Allow irq_setup() to skip cpus for
affinity
On Wed, May 14, 2025 at 05:26:45PM +0000, Michael Kelley wrote:
> From: Yury Norov <yury.norov@...il.com> Sent: Wednesday, May 14, 2025 9:55 AM
> >
> > On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > > > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > +		     bool skip_first_cpu)
> > > >  {
> > > >  	const struct cpumask *next, *prev = cpu_none_mask;
> > > >  	cpumask_var_t cpus __free(free_cpumask_var);
> > > > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > >  	while (weight > 0) {
> > > >  		cpumask_andnot(cpus, next, prev);
> > > >  		for_each_cpu(cpu, cpus) {
> > > > +			/*
> > > > +			 * if the CPU sibling set is to be skipped we
> > > > +			 * just move on to the next CPUs without len--
> > > > +			 */
> > > > +			if (unlikely(skip_first_cpu)) {
> > > > +				skip_first_cpu = false;
> > > > +				goto next_cpumask;
> > > > +			}
> > > > +
> > > >  			if (len-- == 0)
> > > >  				goto done;
> > > > +
> > > >  			irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > > +next_cpumask:
> > > >  			cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > > >  			--weight;
> > > >  		}
> > >
> > > With a little bit of reordering of the code, you could avoid the need for the "next_cpumask"
> > > label and goto statement. "continue" is usually cleaner than a "goto". Here's what I'm thinking:
> > >
> > > 	for_each_cpu(cpu, cpus) {
> > > 		cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > > 		--weight;
> >
> > cpumask_andnot() is O(N), and before it was conditional on 'len == 0',
> > so we didn't do that on the very last step. Your version has to do that.
> > Don't know how important that is for real workloads. Shradha maybe can
> > measure it...
>
> Yes, there's one extra invocation of cpumask_andnot(). But if the
> VM has a small number of CPUs, that extra invocation is negligible.
> If the VM has a large number of CPUs, we're already executing
> cpumask_andnot() many times, so one extra time is also negligible.
> And this whole thing is done only at device initialization time, so
> it's not a hot path.
>
Hi Michael, Yury,
That's right, the overhead is negligible; I tested it with some common
workloads. I will change this in the next version.
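For clarity, here is roughly what the reworked loop would look like (my
reading of Michael's suggestion, with comments added; not the final patch):

	for_each_cpu(cpu, cpus) {
		/* Claim this core's sibling set and account for it
		 * before deciding whether to assign an IRQ to it.
		 */
		cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
		--weight;

		/* Skip the first sibling set without consuming an IRQ */
		if (unlikely(skip_first_cpu)) {
			skip_first_cpu = false;
			continue;
		}

		if (len-- == 0)
			goto done;

		irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
	}

Shradha.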
> >
> > >
> > > 		if (unlikely(skip_first_cpu)) {
> > > 			skip_first_cpu = false;
> > > 			continue;
> > > 		}
> > >
> > > 		if (len-- == 0)
> > > 			goto done;
> > >
> > > 		irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > 	}
> > >
> > > I wish there were some comments in irq_setup() explaining the overall intention of
> > > the algorithm. I can see how the goal is to first assign CPUs that are local to the current
> > > NUMA node, and then expand outward to CPUs that are further away. And you want
> > > to *not* assign both siblings in a hyper-threaded core.
> >
> > I wrote this function, so let me step in.
> >
> > The intention is described in the corresponding commit message:
> >
> > Souradeep investigated that the driver performs faster if IRQs are
> > spread on CPUs with the following heuristics:
> >
> > 1. No more than one IRQ per CPU, if possible;
> > 2. NUMA locality is the second priority;
> > 3. Sibling dislocality is the last priority.
> >
> > Let's consider this topology:
> >
> >   Node         0               1
> >   Core     0       1       2       3
> >   CPU     0   1   2   3   4   5   6   7
> >
> > The most performant IRQ distribution based on the above topology
> > and heuristics may look like this:
> >
> >   IRQ     Nodes   Cores   CPUs
> >   0       1       0       0-1
> >   1       1       1       2-3
> >   2       1       0       0-1
> >   3       1       1       2-3
> >   4       2       2       4-5
> >   5       2       3       6-7
> >   6       2       2       4-5
> >   7       2       3       6-7
> >
> > > But I can't figure out what
> > > "weight" is trying to accomplish. Maybe this was discussed when the code first
> > > went in, but I can't remember now. :-(
> >
> > The weight here is to implement the heuristic discovered by Souradeep:
> > NUMA locality is preferred over sibling dislocality.
> >
> > The outer for_each() loop resets the weight to the actual number of
> > CPUs in the hop. Then the inner for_each() loop decrements it by the
> > number of sibling groups (cores) while assigning the first IRQ to each
> > group.
> >
> > Now, because NUMA locality is more important, we walk the same set of
> > siblings again and assign a 2nd IRQ to each core; that is what the
> > middle while() loop implements. We keep doing this until the number of
> > IRQs assigned on this hop equals the number of CPUs in the hop
> > (weight == 0). Then we switch to the next hop and do the same thing.
> >
> > Hope that helps.
>
> Yes, that helps! So the key to understanding "weight" is that
> NUMA locality is preferred over sibling dislocality.
>
> This is a great summary! All or most of it should go as a
> comment describing the function and what it is trying to do.
>
> Michael
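
P.S. For anyone following this thread in the archive: below is roughly how
the three loops Yury describes map onto the existing irq_setup(), before
this patch. It is a paraphrased sketch with explanatory comments added,
not the verbatim driver code:

	static int irq_setup(unsigned int *irqs, unsigned int len, int node)
	{
		const struct cpumask *next, *prev = cpu_none_mask;
		cpumask_var_t cpus __free(free_cpumask_var);
		int cpu, weight;

		if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
			return -ENOMEM;

		rcu_read_lock();
		/* Outer loop: walk NUMA hops outward from 'node', so that
		 * closer CPUs are always consumed before farther ones.
		 */
		for_each_numa_hop_mask(next, node) {
			weight = cpumask_weight_andnot(next, prev);

			/* Middle loop: revisit this hop until every CPU in
			 * it has an IRQ (weight == 0) -- NUMA locality is
			 * preferred over sibling dislocality.
			 */
			while (weight > 0) {
				cpumask_andnot(cpus, next, prev);

				/* Inner loop: assign one IRQ per sibling
				 * group (core), then drop the whole group
				 * from the working mask.
				 */
				for_each_cpu(cpu, cpus) {
					if (len-- == 0)
						goto done;
					irq_set_affinity_and_hint(*irqs++,
						topology_sibling_cpumask(cpu));
					cpumask_andnot(cpus, cpus,
						topology_sibling_cpumask(cpu));
					--weight;
				}
			}
			prev = next;
		}
	done:
		rcu_read_unlock();
		return 0;
	}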