[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<SN6PR02MB4157AA971E41FE92B1878F9DD491A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Wed, 14 May 2025 17:26:45 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Yury Norov <yury.norov@...il.com>
CC: Shradha Gupta <shradhagupta@...ux.microsoft.com>, Dexuan Cui
<decui@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, Haiyang Zhang
<haiyangz@...rosoft.com>, "K. Y. Srinivasan" <kys@...rosoft.com>, Andrew Lunn
<andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, Eric
Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, Konstantin Taranov <kotaranov@...rosoft.com>, Simon
Horman <horms@...nel.org>, Leon Romanovsky <leon@...nel.org>, Maxim Levitsky
<mlevitsk@...hat.com>, Erni Sri Satya Vennela <ernis@...ux.microsoft.com>,
Peter Zijlstra <peterz@...radead.org>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-pci@...r.kernel.org"
<linux-pci@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, Nipun Gupta <nipun.gupta@....com>, Jason
Gunthorpe <jgg@...pe.ca>, Jonathan Cameron <Jonathan.Cameron@...ei.com>,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Kevin Tian
<kevin.tian@...el.com>, Long Li <longli@...rosoft.com>, Thomas Gleixner
<tglx@...utronix.de>, Bjorn Helgaas <bhelgaas@...gle.com>, Rob Herring
<robh@...nel.org>, Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>,
Krzysztof Wilczy�~Dski <kw@...ux.com>, Lorenzo
Pieralisi <lpieralisi@...nel.org>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>, "linux-rdma@...r.kernel.org"
<linux-rdma@...r.kernel.org>, Paul Rosswurm <paulros@...rosoft.com>, Shradha
Gupta <shradhagupta@...rosoft.com>
Subject: RE: [PATCH v3 3/4] net: mana: Allow irq_setup() to skip cpus for
affinity
From: Yury Norov <yury.norov@...il.com> Sent: Wednesday, May 14, 2025 9:55 AM
>
> On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > + bool skip_first_cpu)
> > > {
> > > const struct cpumask *next, *prev = cpu_none_mask;
> > > cpumask_var_t cpus __free(free_cpumask_var);
> > > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > while (weight > 0) {
> > > cpumask_andnot(cpus, next, prev);
> > > for_each_cpu(cpu, cpus) {
> > > + /*
> > > + * if the CPU sibling set is to be skipped we
> > > + * just move on to the next CPUs without len--
> > > + */
> > > + if (unlikely(skip_first_cpu)) {
> > > + skip_first_cpu = false;
> > > + goto next_cpumask;
> > > + }
> > > +
> > > if (len-- == 0)
> > > goto done;
> > > +
> > > irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > +next_cpumask:
> > > cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > > --weight;
> > > }
> >
> > With a little bit of reordering of the code, you could avoid the need for the "next_cpumask"
> > label and goto statement. "continue" is usually cleaner than a "goto". Here's what I'm thinking:
> >
> > for_each_cpu(cpu, cpus) {
> > cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > --weight;
>
> cpumask_andnot() is O(N), and before it was conditional on 'len == 0',
> so we didn't do that on the very last step. Your version has to do that.
> Don't know how important that is for real workloads. Shradha maybe can
> measure it...
Yes, there's one extra invocation of cpumask_andnot(). But if the
VM has a small number of CPUs, that extra invocation is negligible.
If the VM has a large number of CPUs, we're already executing
cpumask_andnot() many times, so one extra time is also negligible.
And this whole thing is done only at device initialization time, so
it's not a hot path.
>
> >
> > If (unlikely(skip_first_cpu)) {
> > skip_first_cpu = false;
> > continue;
> > }
> >
> > If (len-- == 0)
> > goto done;
> >
> > irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > }
> >
> > I wish there were some comments in irq_setup() explaining the overall intention of
> > the algorithm. I can see how the goal is to first assign CPUs that are local to the current
> > NUMA node, and then expand outward to CPUs that are further away. And you want
> > to *not* assign both siblings in a hyper-threaded core.
>
> I wrote this function, so let me step in.
>
> The intention is described in the corresponding commit message:
>
> Souradeep investigated that the driver performs faster if IRQs are
> spread on CPUs with the following heuristics:
>
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority.
>
> Let's consider this topology:
>
> Node 0 1
> Core 0 1 2 3
> CPU 0 1 2 3 4 5 6 7
>
> The most performant IRQ distribution based on the above topology
> and heuristics may look like this:
>
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
> 2 1 0 0-1
> 3 1 1 2-3
> 4 2 2 4-5
> 5 2 3 6-7
> 6 2 2 4-5
> 7 2 3 6-7
>
> > But I can't figure out what
> > "weight" is trying to accomplish. Maybe this was discussed when the code first
> > went in, but I can't remember now. :-(
>
> The weight here is to implement the heuristic discovered by Souradeep:
> NUMA locality is preferred over sibling dislocality.
>
> The outer for_each() loop resets the weight to the actual number of
> CPUs in the hop. Then inner for_each() loop decrements it by the
> number of sibling groups (cores) while assigning first IRQ to each
> group.
>
> Now, because NUMA locality is more important, we should walk the
> same set of siblings and assign 2nd IRQ, and it's implemented by the
> medium while() loop. So, we do like this unless the number of IRQs
> assigned on this hop will not become equal to number of CPUs in the
> hop (weight == 0). Then we switch to the next hop and do the same
> thing.
>
> Hope that helps.
Yes, that helps! So the key to understanding "weight" is that
NUMA locality is preferred over sibling dislocality.
This is a great summary! All or most of it should go as a
comment describing the function and what it is trying to do.
Michael
Powered by blists - more mailing lists