linux-kernel - Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240110090931.GB5436@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Date: Wed, 10 Jan 2024 01:09:31 -0800
From: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Michael Kelley <mhklinux@...look.com>,
	"kys@...rosoft.com" <kys@...rosoft.com>,
	"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
	"wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"edumazet@...gle.com" <edumazet@...gle.com>,
	"kuba@...nel.org" <kuba@...nel.org>,
	"pabeni@...hat.com" <pabeni@...hat.com>,
	"longli@...rosoft.com" <longli@...rosoft.com>,
	"leon@...nel.org" <leon@...nel.org>,
	"cai.huoqing@...ux.dev" <cai.huoqing@...ux.dev>,
	"ssengar@...ux.microsoft.com" <ssengar@...ux.microsoft.com>,
	"vkuznets@...hat.com" <vkuznets@...hat.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
	"schakrabarti@...rosoft.com" <schakrabarti@...rosoft.com>,
	"paulros@...rosoft.com" <paulros@...rosoft.com>
Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs
 per CPUs

On Tue, Jan 09, 2024 at 03:28:59PM -0800, Yury Norov wrote:
> Hi Michael,
> 
> So, I'm just a guy who helped to formulate the heuristics in an
> itemized form, and implement them using the existing kernel API.
> I have no access to MANA machines and I ran no performance tests
> myself.
> 
> On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com> Sent: Tuesday, January 9, 2024 2:51 AM
> > > 
> > > From: Yury Norov <yury.norov@...il.com>
> > > 
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > > 
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > > 
> > > Let's consider this topology:
> > > 
> > > Node            0               1
> > > Core        0       1       2       3
> > > CPU       0   1   2   3   4   5   6   7
> > > 
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > > 
> > > IRQ     Nodes   Cores   CPUs
> > > 0       1       0       0-1
> > > 1       1       1       2-3
> > > 2       1       0       0-1
> > > 3       1       1       2-3
> > > 4       2       2       4-5
> > > 5       2       3       6-7
> > > 6       2       2       4-5
> > > 7       2       3       6-7
> > 
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it.  I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration.  At least that's
> > what the commit message for Patch 4/4 in this series says.
> 
> Yes, and the original distribution suggested by Souradeep looks very
> similar:
> 
>   IRQ     Nodes   Cores   CPUs
>   0       1       0       0
>   1       1       1       2
>   2       1       0       1
>   3       1       1       3
>   4       2       2       4
>   5       2       3       6
>   6       2       2       5
>   7       2       3       7
> 
> I just added a bit more flexibility, so that kernel may pick any
> sibling for the IRQ. As I understand, both approaches have similar
> performance. Probably my fine-tune added another half-percent...
> 
> Souradeep, can you please share the exact numbers on this?
> 
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU.   At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
> 
> That's what I think, if the topology makes us to put IRQs in the
> same sibling group, the best thing we can to is to rely on existing
> balancing mechanisms in a hope that they will do their job well.
> 
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice.  If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution.
> 
> Yeah... This example illustrates the order of IRQ distribution.
> I really doubt that if we distribute IRQs like in the above example,
> there would be any difference in performance. But I think it's quite
> a good illustration. I could write the title for the table like this:
> 
>         The order of IRQ distribution for the best performance
>         based on [...] may look like this.
> 
> > But even with a better example
> > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > I don't think the code below accomplishes the original intent.
> > 
> > Maybe I've missed something along the way in getting to this
> > version of the patch.  Please feel free to set me straight. :-)
> 
> Hmm. So if the number of IRQs is the half # of CPUs in the nodes,
> which is 2 in the example above, the distribution will look like
> this:
> 
>   IRQ     Nodes   Cores   CPUs
>   0       1       0       0-1
>   1       1       1       2-3
> 
> And each IRQ belongs to a different sibling group. This follows
> the rules above.
> 
> I think of it like we assign an IRQ to a group of 2 CPUs, so from
> the heuristic #1 perspective, each CPU is assigned with 1/2 of the
> IRQ.
> 
> If I add one more IRQ, then according to the heuristics, NUMA locality
> trumps sibling dislocality, so we'd assign IRO to the same node on any
> core. My algorithm assigns it to the core #0:
> 
>   2       1       0       0-1
> 
> This doubles # of IRQs for the CPUs 0 and 1: from 1/2 to 1.
> 
> The next IRQ should be assigned to the same node again, and we've got
> the only choice:
> 
> 
>   3       1       1       2-3
> 
> Starting from IRQ #5, the node #1 is full - each CPU is assigned with
> exactly one IRQ, and the heuristic #1 makes us to switch to the other
> node; and then do the same thing:
> 
>   4       2       2       4-5
>   5       2       3       6-7
>   6       2       2       4-5
>   7       2       3       6-7
> 
> So I think the algorithm is correct... Really hope the above makes
> sense. :) If so, I can add it to the commit message for patch #3.
> 
> Nevertheless... Souradeep, in addition to the performance numbers, can
> you share your topology and actual IRQ distribution that gains 15%? I
> think it should be added to the patch #4 commit message.
Sure I will add my topology in #4 commit message. Thanks for the suggestion.
> 
> Thanks,
> Yury