[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20231208060539.GA5744@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Date: Thu, 7 Dec 2023 22:05:39 -0800
From: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com>
To: Yury Norov <yury.norov@...il.com>
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, longli@...rosoft.com,
leon@...nel.org, cai.huoqing@...ux.dev, ssengar@...ux.microsoft.com,
vkuznets@...hat.com, tglx@...utronix.de,
linux-hyperv@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-rdma@...r.kernel.org,
sch^Crabarti@...rosoft.com, paulros@...rosoft.com
Subject: Re: [PATCH V4 net-next] net: mana: Assigning IRQ affinity on HT cores
On Tue, Dec 05, 2023 at 02:52:06PM -0800, Yury Norov wrote:
> On Tue, Dec 05, 2023 at 03:01:38AM -0800, Souradeep Chakrabarti wrote:
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 6367de0c2c2e..2194a53cce10 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -1243,15 +1243,57 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> > > > r->size = 0;
> > > > }
> > > >
> > > > +static int irq_setup(int *irqs, int nvec, int start_numa_node)
> > > > +{
> > > > + int i = 0, cpu, err = 0;
> > > > + const struct cpumask *node_cpumask;
> > > > + unsigned int next_node = start_numa_node;
> > > > + cpumask_var_t visited_cpus, node_cpumask_temp;
> > > > +
> > > > + if (!zalloc_cpumask_var(&visited_cpus, GFP_KERNEL)) {
> > > > + err = ENOMEM;
> > > > + return err;
> > > > + }
> > > > + if (!zalloc_cpumask_var(&node_cpumask_temp, GFP_KERNEL)) {
> > > > + err = -ENOMEM;
> > > > + return err;
> > > > + }
> > >
> > > Can you add a bit more of vertical spacing?
> > >
> > > > + rcu_read_lock();
> > > > + for_each_numa_hop_mask(node_cpumask, next_node) {
> > > > + cpumask_copy(node_cpumask_temp, node_cpumask);
> > > > + for_each_cpu(cpu, node_cpumask_temp) {
> > > > + cpumask_andnot(node_cpumask_temp, node_cpumask_temp,
> > > > + topology_sibling_cpumask(cpu));
> > > > + irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu));
> > > > + if (++i == nvec)
> > > > + goto free_mask;
> > > > + cpumask_set_cpu(cpu, visited_cpus);
> > > > + if (cpumask_empty(node_cpumask_temp)) {
> > > > + cpumask_copy(node_cpumask_temp, node_cpumask);
> > > > + cpumask_andnot(node_cpumask_temp, node_cpumask_temp,
> > > > + visited_cpus);
> > > > + cpu = 0;
> > > > + }
> > >
> > > It feels like you can calculate number of sibling groups in a hop in
> > > advance, so that you'll know how many IRQs you want to assign per each
> > > hop, and avoid resetting the node_cpumask_temp and spinning in inner
> > > loop for more than once...
> > >
> > > Can you print your topology, and describe how you want to spread IRQs
> > > on it, and how your existing code does spread them?
> > >
> > The topology of one system is
> > > numactl -H
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> > 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> > 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> > node 0 size: 459521 MB
> > node 0 free: 456316 MB
> > node 1 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
> > 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
> > 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
> > 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
> > node 1 size: 459617 MB
> > node 1 free: 456864 MB
> > node distances:
> > node 0 1
> > 0: 10 21
> > 1: 21 10
> > and I want to spread the IRQs in numa0 node first with
> > CPU0 - IRQ0
> > CPU2 - IRQ1
> > CPU4 - IRQ2
> > CPU6 - IRQ3
> > ---
> > ---
> > ---
> > CPU94 - IRQ47
> > then
> > CPU1 - IRQ48
> > CPU3 - IRQ49
> > CPU32 - IRQ64
> >
> > In a topology where NUMA0 has 20 cores and NUMA1 has 20 cores, with total 80 CPUS, there I want
> > CPU0 - IRQ0
> > CPU2 - IRQ1
> > CPU4 - IRQ2
> > ---
> > ---
> > ---
> > CPU38 - IRQ19
> > Then
> > CPU1 - IRQ20
> > CPU3 - IRQ21
> > ---
> > ---
> > CPU39 - IRQ39
> > Node1
> > CPU40 - IRQ40
> > CPU42 - IRQ41
> > CPU44 - IRQ42
> > ---
> > CPU78 - IRQ58
> > CPU41 - IRQ59
> > CPU43 - IRQ60
> > ---
> > ---
> > CPU49 - IRQ64
> >
> >
> > Exisitng code :
> > https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/microsoft/mana/gdma_main.c#L1246
> >
> > This uses cpumask_local_spread, so in a system where node has 64 cores, it spreads all 64+1 IRQs on
> > 33 cores, rather than spreading it only on HT cores.
>
> So from what you said, it looks like you're trying to implement the
> following heuristics:
>
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority;
>
> Can you confirm that?
>
> If the above correct, your code is quite close to what you want except
> that for every new hop (outer loop) you have to clear CPUs belonging to
> previous hop, which is in you case the same as visited_cpus mask.
>
> But I think you can do it even better if just account the number of
> assigned IRQs. That way you can get rid of the most of housekeeping
> code.
>
> const struct cpumask *next, *prev = cpu_none_mask;
>
> for_each_numa_hop_mask(next, node) {
> cpumask_and_not(curr, next, prev);
>
> for (w = cpumask_weight(curr), cnt = 0; cnt < w; cnt++)
> cpumask_copy(cpus, curr);
> for_each_cpu(cpu, cpus) {
> if (i++ == nvec)
> goto done;
>
> cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> irq_set_affinity_and_hint(irqs[i], topology_sibling_cpumask(cpu)); // [*]
> }
> }
> prev = next;
> }
I am experimenting with the proposal here, and based on the result will
update the V5 patch.
>
> [*] I already mentioned that in v3, and also asking here: if you're saying
> that wrt IRQs distribution, all CPUs belonging to the same sibling group
> are the same, why don't you assign all the group to the IRQ. It gives the
> system flexibility to balance workload better.
>
> Let's consider this topology:
>
> Node 0 1
> Core 0 1 2 3
> CPU 0 1 2 3 4 5 6 7
>
> The code above should create the following mapping for the IRQs:
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
> 2 1 0 0-1
> 3 1 1 2-3
> 4 2 2 4-5
> 5 2 3 6-7
> 6 2 2 4-5
> 7 2 3 6-7
>
> This is pretty close to what I proposed in v3, except that it flips
> priorities of NUMA locality vs sibling dislocality. My original
> suggestion is simpler in implementation and aligns with my natural
> feeling of 'fair' IRQ distribution.
>
> Can you make sure that your heuristics are the best wrt performance?
>
> Regarding the rest of the discussion, I think that for_each_numa_hop_mask()
> together with some basic cpumaks operations results quite a readable
> and maintainable code, and we don't need any more generic API to
> support this type of distribution tasks.
>
> What do you think guys?
>
> Thanks,
> Yury
Powered by blists - more mailing lists