linux-kernel - RE: [PATCH net-next] net: mana: Assigning IRQ affinity on HT cores

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <TY0P153MB0781F6392EFB41FE2F50EEE0CCBBA@TY0P153MB0781.APCP153.PROD.OUTLOOK.COM>
Date:   Tue, 21 Nov 2023 13:59:18 +0000
From:   Souradeep Chakrabarti <schakrabarti@...rosoft.com>
To:     Michael Kelley <mhklinux@...look.com>,
        Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com>,
        KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        Dexuan Cui <decui@...rosoft.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        Long Li <longli@...rosoft.com>,
        "leon@...nel.org" <leon@...nel.org>,
        "cai.huoqing@...ux.dev" <cai.huoqing@...ux.dev>,
        "ssengar@...ux.microsoft.com" <ssengar@...ux.microsoft.com>,
        "vkuznets@...hat.com" <vkuznets@...hat.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>
CC:     Paul Rosswurm <paulros@...rosoft.com>
Subject: RE: [PATCH net-next] net: mana: Assigning IRQ affinity on HT cores



>-----Original Message-----
>From: Michael Kelley <mhklinux@...look.com>
>Sent: Thursday, November 16, 2023 11:47 AM
>To: Michael Kelley <mhklinux@...look.com>; Souradeep Chakrabarti
><schakrabarti@...ux.microsoft.com>; KY Srinivasan <kys@...rosoft.com>;
>Haiyang Zhang <haiyangz@...rosoft.com>; wei.liu@...nel.org; Dexuan Cui
><decui@...rosoft.com>; davem@...emloft.net; edumazet@...gle.com;
>kuba@...nel.org; pabeni@...hat.com; Long Li <longli@...rosoft.com>;
>leon@...nel.org; cai.huoqing@...ux.dev; ssengar@...ux.microsoft.com;
>vkuznets@...hat.com; tglx@...utronix.de; linux-hyperv@...r.kernel.org;
>netdev@...r.kernel.org; linux-kernel@...r.kernel.org; linux-
>rdma@...r.kernel.org
>Cc: Souradeep Chakrabarti <schakrabarti@...rosoft.com>; Paul Rosswurm
><paulros@...rosoft.com>
>Subject: [EXTERNAL] RE: [PATCH net-next] net: mana: Assigning IRQ affinity on HT
>cores
>
>[Some people who received this message don't often get email from
>mhklinux@...look.com. Learn why this is important at
>https://aka.ms/LearnAboutSenderIdentification ]
>
>From: Michael Kelley <mhklinux@...look.com> Sent: Wednesday, November 15,
>2023 9:26 PM
>> >
>> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > index 6367de0c2c2e..839be819d46e 100644
>> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>> > @@ -1243,13 +1243,115 @@ void mana_gd_free_res_map(struct
>> > gdma_resource *r)
>> >     r->size = 0;
>> >  }
>> >
>> > +static void cpu_mask_set(cpumask_var_t *filter_mask, cpumask_var_t
>> > **filter_mask_list)
>> > +{
>> > +   unsigned int core_count = 0, cpu;
>> > +   cpumask_var_t *filter_mask_list_tmp;
>> > +
>> > +   BUG_ON(!filter_mask || !filter_mask_list);
>>
>> This check seems superfluous. The function is invoked from only one
>> call site in the code below, and that site call site can easily ensure
>> that it doesn't pass a NULL value for either parameter.
Fixed it in V2 patch.
>>
>> > +   filter_mask_list_tmp = *filter_mask_list;
>> > +   cpumask_copy(*filter_mask, cpu_online_mask);
>> > +   /* for each core create a cpumask lookup table,
>> > +    * which stores all the corresponding siblings
>> > +    */
>> > +   for_each_cpu(cpu, *filter_mask) {
>> > +
>>       BUG_ON(!alloc_cpumask_var(&filter_mask_list_tmp[core_count],
>> GFP_KERNEL));
>>
>> Don't panic if a memory allocation fails.  Must back out, clean up,
>> and return an error.
>>
>> > +           cpumask_or(filter_mask_list_tmp[core_count],
>filter_mask_list_tmp[core_count],
>> > +                      topology_sibling_cpumask(cpu));
>>
>> alloc_cpumask_var() does not zero out the returned cpumask.  So the
>> cpumask_or() is operating on uninitialized garbage.  Use
>> zalloc_cpumask_var() to get a cpumask initialized to zero.
>>
>> But why is the cpumask_or() being done at all?  Couldn't this just be
>> a cpumask_copy()?  In that case, alloc_cpumask_var() is OK.
>>
>> > +           cpumask_andnot(*filter_mask, *filter_mask,
>topology_sibling_cpumask(cpu));
>> > +           core_count++;
>> > +   }
>> > +}
>> > +
>> > +static int irq_setup(int *irqs, int nvec) {
>> > +   cpumask_var_t filter_mask;
>> > +   cpumask_var_t *filter_mask_list;
>> > +   unsigned int cpu_first, cpu, irq_start, cores = 0;
>> > +   int i, core_count = 0, numa_node, cpu_count = 0, err = 0;
>> > +
>> > +   BUG_ON(!alloc_cpumask_var(&filter_mask, GFP_KERNEL));
>>
>> Again, don't panic if a memory allocation fails.  Must back out and
>> return an error.
>>
>> > +   cpus_read_lock();
>> > +   cpumask_copy(filter_mask, cpu_online_mask);
>>
>> For readability, I'd suggest adding a blank line before any of the
>> multi-line comments below.
>>
>> > +   /* count the number of cores
>> > +    */
>> > +   for_each_cpu(cpu, filter_mask) {
>> > +           cpumask_andnot(filter_mask, filter_mask,
>topology_sibling_cpumask(cpu));
>> > +           cores++;
>> > +   }
>> > +   filter_mask_list = kcalloc(cores, sizeof(cpumask_var_t), GFP_KERNEL);
>> > +   if (!filter_mask_list) {
>> > +           err = -ENOMEM;
>> > +           goto free_irq;
>
>One additional macro-level comment.  Creating and freeing the filter_mask_list is
>a real pain.  But it is needed to identify which CPUs in a core are available to have
>an IRQ assigned to them.  So let me suggest a modified approach to meeting that
>need.
>
>1) Count the number of cores just as before.
>
>2) Then instead of allocating filter_mask_list, allocate an array of integers, with one
>array entry per core.  Call the array core_id_list.
>Somewhat like the code in cpu_mask_set(), populate the array with the CPU
>number of the first CPU in each core.  The populating only needs to be done once,
>so the code can be inline in irq_setup().
>It doesn't need to be in a helper function like cpu_mask_set().
>
>3) Allocate a single cpumask_var_t local variable that is initialized to a copy of
>cpu_online_mask.  Call it avail_cpus.  This local variable indicates which CPUs
>(across all cores) are available to have an IRQ assigned.
>
>4) At the beginning of the "for" loop over nvec, current code has:
>
>        cpu_first = cpumask_first(filter_mask_list[core_count]);
>
>Instead, do:
>
>        cpu_first = cpumask_first_and(&avail_cpus,
>                        topology_sibling_cpumask(core_id_list[core_count]));
>
>The above code effectively creates on-the-fly the cpumask previously stored in
>filter_mask_list, and finds the first CPU as before.
>
>5) When a CPU is assigned an IRQ, clear that CPU in avail_cpus, not in the
>filter_mask_list entry.
>
>6) When the NUMA node wraps around and current code calls cpu_mask_set(),
>instead reinitialize avail_cpus to cpu_online_mask like in my #3 above.
>
>In summary, with this approach filter_mask_list isn't needed at all.  The messiness
>of allocating and freeing an array of cpumask's goes away.  I haven't coded it, but I
>think the result will be simpler and less error prone.
>
Changed it in V2.
>Michael
>
>> > +   }
>> > +   /* if number of cpus are equal to max_queues per port, then
>> > +    * one extra interrupt for the hardware channel communication.
>> > +    */
>>
>> Note that higher level code may set nvec based on the # of online
>> CPUs, but until the cpus_read_lock is held, the # of online CPUs can
>> change. So nvec might have been determined when the # of CPUs was 8,
>> but 2 CPUs could go offline before the cpus_read_lock is obtained.  So
>> nvec could be bigger than just 1 more than num_online_cpus().  I'm not
>> sure how to handle the check below to special case the hardware
>> communication channel.  But realize that the main "for" loop below
>> could end up assigning multiple IRQs to the same CPU if nvec >
>> num_online_cpus().
>>
>> > +   if (nvec - 1 == num_online_cpus()) {
>> > +           irq_start = 1;
>> > +           cpu_first = cpumask_first(cpu_online_mask);
>> > +           irq_set_affinity_and_hint(irqs[0], cpumask_of(cpu_first));
>> > +   } else {
>> > +           irq_start = 0;
>> > +   }
>> > +   /* reset the core_count and num_node to 0.
>> > +    */
>> > +   core_count = 0;
>> > +   numa_node = 0;
>> > +   cpu_mask_set(&filter_mask, &filter_mask_list);
>>
>> FWIW, passing filter_mask as the first argument above was confusing to
>> me until I realized that the value of filter_mask doesn't matter --
>> it's just to use the memory allocated for filter_mask.  Maybe a
>> comment would help.  And ditto for the invocation of cpu_mask_set()
>> further down.
Fixed it in V2.
>>
>> > +   /* for each interrupt find the cpu of a particular
>> > +    * sibling set and if it belongs to the specific numa
>> > +    * then assign irq to it and clear the cpu bit from
>> > +    * the corresponding sibling list from filter_mask_list.
>> > +    * Increase the cpu_count for that node.
>> > +    * Once all cpus for a numa node is assigned, then
>> > +    * move to different numa node and continue the same.
>> > +    */
>> > +   for (i = irq_start; i < nvec; ) {
>> > +           cpu_first = cpumask_first(filter_mask_list[core_count]);
>> > +           if (cpu_first < nr_cpu_ids && cpu_to_node(cpu_first) ==
>> > + numa_node) {
>>
>> Note that it's possible to have a NUMA node with zero online CPUs.
>> It could be a NUMA node that was originally a memory-only NUMA node
>> and never had any CPUs, or the NUMA node could have had CPUs, but they
>> were all later taken offline.  Such a NUMA node is still considered to
>> be online because it has memory, but it has no online CPUs.
>>
>> If this code ever sets "numa_node" to such a NUMA node, the "for" loop
>> will become an infinite loop because the "if"
>> statement above will never find a CPU that is assigned to "numa_node".
>>
>> > +                   irq_set_affinity_and_hint(irqs[i], cpumask_of(cpu_first));
>> > +                   cpumask_clear_cpu(cpu_first, filter_mask_list[core_count]);
>> > +                   cpu_count = cpu_count + 1;
>> > +                   i = i + 1;
>> > +                   /* checking if all the cpus are used from the
>> > +                    * particular node.
>> > +                    */
>> > +                   if (cpu_count == nr_cpus_node(numa_node)) {
>> > +                           numa_node = numa_node + 1;
>> > +                           if (numa_node == num_online_nodes()) {
>> > +                                   cpu_mask_set(&filter_mask,
>> > + &filter_mask_list);
>>
>> The second and subsequent calls to cpu_mask_set() will leak any memory
>> previously allocated by alloc_cpumask_var() in cpu_mask_set().
>>
>> I agree with Haiyang's comment about starting with a NUMA node other
>> than NUMA node zero.  But if you do so, note that testing for
>> wrap-around at num_online_nodes() won't be equivalent to testing for
>> getting back to the starting NUMA node.
>> You really want to run cpu_mask_set() again only when you get back to
>> the starting NUMA node.
Fixed it in V2.
>>
>> > +                                   numa_node = 0;
>> > +                           }
>> > +                           cpu_count = 0;
>> > +                           core_count = 0;
>> > +                           continue;
>> > +                   }
>> > +           }
>> > +           if ((core_count + 1) % cores == 0)
>> > +                   core_count = 0;
>> > +           else
>> > +                   core_count++;
>>
>> The if/else could be more compactly written as:
>>
>>               if (++core_count == cores)
>>                       core_count = 0;
>>
>> > +   }
>> > +free_irq:
>> > +   cpus_read_unlock();
>> > +   if (filter_mask)
>>
>> This check for non-NULL filter_mask is incorrect and might not even
>> compile if CONFIG_CPUMASK_OFFSTACK=n.  For testing purposes, you
>> should make sure to build and run with each
>> option:  with CONFIG_CPUMASK_OFFSTACK=y and also
>> CONFIG_CPUMASK_OFFSTACK=n.
Fixed it in V2.
>>
>> It's safe to just call free_cpumask_var() without the check.
>>
>> > +           free_cpumask_var(filter_mask);
>> > +   if (filter_mask_list) {
>> > +           for (core_count = 0; core_count < cores; core_count++)
>> > +                   free_cpumask_var(filter_mask_list[core_count]);
>> > +           kfree(filter_mask_list);
>> > +   }
>> > +   return err;
>> > +}