Message-ID: <mafs0350y7tg7.fsf@amazon.de>
Date: Fri, 4 Aug 2023 16:50:16 +0200
From: Pratyush Yadav <ptyadav@...zon.de>
To: Keith Busch <kbusch@...nel.org>
CC: Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
"Jens Axboe" <axboe@...nel.dk>, <linux-nvme@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has
none

On Fri, Jul 28 2023, Keith Busch wrote:
> On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote:
>>
>> I am guessing you are looking at irq_create_affinity_masks(). Yeah, it
>> does not take into account the NUMA information. In fact, even if it
>> did, the NUMA node associated with the IRQ is NUMA_NO_NODE
>> (/proc/$irq/node == -1).
>>
>> I did some more digging over the week to figure out what is going on. It
>> seems like the kernel _does_ in fact allow all CPUs in the affinity. I
>> added some prints in set_affinity_irq() in
>> drivers/xen/events/events_base.c (since that is the irqchip for the
>> interrupt). I see it being called with mask: ffffffff,ffffffff.
>>
>> But I later see the function being called again with a different mask:
>> 00000000,00008000. The stack trace shows the call is coming from
>> ksys_write(). The process doing the write is irqbalance.
>>
>> So I think your earlier statement was incorrect. irqbalance does in fact
>> balance these interrupts and it probably looks at the NUMA information
>> of the device to make that decision. My original reasoning holds and
>> irqbalance is the one picking the affinity.
>>
>> With this explanation, do you think the patch is good to go?
>
> irqbalance still writes to the /proc/<irq>/smp_affinity to change it,
> right? That's just getting I/O errors on my machines because it fails
> irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except
> the first vector, but that one is not used for I/O). Is there another
> path irqbalance is using that's somehow getting past the appropriate
> checks? Or perhaps is your xen irq_chip somehow bypassing the managed
> irq property?

I picked the interrupt "nvme4q26" as an example. The call stack is
(printed via WARN_ON(1)):

? __warn+0x7d/0x140
? set_affinity_irq+0xf0/0x220
? report_bug+0xf8/0x1e0
? handle_bug+0x44/0x80
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? set_affinity_irq+0xf0/0x220
? set_affinity_irq+0xf0/0x220
irq_do_set_affinity+0x135/0x1e0
irq_set_affinity_locked+0x186/0x1f0
__irq_set_affinity+0x41/0x70
write_irq_affinity.isra.8+0xf6/0x120
proc_reg_write+0x59/0x80
vfs_write+0xc7/0x3c0
? __do_sys_newfstat+0x35/0x60
? __fget_light+0xcb/0x120
ksys_write+0xa5/0xe0
do_syscall_64+0x42/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The check you mention is in write_irq_affinity(). I added some prints
there, and it turns out that __irq_can_set_affinity() returns true and
irqd_affinity_is_managed() returns false.
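
For reference, that check boils down to the following (paraphrased from
my reading of kernel/irq/manage.c and trimmed to the relevant lines, so
treat it as a sketch rather than the exact source):

    /*
     * Userspace may only set the affinity of an IRQ that is
     * balanceable, whose chip implements irq_set_affinity, and that
     * is *not* kernel-managed.
     */
    bool irq_can_set_affinity_usr(unsigned int irq)
    {
            struct irq_desc *desc = irq_to_desc(irq);

            return __irq_can_set_affinity(desc) &&
                    !irqd_affinity_is_managed(&desc->irq_data);
    }

So once irqd_affinity_is_managed() wrongly reports false, writes from
irqbalance sail through.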

I did some more digging and it turns out that the masks are created by
irq_create_affinity_masks(), which also sets is_managed in the IRQ
affinity descriptors.
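
For context, irq_create_affinity_masks() does roughly this (trimmed
from kernel/irq/affinity.c as I read it, so again just a sketch):

    /* Vectors before the spread range get the default affinity. */
    for (curvec = 0; curvec < affd->pre_vectors; curvec++)
            cpumask_copy(&masks[curvec].mask, irq_default_affinity);

    /* ... the middle vectors are spread via group_cpus_evenly() ... */

    /* Mark the managed interrupts */
    for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
            masks[i].is_managed = 1;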

This is then passed down to __msi_domain_alloc_locked(). On a non-Xen
system you would end up calling __msi_domain_alloc_irqs() next, since
ops->domain_alloc_irqs() is only implemented by Xen. That function
finds the masks created earlier and passes them down to
__irq_domain_alloc_irqs(), which eventually lands in alloc_descs().
alloc_descs() checks is_managed and sets IRQD_AFFINITY_MANAGED.
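
That is, something along these lines (trimmed from alloc_descs() in
kernel/irq/irqdesc.c as I read it):

    /*
     * If the caller supplied affinity descriptors, a vector flagged
     * is_managed gets the managed irqd flags, and its descriptor is
     * allocated on the node its mask points at.
     */
    if (affinity) {
            if (affinity->is_managed) {
                    flags = IRQD_AFFINITY_MANAGED |
                            IRQD_MANAGED_SHUTDOWN;
            }
            mask = &affinity->mask;
            node = cpu_to_node(cpumask_first(mask));
            affinity++;
    }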

On Xen though, xen_msi_domain_alloc_irqs() is called instead. This
eventually lands in xen_allocate_irqs_dynamic(), which calls
irq_alloc_descs(). That macro calls __irq_alloc_descs() with affinity
set to NULL, so we lose both the is_managed flag and the affinities
that irq_create_affinity_masks() built via group_cpus_evenly().
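
Concretely, the macro in include/linux/irq.h hardcodes the NULL:

    #define irq_alloc_descs(irq, from, cnt, node)	\
            __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)

so whatever masks the MSI core computed never reach __irq_alloc_descs()
on this path.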

As a result, MSI IRQs on Xen can never be managed by the kernel. They
are marked as userspace-manageable and irqbalance can set their
affinity. Applying the (hacky) patch below fixes this problem and lets
the kernel manage the affinities.

------ 8< ------
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..15f36e34e28b4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -743,9 +743,10 @@ static void xen_irq_init(unsigned irq)
 	list_add_tail(&info->list, &xen_irq_list_head);
 }
 
-static int __must_check xen_allocate_irqs_dynamic(int nvec)
+static int __must_check xen_allocate_irqs_dynamic(int nvec,
+						  struct irq_affinity_desc *affd)
 {
-	int i, irq = irq_alloc_descs(-1, 0, nvec, -1);
+	int i, irq = __irq_alloc_descs(-1, 0, nvec, -1, THIS_MODULE, affd);
 
 	if (irq >= 0) {
 		for (i = 0; i < nvec; i++)
@@ -758,7 +759,7 @@ static int __must_check xen_allocate_irqs_dynamic(int nvec)
 
 static inline int __must_check xen_allocate_irq_dynamic(void)
 {
-	return xen_allocate_irqs_dynamic(1);
+	return xen_allocate_irqs_dynamic(1, NULL);
 }
 
 static int __must_check xen_allocate_irq_gsi(unsigned gsi)
@@ -1108,7 +1109,7 @@ int xen_bind_pirq_msi_to_irq(struct pci_dev *dev, struct msi_desc *msidesc,
 
 	mutex_lock(&irq_mapping_update_lock);
 
-	irq = xen_allocate_irqs_dynamic(nvec);
+	irq = xen_allocate_irqs_dynamic(nvec, msidesc->affinity);
 
 	if (irq < 0)
 		goto out;
------ >8 ------

With this patch, I get the below affinities:
$ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
> cat /proc/irq/$i/{smp,effective}_affinity_list; \
> done
8
8
16-17,48,65,67,69

18-19,50,71,73,75

20,52,77,79

21,53,81,83

22,54,85,87

23,55,89,91

24,56,93,95

25,57,97,99

26,58,101,103

27,59,105,107

28,60,109,111

29,61,113,115

30,62,117,119

31,63,121,123

49,51,125,127

0,32,64,66

1,33,68,70

2,34,72,74

3,35,76,78

4,36,80,82

5,37,84,86

6,38,88,90

7,39,92,94

8,40,96,98

9,41,100,102

10,42,104,106

11,43,108,110

12,44,112,114

13,45,116,118

14,46,120,122

15,47,124,126

The blank lines in the output are because effective_affinity_list is
empty for all but the first interrupt.

The problem is that even with this patch I still see the same
performance difference when running on Node 1 vs Node 0. I am not sure
why. Any pointers?

--
Regards,
Pratyush Yadav
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879