Date:   Sat, 12 Nov 2022 00:58:33 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Lorenzo Pieralisi <lpieralisi@...nel.org>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>
CC:     "quic_jhugo@...cinc.com" <quic_jhugo@...cinc.com>,
        "quic_carlv@...cinc.com" <quic_carlv@...cinc.com>,
        KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "sthemmin@...rosoft.com" <sthemmin@...rosoft.com>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
        "robh@...nel.org" <robh@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
        "helgaas@...nel.org" <helgaas@...nel.org>,
        "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
        "boqun.feng@...il.com" <boqun.feng@...il.com>,
        Boqun Feng <Boqun.Feng@...rosoft.com>
Subject: RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for
 Multi-MSI

> From: Lorenzo Pieralisi <lpieralisi@...nel.org>
> Sent: Friday, November 11, 2022 1:56 AM
> > ...
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> 
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?

Unfortunately, I can't fix this because of the way the x86 irq management
code works. This is out of the control of the pci-hyperv driver.

hv_compose_msi_msg() uses irq_data_get_effective_affinity_mask() to get
the "effective" affinity mask.

On x86, when the irq is initialized, irq_data_update_effective_affinity() is
called from apic_update_irq_cfg() -- please refer to the below debug code.

When the initial call to hv_compose_msi_msg() is invoked, it's from
pci_alloc_irq_vectors(), and the x86 irq code always passes CPU0 to pci-hyperv.
Please see the below "cpumask_first(cpu_online_mask)" in
vector_assign_managed_shutdown().

When hv_compose_msi_msg() is invoked the second time, it's from
request_irq(), and the x86 irq code passes the real effective CPU to pci-hyperv.
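
So the first call can never see the real target vCPU, which is why the patch
round-robins the dummy vCPU for the multi-MSI case. A minimal sketch of such a
round-robin helper (illustrative; names and details may differ from the patch):

static int pick_dummy_vcpu_round_robin(void)
{
	static DEFINE_SPINLOCK(dummy_cpu_lock);
	static int cpu_next = -1;	/* -1: start from the first online CPU */
	unsigned long flags;
	int cpu;

	spin_lock_irqsave(&dummy_cpu_lock, flags);
	/* Advance to the next online CPU, wrapping around when we hit the end */
	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids, false);
	cpu = cpu_next;
	spin_unlock_irqrestore(&dummy_cpu_lock, flags);

	return cpu;
}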

I added the debug code below and pasted the trimmed output after it.

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -179,6 +179,7 @@ static void vector_assign_managed_shutdown(struct irq_data *irqd)
        unsigned int cpu = cpumask_first(cpu_online_mask);

        apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+       WARN(irqd->irq >= 24, "cdx: vector_assign_managed_shutdown: irq=%d, cpu=%d\n", irqd->irq, cpu);
 }

 static int reserve_managed_vector(struct irq_data *irqd)
@@ -251,6 +252,7 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
                return vector;
        apic_update_vector(irqd, vector, cpu);
        apic_update_irq_cfg(irqd, vector, cpu);
+       WARN(irqd->irq >= 24, "cdx: assign_vector_locked: irq=%d, cpu=%d\n", irqd->irq, cpu);

        return 0;
 }
@@ -328,6 +330,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
                return vector;
        apic_update_vector(irqd, vector, cpu);
        apic_update_irq_cfg(irqd, vector, cpu);
+       WARN(irqd->irq >= 24, "cdx: assign_managed_vector: irq=%d, cpu=%d\n", irqd->irq, cpu);
        return 0;
 }

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1803,6 +1803,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
        u32 size;
        int ret;

+       WARN(1, "cdx: hv_compose_msi_msg: irq=%d\n", data->irq);
        /* Reuse the previous allocation */
        if (data->chip_data) {
                int_desc = data->chip_data;

  1 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
  2 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
  3 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
  4  reserve_irq_vector_locked+0x41/0xa0
  5  x86_vector_alloc_irqs+0x298/0x460
  6  irq_domain_alloc_irqs_hierarchy+0x1b/0x50
  7  irq_domain_alloc_irqs_parent+0x17/0x30
  8  msi_domain_alloc+0x83/0x150
  9  irq_domain_alloc_irqs_hierarchy+0x1b/0x50
 10  __irq_domain_alloc_irqs+0xdf/0x320
 11  __msi_domain_alloc_irqs+0x103/0x3e0
 12  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 13  pci_msi_setup_msi_irqs+0x2d/0x40
 14  __pci_enable_msix_range+0x2fd/0x420
 15  pci_alloc_irq_vectors_affinity+0xb0/0x110
 16  nvme_reset_work+0x1cf/0x1170
 17  process_one_work+0x21f/0x3f0
 18  worker_thread+0x4a/0x3c0
 19  kthread+0xff/0x130
 20  ret_from_fork+0x22/0x30
 21
 22 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
 23 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
 24 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
 25  x86_vector_activate+0x149/0x1e0
 26  __irq_domain_activate_irq+0x58/0x90
 27  __irq_domain_activate_irq+0x38/0x90
 28  irq_domain_activate_irq+0x2d/0x50
 29  __msi_domain_alloc_irqs+0x1bb/0x3e0
 30  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 31  pci_msi_setup_msi_irqs+0x2d/0x40
 32  __pci_enable_msix_range+0x2fd/0x420
 33  pci_alloc_irq_vectors_affinity+0xb0/0x110
 34  nvme_reset_work+0x1cf/0x1170
 35  process_one_work+0x21f/0x3f0
 36  worker_thread+0x4a/0x3c0
 37  kthread+0xff/0x130
 38  ret_from_fork+0x22/0x30
 39
 40
 41 cdx: hv_compose_msi_msg: irq=24
 42 WARNING: CPU: 3 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 43 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 44  irq_chip_compose_msi_msg+0x32/0x50
 45  msi_domain_activate+0x45/0xa0
 46  __irq_domain_activate_irq+0x58/0x90
 47  irq_domain_activate_irq+0x2d/0x50
 48  __msi_domain_alloc_irqs+0x1bb/0x3e0
 49  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 50  pci_msi_setup_msi_irqs+0x2d/0x40
 51  __pci_enable_msix_range+0x2fd/0x420
 52  pci_alloc_irq_vectors_affinity+0xb0/0x110
 53  nvme_reset_work+0x1cf/0x1170
 54  process_one_work+0x21f/0x3f0
 55  worker_thread+0x4a/0x3c0
 56  kthread+0xff/0x130
 57  ret_from_fork+0x22/0x30
 58
 59
 60
 61 cdx: assign_vector_locked: irq=24, cpu=3
 62 WARNING: CPU: 0 PID: 87 at arch/x86/kernel/apic/vector.c:255 assign_vector_locked+0x160/0x170
 63 RIP: 0010:assign_vector_locked+0x160/0x170
 64  assign_irq_vector_any_locked+0x6a/0x150
 65  x86_vector_activate+0x93/0x1e0
 66  __irq_domain_activate_irq+0x58/0x90
 67  __irq_domain_activate_irq+0x38/0x90
 68  irq_domain_activate_irq+0x2d/0x50
 69  irq_activate+0x29/0x30
 70  __setup_irq+0x2e5/0x780
 71  request_threaded_irq+0x112/0x180
 72  pci_request_irq+0xa3/0xf0
 73  queue_request_irq+0x74/0x80
 74  nvme_reset_work+0x408/0x1170
 75  process_one_work+0x21f/0x3f0
 76  worker_thread+0x4a/0x3c0
 77  kthread+0xff/0x130
 78  ret_from_fork+0x22/0x30
 79
 80 cdx: hv_compose_msi_msg: irq=24
 81 WARNING: CPU: 0 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 82 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 83  irq_chip_compose_msi_msg+0x32/0x50
 84  msi_domain_activate+0x45/0xa0
 85  __irq_domain_activate_irq+0x58/0x90
 86  irq_domain_activate_irq+0x2d/0x50
 87  irq_activate+0x29/0x30
 88  __setup_irq+0x2e5/0x780
 89  request_threaded_irq+0x112/0x180
 90  pci_request_irq+0xa3/0xf0
 91  queue_request_irq+0x74/0x80
 92  nvme_reset_work+0x408/0x1170
 93  process_one_work+0x21f/0x3f0
 94  worker_thread+0x4a/0x3c0
 95  kthread+0xff/0x130
 96  ret_from_fork+0x22/0x30

> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
> 
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.

Good point. @Wei, can you please help fix this? I think we can change
"With current Hyper-V"
to
"With Hyper-V in Nov 2022".

BTW, it's hard to provide the exact versioning info, because technically
there are many versions of Hyper-V...

> I can't claim to understand how this MSI vCPU to pCPU mapping is made to
> work in current code but I can't ack this patch sorry, if you feel like
> it is good to merge it it is your and Hyper-V maintainers call, feel
> free to go ahead - I can review PCI hyper-V changes that affect PCI
> and IRQs core APIs, I don't know Hyper-V internals.
> 
> Lorenzo

I understand. Thanks! 

I discussed the issue with the Hyper-V team. I believe I understand it now,
and this patch is the right thing to have.

@Wei, Bjorn spotted two typos in the commit message, and Lorenzo
suggested a change to the above "current". Can you please help
fix these and merge the patch? Or, let me know if it would be
easier for me to send a v4.

Thanks,
Dexuan
