Message-ID: <SA1PR21MB1335D08F987BBAE08EADF010BF6B9@SA1PR21MB1335.namprd21.prod.outlook.com>
Date: Tue, 16 Aug 2022 21:04:25 +0000
From: Dexuan Cui <decui@...rosoft.com>
To: Lorenzo Pieralisi <lpieralisi@...nel.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>
CC: Jeffrey Hugo <quic_jhugo@...cinc.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
KY Srinivasan <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>,
Stephen Hemminger <sthemmin@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
"robh@...nel.org" <robh@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
"helgaas@...nel.org" <helgaas@...nel.org>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"boqun.feng@...il.com" <boqun.feng@...il.com>,
Boqun Feng <Boqun.Feng@...rosoft.com>,
Carl Vanderlip <quic_carlv@...cinc.com>
Subject: RE: [PATCH] PCI: hv: Only reuse existing IRTE allocation for
Multi-MSI
> From: Lorenzo Pieralisi <lpieralisi@...nel.org>
> Sent: Tuesday, August 16, 2022 8:48 AM
> To: Dexuan Cui <decui@...rosoft.com>
>
> On Wed, Aug 10, 2022 at 09:51:23PM +0000, Dexuan Cui wrote:
> > > From: Jeffrey Hugo <quic_jhugo@...cinc.com>
> > > Sent: Thursday, August 4, 2022 7:22 AM
> > > ...
> > > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in
> > > compose_msi_msg()")
> > > > Signed-off-by: Dexuan Cui <decui@...rosoft.com>
> > > > Cc: Jeffrey Hugo <quic_jhugo@...cinc.com>
> > > > Cc: Carl Vanderlip <quic_carlv@...cinc.com>
> > > > ---
> > >
> > > I'm sorry a regression has been discovered. Right now, the issue
> > > doesn't make sense to me. I'd love to know what you find out.
> > >
> > > This stopgap solution appears reasonable to me.
> > >
> > > Reviewed-by: Jeffrey Hugo <quic_jhugo@...cinc.com>
> >
> > Hi Lorenzo, Bjorn and all,
> > Would you please take a look at the patch?
>
> I am not very happy with this patch, it is a temporary solution to
> a potential problem (or reworded: we don't know what the problem is
> but we are fixing it anyway - in a way that is potentially not
> related to the bug at all).

Exactly. I understand your concern. We're still trying to figure out
the root cause. The problem is that so far the issue only reproduces on
the Azure VM SKU "L64s v2" (64 AMD vCPUs, 8 NVMe drives, 32 I/O
queues per NVMe device). Unfortunately I can't find the same hardware
and software locally: I tried to reproduce the issue on two local Hyper-V
hosts with 4 NVMe devices (8 I/O queues per NVMe device) but failed.
It's very difficult to debug the issue on Azure (if that's possible at all)
because we don't have a hypervisor debugger to examine the hypervisor
and the hardware (I'm still trying to figure out how difficult it is to
set up such a debugger).

As I mentioned, b4b77778ecc5 itself looks good, but it changes the
behavior of compose_msi_msg(), which has been working for many years.
This small patch restores the old behavior for non-Multi-MSI, so it's
safe. Considering we may need quite a long time to find the root cause,
IMO this patch is a good temporary solution for non-Multi-MSI.

> If the commit you are fixing is the problem I'd rather revert it,
> waiting to understand the problem and to rework the code accordingly.

IMO we should not revert b4b77778ecc5, because Jeff's Multi-MSI-capable
PCI device needs that commit to work properly. It looks like Jeff's PCI
device doesn't suffer from the interrupt issue I described in the commit log.

BTW, Multi-MSI never worked in pci-hyperv before Jeff's 4 recent
patches. Jeff's PCI device is the first Multi-MSI PCI device we have
ever tested with pci-hyperv.

> I don't think b4b77778ecc5 is essential to Hyper-V code - it is a
> rework, let's fix it and repost it when it is updated through the
> debugging you are carrying out. In the meantime we can revert it
> to fix the issue.

Thanks for sharing your thoughts. IMO b4b77778ecc5 is not a simple
rework, and as I explained above, we should not revert it.

Before b4b77778ecc5:

1. When a PCI device driver calls pci_alloc_irq_vectors(),
hv_compose_msi_msg() is called to create an interrupt mapping with
the dummy vector 0xef (i.e., MANAGED_IRQ_SHUTDOWN_VECTOR).
2. When the PCI device driver calls request_irq(), hv_compose_msi_msg()
is called again with a real vector and cpu:
2.1 hv_compose_msi_msg() destroys the earlier interrupt mapping;
2.2 hv_compose_msi_msg() creates a new mapping with the real vector/cpu;
2.3 hv_arch_irq_unmask() uses a hypercall to update the mapping with
the real vector/cpu (BTW, hv_arch_irq_unmask() is also called when the
irqbalance daemon is running).

Steps 2.2 and 2.3 do seem duplicated, but my understanding is that this
is how the Linux irq and PCI MSI code work.
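
To make the sequence concrete, here is a minimal, hypothetical driver
sketch (example_probe()/example_isr() are made-up names; the PCI/IRQ
calls are the standard kernel APIs) of the two points where
hv_compose_msi_msg() ends up running for a single MSI-X vector:

#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t example_isr(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int nvec, irq, ret;

	/* Step 1: hv_compose_msi_msg() runs with the dummy vector 0xef
	 * (MANAGED_IRQ_SHUTDOWN_VECTOR) when the vector is allocated.
	 */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
	if (nvec < 0)
		return nvec;

	irq = pci_irq_vector(pdev, 0);

	/* Step 2: hv_compose_msi_msg() runs again with the real vector/cpu
	 * (steps 2.1-2.3 above), and hv_arch_irq_unmask() later updates the
	 * mapping via a hypercall.
	 */
	ret = request_irq(irq, example_isr, 0, "example", pdev);
	if (ret)
		pci_free_irq_vectors(pdev);

	return ret;
}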

With b4b77778ecc5, we no longer have steps 2.1 and 2.2 (please note
the "return" in that commit). We had never tested this code path before
b4b77778ecc5, and we assumed it would work, until we found the NVMe
interrupt issue with the VM SKU "L64s v2".
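
To show concretely what this patch aims for, here is a rough sketch of
the intended logic in hv_compose_msi_msg() (a hedged paraphrase only,
not the literal diff; int_desc, msi_desc and fill_msg_from_int_desc()
are illustrative names, and nvec_used is the struct msi_desc field that
counts the allocated vectors):

	/* Illustrative sketch, not the literal patch: only take the early
	 * return added by b4b77778ecc5 when the device uses Multi-MSI.
	 */
	if (int_desc) {
		if (msi_desc->nvec_used > 1) {
			/* Multi-MSI: reuse the IRTE allocation made by the
			 * first hv_compose_msi_msg() call (b4b77778ecc5).
			 */
			fill_msg_from_int_desc(msg, int_desc); /* hypothetical helper */
			return;
		}
		/* Non-Multi-MSI: restore the pre-b4b77778ecc5 behavior,
		 * i.e. steps 2.1 and 2.2 above (destroy the old mapping and
		 * create a new one with the real vector/cpu).
		 */
	}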

The strangest thing is that the NVMe devices do work at low I/O depth,
though the read/write speed is a little lower (which is also strange).
At high I/O depth, queue-29 of each NVMe device suddenly stops
working -- after the NVMe devices are initialized, hv_compose_msi_msg()
and hv_arch_irq_unmask() are not called again, meaning the MSI-X tables
on the NVMe devices are still the same, and the IOMMU's Interrupt
Remapping Table Entries should also be the same, so it's really weird
that the interrupts on queue-29 stop arriving...

My speculation is that there may be some kind of weird hypervisor bug,
but it's hard to prove without a debugger. :-(

I'd appreciate your thoughts on my analysis.

To recap, my suggestion is that we use this patch as a temporary solution
and continue the investigation, which unfortunately may take a long time.

Thanks,
Dexuan