Message-ID: <Z-Gv6TG9dwKI-fvz@macbook.local>
Date: Mon, 24 Mar 2025 20:18:01 +0100
From: Roger Pau Monné <roger.pau@...rix.com>
To: Daniel Gomez <da.gomez@...nel.org>
Cc: Jürgen Groß <jgross@...e.com>,
Bjorn Helgaas <helgaas@...nel.org>, linux-kernel@...r.kernel.org,
xen-devel@...ts.xenproject.org, linux-pci@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>,
Bjorn Helgaas <bhelgaas@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v3 3/3] PCI/MSI: Convert pci_msi_ignore_mask to per MSI
domain flag
On Mon, Mar 24, 2025 at 07:58:14PM +0100, Daniel Gomez wrote:
> On Mon, Mar 24, 2025 at 06:51:54PM +0100, Roger Pau Monné wrote:
> > On Mon, Mar 24, 2025 at 03:29:46PM +0100, Daniel Gomez wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Mar 21, 2025 at 09:00:09AM +0100, Jürgen Groß wrote:
> > > > On 20.03.25 22:07, Bjorn Helgaas wrote:
> > > > > On Wed, Feb 19, 2025 at 10:20:57AM +0100, Roger Pau Monne wrote:
> > > > > > Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
> > > > > > MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
> > > > > > Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
> > > > > > event channels, to prevent PCI code from attempting to toggle the maskbit,
> > > > > > as it's Xen that controls the bit.
> > > > > >
> > > > > > However, the pci_msi_ignore_mask being global will affect devices that use
> > > > > > MSI interrupts but are not routing those interrupts over event channels
> > > > > > (not using the Xen pIRQ chip). One example is devices behind a VMD PCI
> > > > > > bridge. In that scenario the VMD bridge configures MSI(-X) using the
> > > > > > normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
> > > > > > bridge configure the MSI entries using indexes into the VMD bridge MSI
> > > > > > table. The VMD bridge then demultiplexes such interrupts and delivers to
> > > > > > the destination device(s). Having pci_msi_ignore_mask set in that scenario
> > > > > > prevents (un)masking of MSI entries for devices behind the VMD bridge.
> > > > > >
> > > > > > Move the signaling of no entry masking into the MSI domain flags, as that
> > > > > > allows setting it on a per-domain basis. Set it for the Xen MSI domain
> > > > > > that uses the pIRQ chip, while leaving it unset for the rest of the
> > > > > > cases.
> > > > > >
> > > > > > Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
> > > > > > with Xen dropping usage the variable is unneeded.
> > > > > >
> > > > > > This fixes using devices behind a VMD bridge on Xen PV hardware domains.
> > > > > >
> > > > > > Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean
> > > > > > Linux cannot use them. By inhibiting the usage of
> > > > > > VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask
> > > > > > bodge devices behind a VMD bridge do work fine when use from a Linux Xen
> > > > > > hardware domain. That's the whole point of the series.
> > > > > >
> > > > > > Signed-off-by: Roger Pau Monné <roger.pau@...rix.com>
> > > > > > Reviewed-by: Thomas Gleixner <tglx@...utronix.de>
> > > > > > Acked-by: Juergen Gross <jgross@...e.com>
> > > > >
> > > > > Acked-by: Bjorn Helgaas <bhelgaas@...gle.com>
> > > > >
> > > > > I assume you'll merge this series via the Xen tree. Let me know if
> > > > > otherwise.
> > > >
> > > > I've pushed the series to the linux-next branch of the Xen tree.
> > > >
> > > >
> > > > Juergen
> > >
> > > This patch landed in latest next-20250324 tag causing this crash:
> > >
> > > [ 0.753426] BUG: kernel NULL pointer dereference, address: 0000000000000002
> > > [ 0.753921] #PF: supervisor read access in kernel mode
> > > [ 0.754286] #PF: error_code(0x0000) - not-present page
> > > [ 0.754656] PGD 0 P4D 0
> > > [ 0.754842] Oops: Oops: 0000 [#1]
> > > [ 0.755080] CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.14.0-rc7-next-20250324 #1 NONE
> > > [ 0.755691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > [ 0.756349] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [ 0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [ 0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [ 0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [ 0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [ 0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [ 0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [ 0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [ 0.756390] FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [ 0.756390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [ 0.756390] Call Trace:
> > > [ 0.756390] <TASK>
> > > [ 0.756390] ? __die_body+0x1b/0x60
> > > [ 0.756390] ? page_fault_oops+0x2d0/0x310
> > > [ 0.756390] ? exc_page_fault+0x59/0xc0
> > > [ 0.756390] ? asm_exc_page_fault+0x22/0x30
> > > [ 0.756390] ? msix_prepare_msi_desc+0x39/0x80
> > > [ 0.756390] ? msix_capability_init+0x172/0x2c0
> > > [ 0.756390] ? __pci_enable_msix_range+0x1a8/0x1d0
> > > [ 0.756390] ? pci_alloc_irq_vectors_affinity+0x7c/0xf0
> > > [ 0.756390] ? vp_find_vqs_msix+0x187/0x400
> > > [ 0.756390] ? vp_find_vqs+0x2f/0x250
> > > [ 0.756390] ? snprintf+0x3e/0x50
> > > [ 0.756390] ? vp_modern_find_vqs+0x13/0x60
> > > [ 0.756390] ? init_vq+0x184/0x1e0
> > > [ 0.756390] ? vp_get_status+0x20/0x20
> > > [ 0.756390] ? virtblk_probe+0xeb/0x8d0
> > > [ 0.756390] ? __kernfs_new_node+0x122/0x160
> > > [ 0.756390] ? vp_get_status+0x20/0x20
> > > [ 0.756390] ? virtio_dev_probe+0x171/0x1c0
> > > [ 0.756390] ? really_probe+0xc2/0x240
> > > [ 0.756390] ? driver_probe_device+0x1d/0x70
> > > [ 0.756390] ? __driver_attach+0x96/0xe0
> > > [ 0.756390] ? driver_attach+0x20/0x20
> > > [ 0.756390] ? bus_for_each_dev+0x7b/0xb0
> > > [ 0.756390] ? bus_add_driver+0xe6/0x200
> > > [ 0.756390] ? driver_register+0x5e/0xf0
> > > [ 0.756390] ? virtio_blk_init+0x4d/0x90
> > > [ 0.756390] ? add_boot_memory_block+0x90/0x90
> > > [ 0.756390] ? do_one_initcall+0xe2/0x250
> > > [ 0.756390] ? xas_store+0x4b/0x4b0
> > > [ 0.756390] ? number+0x13b/0x260
> > > [ 0.756390] ? ida_alloc_range+0x36a/0x3b0
> > > [ 0.756390] ? parameq+0x13/0x90
> > > [ 0.756390] ? parse_args+0x10f/0x2a0
> > > [ 0.756390] ? do_initcall_level+0x83/0xb0
> > > [ 0.756390] ? do_initcalls+0x43/0x70
> > > [ 0.756390] ? rest_init+0x80/0x80
> > > [ 0.756390] ? kernel_init_freeable+0x70/0xb0
> > > [ 0.756390] ? kernel_init+0x16/0x110
> > > [ 0.756390] ? ret_from_fork+0x30/0x40
> > > [ 0.756390] ? rest_init+0x80/0x80
> > > [ 0.756390] ? ret_from_fork_asm+0x11/0x20
> > > [ 0.756390] </TASK>
> > > [ 0.756390] Modules linked in:
> > > [ 0.756390] CR2: 0000000000000002
> > > [ 0.756390] ---[ end trace 0000000000000000 ]---
> > > [ 0.756390] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [ 0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [ 0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [ 0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [ 0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [ 0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [ 0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [ 0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [ 0.756390] FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [ 0.756390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [ 0.756390] note: swapper[1] exited with irqs disabled
> > > [ 0.782774] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> > > [ 0.783560] Kernel Offset: disabled
> > > [ 0.783909] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
> > >
> > >
> > > msix_prepare_msi_desc+0x39/0x80:
> > > msix_prepare_msi_desc at drivers/pci/msi/msi.c:616
> > > 611 desc->nvec_used = 1;
> > > 612 desc->pci.msi_attrib.is_msix = 1;
> > > 613 desc->pci.msi_attrib.is_64 = 1;
> > > 614 desc->pci.msi_attrib.default_irq = dev->irq;
> > > 615 desc->pci.mask_base = dev->msix_base;
> > > >616< desc->pci.msi_attrib.can_mask = !(info->flags & MSI_FLAG_NO_MASK) &&
> > > 617 !desc->pci.msi_attrib.is_virtual;
> > > 618
> > > 619 if (desc->pci.msi_attrib.can_mask) {
> > > 620 void __iomem *addr = pci_msix_desc_addr(desc);
> > > 621
> > >
> > > Reverting patch 3 fixes the issue.
> >
> > Thanks for the report and sorry for the breakage. Do you have a QEMU
> > command line I can use to try to reproduce this locally?
> >
> > Will work on a patch ASAP.
>
> Thanks for the quick reply.
>
> The issue is that info appears to be uninitialized. So, this worked for me:
Indeed, irq_domain->host_data is NULL, so there is no msi_domain_info.
As this is x86, I was expecting x86 to always use
x86_init_dev_msi_info(), but that doesn't seem to be the case. I
would like to better understand this.
> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index dcbb4f9ac578..b76c7ec33602 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -609,8 +609,10 @@ void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
> desc->pci.msi_attrib.is_64 = 1;
> desc->pci.msi_attrib.default_irq = dev->irq;
> desc->pci.mask_base = dev->msix_base;
> - desc->pci.msi_attrib.can_mask = !(info->flags & MSI_FLAG_NO_MASK) &&
> - !desc->pci.msi_attrib.is_virtual;
> + desc->pci.msi_attrib.can_mask =
> + info ? !(info->flags & MSI_FLAG_NO_MASK) &&
> + !desc->pci.msi_attrib.is_virtual :
> + 1;
>
> if (desc->pci.msi_attrib.can_mask) {
> void __iomem *addr = pci_msix_desc_addr(desc);
> @@ -743,7 +745,7 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
> /* Disable INTX */
> pci_intx_for_msi(dev, 0);
>
> - if (!(info->flags & MSI_FLAG_NO_MASK)) {
> + if (info && !(info->flags & MSI_FLAG_NO_MASK)) {
I think this should rather be:

	if (!info || !(info->flags & MSI_FLAG_NO_MASK)) {

so that when there is no info the default action is to mask the entries.
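
To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is not kernel code: struct demo_domain_info and DEMO_FLAG_NO_MASK are hypothetical stand-ins for struct msi_domain_info and MSI_FLAG_NO_MASK, and the flag's bit position is arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the definitions from include/linux/msi.h. */
#define DEMO_FLAG_NO_MASK (UINT32_C(1) << 0)

struct demo_domain_info {
	uint32_t flags;
};

/* Condition from the quoted diff: a NULL info skips the masking step. */
static bool should_mask_proposed(const struct demo_domain_info *info)
{
	return info && !(info->flags & DEMO_FLAG_NO_MASK);
}

/* Condition suggested above: a NULL info falls back to masking. */
static bool should_mask_suggested(const struct demo_domain_info *info)
{
	return !info || !(info->flags & DEMO_FLAG_NO_MASK);
}
```

Both variants agree whenever info is non-NULL; they only differ in the NULL case, where the second one keeps the pre-existing behaviour of masking all table entries.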
> /*
> * Ensure that all table entries are masked to prevent
> * stale entries from firing in a crash kernel.
>
> I also noticed d (struct irq_domain) can return NULL if CONFIG_GENERIC_MSI_IRQ
> is not set and we are not checking that either.
>
> I run QEMU with vmctl [1]. This is my command:
>
> [1] https://github.com/SamsungDS/vmctl
>
> /usr/bin/qemu-system-x86_64 \
> -nodefaults \
> -display "none" \
> -machine "q35,accel=kvm,kernel-irqchip=split" \
> -cpu "host" \
> -smp "4" \
> -m "8G" \
> -device "intel-iommu,intremap=on" \
> -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
> -device "virtio-net-pci,netdev=net0" \
> -device "virtio-rng-pci" \
> -drive "id=boot,file=file.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
> -device "pcie-root-port,id=pcie_root_port0,chassis=1,slot=0" \
> -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port0,mdts=7" \
> -drive "id=nvm,file=~/nvm.img,format=raw,if=none,discard=unmap,media=disk,read-only=no" \
> -device "nvme-ns,id=nvm,drive=nvm,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096" \
> -pidfile "~/vmctl/confdir/run/nvme/pidfile" \
> -kernel "~/src/kernel/linux/arch/x86_64/boot/bzImage" \
> -append "root=/dev/vda1 console=ttyS0,115200 audit=0" \
> -virtfs "local,path=~/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
> -serial "mon:stdio" \
> -d "guest_errors" \
> -D "~/vmctl/confdir/log/nvme/qemu.log"

Can you narrow down the command line to the minimum required to
reproduce the issue?

Can you attach the Kconfig used to build the crashing kernel?

Thanks, Roger.