Message-ID: <Z-Gv6TG9dwKI-fvz@macbook.local>
Date: Mon, 24 Mar 2025 20:18:01 +0100
From: Roger Pau Monné <roger.pau@...rix.com>
To: Daniel Gomez <da.gomez@...nel.org>
Cc: Jürgen Groß <jgross@...e.com>,
	Bjorn Helgaas <helgaas@...nel.org>, linux-kernel@...r.kernel.org,
	xen-devel@...ts.xenproject.org, linux-pci@...r.kernel.org,
	Thomas Gleixner <tglx@...utronix.de>,
	Bjorn Helgaas <bhelgaas@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>,
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
	"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v3 3/3] PCI/MSI: Convert pci_msi_ignore_mask to per MSI
 domain flag

On Mon, Mar 24, 2025 at 07:58:14PM +0100, Daniel Gomez wrote:
> On Mon, Mar 24, 2025 at 06:51:54PM +0100, Roger Pau Monné wrote:
> > On Mon, Mar 24, 2025 at 03:29:46PM +0100, Daniel Gomez wrote:
> > > 
> > > Hi,
> > > 
> > > On Fri, Mar 21, 2025 at 09:00:09AM +0100, Jürgen Groß wrote:
> > > > On 20.03.25 22:07, Bjorn Helgaas wrote:
> > > > > On Wed, Feb 19, 2025 at 10:20:57AM +0100, Roger Pau Monne wrote:
> > > > > > Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
> > > > > > MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
> > > > > > Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
> > > > > > event channels, to prevent PCI code from attempting to toggle the maskbit,
> > > > > > as it's Xen that controls the bit.
> > > > > > 
> > > > > > However, the pci_msi_ignore_mask being global will affect devices that use
> > > > > > MSI interrupts but are not routing those interrupts over event channels
> > > > > > (not using the Xen pIRQ chip).  One example is devices behind a VMD PCI
> > > > > > bridge.  In that scenario the VMD bridge configures MSI(-X) using the
> > > > > > normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
> > > > > > bridge configure the MSI entries using indexes into the VMD bridge MSI
> > > > > > table.  The VMD bridge then demultiplexes such interrupts and delivers to
> > > > > > the destination device(s).  Having pci_msi_ignore_mask set in that scenario
> > > > > > prevents (un)masking of MSI entries for devices behind the VMD bridge.
> > > > > > 
> > > > > > Move the signaling of no entry masking into the MSI domain flags, as that
> > > > > > allows setting it on a per-domain basis.  Set it for the Xen MSI domain
> > > > > > that uses the pIRQ chip, while leaving it unset for the rest of the
> > > > > > cases.
> > > > > > 
> > > > > > Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
> > > > > > with Xen dropping usage the variable is unneeded.
> > > > > > 
> > > > > > This fixes using devices behind a VMD bridge on Xen PV hardware domains.
> > > > > > 
> > > > > > Although devices behind a VMD bridge are not known to Xen, that doesn't
> > > > > > mean Linux cannot use them.  By inhibiting the usage of
> > > > > > VMD_FEAT_CAN_BYPASS_MSI_REMAP and removing the pci_msi_ignore_mask
> > > > > > bodge, devices behind a VMD bridge do work fine when used from a Linux
> > > > > > Xen hardware domain.  That's the whole point of the series.
> > > > > > 
> > > > > > Signed-off-by: Roger Pau Monné <roger.pau@...rix.com>
> > > > > > Reviewed-by: Thomas Gleixner <tglx@...utronix.de>
> > > > > > Acked-by: Juergen Gross <jgross@...e.com>
> > > > > 
> > > > > Acked-by: Bjorn Helgaas <bhelgaas@...gle.com>
> > > > > 
> > > > > I assume you'll merge this series via the Xen tree.  Let me know if
> > > > > otherwise.
> > > > 
> > > > I've pushed the series to the linux-next branch of the Xen tree.
> > > > 
> > > > 
> > > > Juergen
> > > 
> > > This patch landed in latest next-20250324 tag causing this crash:
> > > 
> > > [    0.753426] BUG: kernel NULL pointer dereference, address: 0000000000000002
> > > [    0.753921] #PF: supervisor read access in kernel mode
> > > [    0.754286] #PF: error_code(0x0000) - not-present page
> > > [    0.754656] PGD 0 P4D 0
> > > [    0.754842] Oops: Oops: 0000 [#1]
> > > [    0.755080] CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.14.0-rc7-next-20250324 #1 NONE
> > > [    0.755691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > [    0.756349] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [    0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [    0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [    0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [    0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [    0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [    0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [    0.756390] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [    0.756390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [    0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [    0.756390] Call Trace:
> > > [    0.756390]  <TASK>
> > > [    0.756390]  ? __die_body+0x1b/0x60
> > > [    0.756390]  ? page_fault_oops+0x2d0/0x310
> > > [    0.756390]  ? exc_page_fault+0x59/0xc0
> > > [    0.756390]  ? asm_exc_page_fault+0x22/0x30
> > > [    0.756390]  ? msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390]  ? msix_capability_init+0x172/0x2c0
> > > [    0.756390]  ? __pci_enable_msix_range+0x1a8/0x1d0
> > > [    0.756390]  ? pci_alloc_irq_vectors_affinity+0x7c/0xf0
> > > [    0.756390]  ? vp_find_vqs_msix+0x187/0x400
> > > [    0.756390]  ? vp_find_vqs+0x2f/0x250
> > > [    0.756390]  ? snprintf+0x3e/0x50
> > > [    0.756390]  ? vp_modern_find_vqs+0x13/0x60
> > > [    0.756390]  ? init_vq+0x184/0x1e0
> > > [    0.756390]  ? vp_get_status+0x20/0x20
> > > [    0.756390]  ? virtblk_probe+0xeb/0x8d0
> > > [    0.756390]  ? __kernfs_new_node+0x122/0x160
> > > [    0.756390]  ? vp_get_status+0x20/0x20
> > > [    0.756390]  ? virtio_dev_probe+0x171/0x1c0
> > > [    0.756390]  ? really_probe+0xc2/0x240
> > > [    0.756390]  ? driver_probe_device+0x1d/0x70
> > > [    0.756390]  ? __driver_attach+0x96/0xe0
> > > [    0.756390]  ? driver_attach+0x20/0x20
> > > [    0.756390]  ? bus_for_each_dev+0x7b/0xb0
> > > [    0.756390]  ? bus_add_driver+0xe6/0x200
> > > [    0.756390]  ? driver_register+0x5e/0xf0
> > > [    0.756390]  ? virtio_blk_init+0x4d/0x90
> > > [    0.756390]  ? add_boot_memory_block+0x90/0x90
> > > [    0.756390]  ? do_one_initcall+0xe2/0x250
> > > [    0.756390]  ? xas_store+0x4b/0x4b0
> > > [    0.756390]  ? number+0x13b/0x260
> > > [    0.756390]  ? ida_alloc_range+0x36a/0x3b0
> > > [    0.756390]  ? parameq+0x13/0x90
> > > [    0.756390]  ? parse_args+0x10f/0x2a0
> > > [    0.756390]  ? do_initcall_level+0x83/0xb0
> > > [    0.756390]  ? do_initcalls+0x43/0x70
> > > [    0.756390]  ? rest_init+0x80/0x80
> > > [    0.756390]  ? kernel_init_freeable+0x70/0xb0
> > > [    0.756390]  ? kernel_init+0x16/0x110
> > > [    0.756390]  ? ret_from_fork+0x30/0x40
> > > [    0.756390]  ? rest_init+0x80/0x80
> > > [    0.756390]  ? ret_from_fork_asm+0x11/0x20
> > > [    0.756390]  </TASK>
> > > [    0.756390] Modules linked in:
> > > [    0.756390] CR2: 0000000000000002
> > > [    0.756390] ---[ end trace 0000000000000000 ]---
> > > [    0.756390] RIP: 0010:msix_prepare_msi_desc+0x39/0x80
> > > [    0.756390] Code: 20 c7 46 04 01 00 00 00 8b 56 4c 89 d0 0d 01 01 00 00 66 89 46 4c 8b 8f 64 02 00 00 89 4e 50 48 8b 8f 70 06 00 00 48 89 4e 58 <41> f6 40 02 40 75 2a c1 ea 02 bf 80 00 00 00 21 fa 25 7f ff ff ff
> > > [    0.756390] RSP: 0000:ffff8881002a76e0 EFLAGS: 00010202
> > > [    0.756390] RAX: 0000000000000101 RBX: ffff88810074d000 RCX: ffffc9000002e000
> > > [    0.756390] RDX: 0000000000000000 RSI: ffff8881002a7710 RDI: ffff88810074d000
> > > [    0.756390] RBP: ffff8881002a7710 R08: 0000000000000000 R09: ffff8881002a76b4
> > > [    0.756390] R10: 000000701000c001 R11: ffffffff82a3dc01 R12: 0000000000000000
> > > [    0.756390] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000002
> > > [    0.756390] FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> > > [    0.756390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [    0.756390] CR2: 0000000000000002 CR3: 0000000002a3d001 CR4: 00000000003706b0
> > > [    0.756390] note: swapper[1] exited with irqs disabled
> > > [    0.782774] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> > > [    0.783560] Kernel Offset: disabled
> > > [    0.783909] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
> > > 
> > > 
> > > msix_prepare_msi_desc+0x39/0x80:
> > > msix_prepare_msi_desc at drivers/pci/msi/msi.c:616
> > >  611            desc->nvec_used                         = 1;
> > >  612            desc->pci.msi_attrib.is_msix            = 1;
> > >  613            desc->pci.msi_attrib.is_64              = 1;
> > >  614            desc->pci.msi_attrib.default_irq        = dev->irq;
> > >  615            desc->pci.mask_base                     = dev->msix_base;
> > > >616<           desc->pci.msi_attrib.can_mask           = !(info->flags & MSI_FLAG_NO_MASK) &&
> > >  617                                                      !desc->pci.msi_attrib.is_virtual;
> > >  618
> > >  619            if (desc->pci.msi_attrib.can_mask) {
> > >  620                    void __iomem *addr = pci_msix_desc_addr(desc);
> > >  621
> > > 
> > > Reverting patch 3 fixes the issue.
> > 
> > Thanks for the report and sorry for the breakage.  Do you have a QEMU
> > command line I can use to try to reproduce this locally?
> > 
> > Will work on a patch ASAP.
> 
> Thanks for the quick reply.
> 
> The issue is that info appears to be uninitialized. So, this worked for me:

Indeed, irq_domain->host_data is NULL, so there's no msi_domain_info.
As this is x86, I was expecting x86 to always use
x86_init_dev_msi_info(), but that doesn't seem to be the case.  I
would like to better understand this.

> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index dcbb4f9ac578..b76c7ec33602 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -609,8 +609,10 @@ void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
>         desc->pci.msi_attrib.is_64              = 1;
>         desc->pci.msi_attrib.default_irq        = dev->irq;
>         desc->pci.mask_base                     = dev->msix_base;
> -       desc->pci.msi_attrib.can_mask           = !(info->flags & MSI_FLAG_NO_MASK) &&
> -                                                 !desc->pci.msi_attrib.is_virtual;
> +       desc->pci.msi_attrib.can_mask =
> +               info ? !(info->flags & MSI_FLAG_NO_MASK) &&
> +                               !desc->pci.msi_attrib.is_virtual :
> +                      1;
> 
>         if (desc->pci.msi_attrib.can_mask) {
>                 void __iomem *addr = pci_msix_desc_addr(desc);
> @@ -743,7 +745,7 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
>         /* Disable INTX */
>         pci_intx_for_msi(dev, 0);
> 
> -       if (!(info->flags & MSI_FLAG_NO_MASK)) {
> +       if (info && !(info->flags & MSI_FLAG_NO_MASK)) {

I think this should rather be:

if (!info || !(info->flags & MSI_FLAG_NO_MASK)) {

So that in case of no info the default action is to mask the entries.
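
To make the difference between the two conditions concrete, here is a
minimal standalone sketch.  It is not kernel code: the struct is a
cut-down stand-in for struct msi_domain_info, the flag value is a
placeholder rather than the real MSI_FLAG_NO_MASK definition, and the
two helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit for illustration only; the real flag is defined by the kernel. */
#define MSI_FLAG_NO_MASK (1u << 0)

/* Cut-down stand-in for struct msi_domain_info: only the field used here. */
struct msi_domain_info {
	unsigned int flags;
};

/* Condition from the quoted diff: a NULL info means "do not mask". */
bool should_mask_and(const struct msi_domain_info *info)
{
	return info && !(info->flags & MSI_FLAG_NO_MASK);
}

/* Condition suggested above: a NULL info falls back to "mask". */
bool should_mask_or(const struct msi_domain_info *info)
{
	return !info || !(info->flags & MSI_FLAG_NO_MASK);
}
```

Both helpers agree whenever info is non-NULL; they differ only for a
NULL info, where the first skips masking and the second keeps the
previous default of masking all table entries.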

>                 /*
>                  * Ensure that all table entries are masked to prevent
>                  * stale entries from firing in a crash kernel.
> 
> I also noticed that d (struct irq_domain) can be NULL if
> CONFIG_GENERIC_MSI_IRQ is not set, and we are not checking for that either.
> 
> I run QEMU with vmctl [1]. This is my command:
> 
> [1] https://github.com/SamsungDS/vmctl
> 
> /usr/bin/qemu-system-x86_64 \
>   -nodefaults \
>   -display "none" \
>   -machine "q35,accel=kvm,kernel-irqchip=split" \
>   -cpu "host" \
>   -smp "4" \
>   -m "8G" \
>   -device "intel-iommu,intremap=on" \
>   -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
>   -device "virtio-net-pci,netdev=net0" \
>   -device "virtio-rng-pci" \
>   -drive "id=boot,file=file.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
>   -device "pcie-root-port,id=pcie_root_port0,chassis=1,slot=0" \
>   -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port0,mdts=7" \
>   -drive "id=nvm,file=~/nvm.img,format=raw,if=none,discard=unmap,media=disk,read-only=no" \
>   -device "nvme-ns,id=nvm,drive=nvm,bus=nvme0,nsid=1,logical_block_size=4096,physical_block_size=4096" \
>   -pidfile "~/vmctl/confdir/run/nvme/pidfile" \
>   -kernel "~/src/kernel/linux/arch/x86_64/boot/bzImage" \
>   -append "root=/dev/vda1 console=ttyS0,115200 audit=0" \
>   -virtfs "local,path=~/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
>   -serial "mon:stdio" \
>   -d "guest_errors" \
>   -D "~/vmctl/confdir/log/nvme/qemu.log"

Can you narrow down the command line to the minimum required to
reproduce the issue?

Can you attach the Kconfig used to build the crashing kernel?

Thanks, Roger.