Message-ID: <BN9PR11MB527633EBB159880EE64696868CBC2@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Thu, 17 Apr 2025 03:13:43 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Baolu Lu <baolu.lu@...ux.intel.com>, Joerg Roedel <joro@...tes.org>,
"Will Deacon" <will@...nel.org>, Robin Murphy <robin.murphy@....com>,
Jarkko Nikula <jarkko.nikula@...ux.intel.com>
CC: "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 1/1] iommu/vt-d: Revert ATS timing change to fix boot failure
> From: Baolu Lu <baolu.lu@...ux.intel.com>
> Sent: Thursday, April 17, 2025 10:46 AM
>
> On 4/17/25 10:23, Tian, Kevin wrote:
> >> From: Lu Baolu <baolu.lu@...ux.intel.com>
> >> Sent: Wednesday, April 16, 2025 3:36 PM
> >>
> >> Commit <5518f239aff1> ("iommu/vt-d: Move scalable mode ATS enablement to
> >> probe path") changed the PCI ATS enablement logic to run earlier,
> >> specifically before the default domain attachment.
> >>
> >> On some client platforms, this change resulted in boot failures, causing
> >> the kernel to panic with the following message and call trace:
> >>
> >> Kernel panic - not syncing: DMAR hardware is malfunctioning
> >> CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #175
> >> Call Trace:
> >> <TASK>
> >> dump_stack_lvl+0x6f/0xb0
> >> dump_stack+0x10/0x16
> >> panic+0x10a/0x2b7
> >> iommu_enable_translation.cold+0xc/0xc
> >> intel_iommu_init+0xe39/0xec0
> >> ? trace_hardirqs_on+0x1e/0xd0
> >> ? __pfx_pci_iommu_init+0x10/0x10
> >> pci_iommu_init+0xd/0x40
> >> do_one_initcall+0x5b/0x390
> >> kernel_init_freeable+0x26d/0x2b0
> >> ? __pfx_kernel_init+0x10/0x10
> >> kernel_init+0x15/0x120
> >> ret_from_fork+0x35/0x60
> >> ? __pfx_kernel_init+0x10/0x10
> >> ret_from_fork_asm+0x1a/0x30
> >> RIP: 1f0f:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX: 1f0f2e6600000000
> >> RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX: 2e66000000000084
> >> RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI: 00841f0f2e660000
> >> RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09: 000000841f0f2e66
> >> R10: 0000000000841f0f R11: 2e66000000000084 R12: 000000841f0f2e66
> >> R13: 0000000000841f0f R14: 2e66000000000084 R15: 1f0f2e6600000000
> >> </TASK>
> >> ---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
> >>
> >> Fix this by reverting the timing change for ATS enablement introduced by
> >> the offending commit and restoring the previous behavior.
> >>
> >
> > It's unclear how this timing change relates to the dumped stack. Are
> > there more details on how they are connected?
> >
>
> I'm not sure, but I'm trying to find a machine and get more information.
> Anyway, let's revert the change and remove the boot regression first.
>
I'm fine with this fix for the regression, but let's investigate further.
Reviewed-by: Kevin Tian <kevin.tian@...el.com>