Message-ID: <65d4d7e0-4d90-48d7-8e4a-d16800df148a@arm.com>
Date: Mon, 15 Apr 2024 22:44:34 +0100
From: Robin Murphy <robin.murphy@....com>
To: Eric Wagner <ewagner12@...il.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
iommu@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

On 2024-04-15 7:57 pm, Eric Wagner wrote:
> Apologies if I made a mistake in the first bisect, I'm new to kernel
> debugging.
>
> I tested cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and
> 3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly and both were good.
> Then I ran git bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
> as the bad and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good and the
> bisect log is attached. It ended up at the same commit as before.
>
> I've also attached a picture of the boot screen that occurs when it hangs.
> 0000:05:00.0 is the PCIe bus address of the RX 580 eGPU that's causing the
> problem.

Looks like 59ddce4418da483 probably broke things most - prior to that,
the fact that it's behind a Thunderbolt port would have always taken
precedence and forced IOMMU_DOMAIN_DMA regardless of what the driver may
have wanted to say, whereas now we ask the driver first, then complain
that it conflicts with the untrusted status and ultimately don't
configure the IOMMU at all. Meanwhile the GPU driver presumably goes on
to believe it's using dma-direct with no IOMMU present, resulting in
fireworks when its traffic reaches the IOMMU. Great :(
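
To make the change in ordering concrete, here is a minimal userspace sketch of the two decision flows described above. It is a simplified model, not the kernel code: the enum values and the function names pick_type_old() and pick_type_new() are hypothetical stand-ins, and the real logic in iommu_get_default_domain_type() handles more cases than this.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's domain types (illustration only). */
enum domain_type {
	DOMAIN_NONE = 0,	/* driver expressed no preference */
	DOMAIN_IDENTITY,	/* driver asked for an identity (passthrough) domain */
	DOMAIN_DMA,		/* translated DMA domain */
};

/*
 * Model of the older behaviour as described: an untrusted device
 * (e.g. one behind a Thunderbolt port) always gets a DMA domain,
 * whatever the driver asked for.
 */
static int pick_type_old(bool untrusted, enum domain_type driver_type)
{
	if (untrusted)
		return DOMAIN_DMA;
	return driver_type;
}

/*
 * Model of the newer behaviour as described: the driver is asked first,
 * and a preference that conflicts with untrusted status is an error (-1),
 * which leaves the IOMMU unconfigured for the device.
 */
static int pick_type_new(bool untrusted, enum domain_type driver_type)
{
	if (untrusted) {
		if (driver_type != DOMAIN_NONE && driver_type != DOMAIN_DMA)
			return -1;
		return DOMAIN_DMA;
	}
	return driver_type;
}
```

In this model, an untrusted device whose driver requests an identity domain gets DOMAIN_DMA from pick_type_old() but -1 from pick_type_new(), which corresponds to the failure mode described above.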

However the other notable thing that also happened between 6.6 and 6.7
was the removal of the AMD iommu_v2 code, so there's some possibility
that the GPU driver still may have only been working before due to that
also subverting the default domain with its own identity domain, so
whether it would actually work again with
iommu_get_default_domain_type() sorted out is yet another question... As
a first step I'd test the quick hack below, but be prepared for things
to still break slightly differently.

Cheers,
Robin.
----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 996e79dc582d..063e1eb32fbd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1774,7 +1774,7 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 				untrusted,
 				"Device is not trusted, but driver is overriding group %u to %s, refusing to probe.\n",
 				group->id, iommu_domain_type_str(driver_type));
-			return -1;
+			//return -1;
 		}
 		driver_type = IOMMU_DOMAIN_DMA;
 	}