linux-kernel - Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

Message-ID: <65d4d7e0-4d90-48d7-8e4a-d16800df148a@arm.com>
Date: Mon, 15 Apr 2024 22:44:34 +0100
From: Robin Murphy <robin.murphy@....com>
To: Eric Wagner <ewagner12@...il.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
 Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
 iommu@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

On 2024-04-15 7:57 pm, Eric Wagner wrote:
> Apologies if I made a mistake in the first bisect, I'm new to kernel
> debugging.
> 
> I tested cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and
> 3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly and both were good.
> Then I ran git bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
> as the bad and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good and the
> bisect log is attached. It ended up at the same commit as before.
> 
> I've also attached a picture of the boot screen that occurs when it hangs.
> 0000:05:00.0 is the PCIe bus address of the RX 580 eGPU that's causing the
> problem.

Looks like 59ddce4418da483 probably broke things most - prior to that, 
the fact that it's behind a Thunderbolt port would have always taken 
precedence and forced IOMMU_DOMAIN_DMA regardless of what the driver may 
have wanted to say, whereas now we ask the driver first, then complain 
that it conflicts with the untrusted status and ultimately don't 
configure the IOMMU at all. Meanwhile the GPU driver presumably goes on 
to believe it's using dma-direct with no IOMMU present, resulting in 
fireworks when its traffic reaches the IOMMU. Great :(
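
To make the before/after difference concrete, here is a rough userspace
model of the decision described above. It is only a sketch, not the
kernel code: the enum values and the function names pick_old() and
pick_new() are made up for illustration, and the real
iommu_get_default_domain_type() walks every device in the group rather
than taking two arguments.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's domain types (illustrative values). */
enum dom_type { DOM_UNSET = 0, DOM_IDENTITY, DOM_DMA };

/*
 * Assumed model of the older behaviour: an untrusted device (e.g. one
 * behind a Thunderbolt port) always gets a DMA domain, whatever the
 * driver asked for.
 */
int pick_old(bool untrusted, enum dom_type driver_type)
{
	if (untrusted)
		return DOM_DMA;
	return driver_type;
}

/*
 * Assumed model of the newer behaviour: the driver's preference is
 * taken first, and a preference that conflicts with the untrusted
 * status is reported as an error instead of being overridden.
 */
int pick_new(bool untrusted, enum dom_type driver_type)
{
	if (untrusted) {
		if (driver_type != DOM_UNSET && driver_type != DOM_DMA)
			return -1;	/* conflict: the IOMMU is left unconfigured */
		driver_type = DOM_DMA;
	}
	return driver_type;
}
```

For an untrusted device whose driver asks for an identity domain,
pick_old(true, DOM_IDENTITY) yields DOM_DMA, whereas
pick_new(true, DOM_IDENTITY) yields -1, which corresponds to the
"refusing to probe" path.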

However, the other notable thing that happened between 6.6 and 6.7 was
the removal of the AMD iommu_v2 code, so it's possible the GPU driver
was only working before because that code also subverted the default
domain with its own identity domain. Whether it would actually work
again with iommu_get_default_domain_type() sorted out is therefore yet
another question... As a first step I'd test the quick hack below, but
be prepared for things to still break slightly differently.

Cheers,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 996e79dc582d..063e1eb32fbd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1774,7 +1774,7 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
  				untrusted,
  				"Device is not trusted, but driver is overriding group %u to %s, refusing to probe.\n",
  				group->id, iommu_domain_type_str(driver_type));
-			return -1;
+			//return -1;
  		}
  		driver_type = IOMMU_DOMAIN_DMA;
  	}