Message-ID: <8bce512e-abb6-495d-85a4-63648229859e@gmail.com>
Date: Fri, 15 Dec 2023 13:37:35 +0100
From: Christian König <ckoenig.leichtzumerken@...il.com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
Cc: amd-gfx list <amd-gfx@...ts.freedesktop.org>,
dri-devel <dri-devel@...ts.freedesktop.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
"Deucher, Alexander" <Alexander.Deucher@....com>
Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal
error during GPU init"
Am 15.12.23 um 12:45 schrieb Mikhail Gavrilov:
> On Tue, Feb 28, 2023 at 5:43 PM Christian König
> <ckoenig.leichtzumerken@...il.com> wrote:
>> The point is that it doesn't need to talk to the amdgpu hardware. What
>> it does is talk to the good old VGA/VESA emulation, which just happens
>> to still be enabled by the BIOS/GRUB.
>>
>> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
>> hw running in the state where it was initialized before the kernel
>> started. The kernel just grabs the addresses where it needs to write the
>> display data and keeps going with that.
>>
>> But when a hw-specific driver wants to load, this is the first thing
>> that gets disabled, because we need to load new firmware. And with the
>> BARs disabled this can't be re-enabled without rebooting the system.
>>
>>> My suggestion is that if
>>> amdgpu fails to talk to the hardware, another suitable driver should
>>> be allowed to do it. I attached a system log from applying "pci=nocrs"
>>> together with "modprobe.blacklist=amdgpu" to show that graphics work
>>> correctly in this case.
>>> To do this, does the Linux module loading mechanism need to be refined?
>> That's actually working as expected. The real problem is that the BIOS
>> on that system is so broken that we can't access the hw correctly.
>>
>> What we could do is check the BARs very early on and refuse to load
>> when they are disabled. The problem with this approach is that there
>> are systems where the BARs are normally disabled until the driver
>> loads and only get enabled during the hardware initialization process.
>>
>> What you might want to look into is to find a quirk for the BIOS to
>> properly enable the nvme controller.
>>
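(Purely as an illustration of the early BAR check mentioned above, and not
actual amdgpu code: such a check in a PCI driver's probe path could look
roughly like the sketch below. The helper name and the exact BAR index are
made up here.)

#include <linux/pci.h>

/*
 * Hypothetical sketch: refuse to probe when the given memory BAR was
 * never assigned by the firmware/PCI core. As noted above, this would
 * wrongly reject systems where the BARs are legitimately enabled later
 * by the driver itself during hardware init.
 */
static int example_check_bar(struct pci_dev *pdev, int bar)
{
	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM) ||
	    pci_resource_len(pdev, bar) == 0) {
		dev_err(&pdev->dev, "BAR%d not assigned, refusing to load\n", bar);
		return -ENODEV;
	}
	return 0;
}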
> That's interesting. I noticed that amdgpu now works even with the
> [pci=nocrs] parameter on 6.7.0-0.rc4 and later kernels.
> Does that mean the BARs became available?
> I attached the kernel log and lspci output here. What's changed?
I have no idea :)
From the logs I can see that the AMDGPU now has the proper BARs assigned:
[ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000
[ 5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
[ 5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref]
[ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
[ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
[ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
And with that the driver can work perfectly fine.
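(For reference, those bandwidth figures are just the per-lane signalling
rate corrected for 128b/130b encoding and multiplied by the lane count:

   8.0 GT/s x 128/130 ≈  7.876 Gb/s per lane, x  8 lanes ≈  63.008 Gb/s
  16.0 GT/s x 128/130 ≈ 15.753 Gb/s per lane, x 16 lanes ≈ 252.048 Gb/s

so the card is currently limited by the x8 8.0 GT/s link at 0000:00:01.1.)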
Have you updated the BIOS or added/removed some other hardware? Maybe
somebody added a quirk for your BIOS into the PCIe code or something
like that.
Regards,
Christian.