linux-kernel - Keyword Review - Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

Message-ID: <57e38bdd-8369-adb7-f095-26652d4ad8d5@amd.com>
Date:   Fri, 24 Feb 2023 08:12:22 +0100
From:   Christian König <christian.koenig@....com>
To:     Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>,
        amd-gfx list <amd-gfx@...ts.freedesktop.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        "Deucher, Alexander" <Alexander.Deucher@....com>
Subject: Keyword Review - Re: amdgpu didn't start with pci=nocrs parameter,
 get error "Fatal error during GPU init"

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and 
not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows 
the BIOS reports through ACPI. This then most likely leads to problems 
with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help 
understand the problem, but I strongly suggest trying a BIOS update first.
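
As a complement to the lspci output, the address ranges the kernel actually assigned to the device can also be read from sysfs. The following is only a rough sketch, not part of the driver or any official tool: it assumes the usual layout of the sysfs "resource" file (one line per region with three hex fields: start, end, flags), the function names are made up for illustration, and the bus ID 0000:03:00.0 is simply taken from the log quoted below.

```python
from pathlib import Path


def parse_pci_resources(text):
    """Parse the text of a sysfs PCI 'resource' file.

    Assumes each line holds three hex fields: start, end, flags.
    Returns a list of (index, start, end, flags) tuples.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that does not match the assumed layout
        start, end, flags = (int(field, 16) for field in fields)
        regions.append((index, start, end, flags))
    return regions


def region_size(region):
    """Size in bytes of one parsed region, or 0 for an all-zero entry."""
    _, start, end, flags = region
    if start == 0 and end == 0 and flags == 0:
        return 0
    return end - start + 1


def read_pci_resources(bus_id="0000:03:00.0"):
    """Read and parse the resource file of one device (bus ID is an example)."""
    path = Path("/sys/bus/pci/devices") / bus_id / "resource"
    return parse_pci_resources(path.read_text())
```

Calling read_pci_resources() once with and once without pci=nocrs and comparing region_size() for each entry should show whether the large VRAM BAR still gets an address range in both cases.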

Regards,
Christian.

On 24.02.23 at 00:40, Mikhail Gavrilov wrote:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use without AC power because the system loses the
> NVMe drive when I disconnect the power adapter.
>
> Messages from kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried to use the recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without success.
>
> On the linux-nvme mailing list the last advice was to try the
> "pci=nocrs" parameter.
>
> But with this parameter the amdgpu driver refuses to work and makes
> the system unbootable. I can work around the boot problem by
> blacklisting the driver, but that is not a good solution because I
> don't want to lose the GPU.
>
> Why does amdgpu not work with "pci=nocrs"?
> And is it possible to solve this incompatibility?
> It is very important because when I boot the system without the amdgpu
> driver and with "pci=nocrs", the NVMe drive is not lost when I
> disconnect the power adapter. So "pci=nocrs" really helps.
>
> Below is what I see in the kernel log when I add the "pci=nocrs" parameter:
>
> amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
> amdgpu: ATOM BIOS: SWBRT77321.001
> [drm] VCN(0) decode is enabled in VM mode
> [drm] VCN(0) encode is enabled in VM mode
> [drm] JPEG decode is enabled in VM mode
> Console: switching to colour dummy device 80x25
> amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
> disabled as experimental (default)
> [drm] GPU posting now...
> [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
> size is 9-bit
> amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 -
> 0x00000082FEFFFFFF (12272M used)
> amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
> amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 -
> 0x0000FFFFFFFFFFFF
> [drm] Detected VRAM RAM=12272M, BAR=16384M
> [drm] RAM width 192bits GDDR6
> [drm] amdgpu: 12272M of VRAM memory ready
> [drm] amdgpu: 31774M of GTT memory ready.
> amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
> [drm] Debug VRAM access will use slowpath MM access
> amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
> [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
> <gmc_v10_0> failed -12
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
>
> Of course a full system log is also attached.
>