linux-kernel - amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CABXGCsMbqw2qzWSCDfp3cNrYVJ1oxLv8Aixfm_Dt91x1cvFX4w@mail.gmail.com>
Date:   Fri, 24 Feb 2023 04:40:49 +0500
From:   Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To:     amd-gfx list <amd-gfx@...ts.freedesktop.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        "Deucher, Alexander" <Alexander.Deucher@....com>,
        Christian König <ckoenig.leichtzumerken@...il.com>
Subject: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error
 during GPU init"

Hi,
I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
it is impossible to use without AC power because the system losts nvme
when I disconnect the power adapter.

Messages from kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
and report a bug

I tried to use recommended parameters
(nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
this issue, but without successed.

In the linux-nvme mail list the last advice was to try the "pci=nocrs"
parameter.

But with this parameter the amdgpu driver refuses to work and makes
the system unbootable. I can solve the problem with the booting system
by blacklisting the driver but it is not a good solution, because I
don't wanna lose the GPU.

Why amdgpu not work with "pci=nocrs" ?
And is it possible to solve this incompatibility?
It is very important because when I boot the system without amdgpu
driver with "pci=nocrs" nvme is not losts when I disconnect the power
adapter. So "pci=nocrs" really helps.

Below that I see in kernel log when adds "pci=nocrs" parameter:

amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
size is 9-bit
amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 -
0x00000082FEFFFFFF (12272M used)
amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 -
0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
<gmc_v10_0> failed -12
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

-- 
Best Regards,
Mike Gavrilov.

Download attachment "system-log-Fatal-error-during-GPU-init.tar.xz" of type "application/x-xz" (40988 bytes)