linux-kernel - amdgpu - BUG: kernel NULL pointer dereference, address: 0000000000000000


Message-ID: <a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com>
Date:   Wed, 29 Jun 2022 02:01:26 -0500
From:   "David C. Rankin" <drankinatty@...denlinkmail.com>
To:     kernel <linux-kernel@...r.kernel.org>
Subject: amdgpu - BUG: kernel NULL pointer dereference, address:
 0000000000000000

All,

   There appears to be a bug (possibly a regression) in the amdgpu driver
that results in a fatal error during GPU init. This began with the 5.17 kernel
and is still present in the current 5.18 kernel. The consequences of the NULL
pointer dereference also seem to be getting worse: it now causes the machine
to hang at the end of the shutdown procedure (tough for boxes that are
administered remotely).

I have two servers with old AMD cards that have this exact problem. lspci -v 
(as user) reports the card as:

01:00.1 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV370 [Radeon X300 SE]
         Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0f03
         Flags: fast devsel, NUMA node 0
         Memory at fea20000 (32-bit, non-prefetchable) [size=64K]
         Capabilities: <access denied>
         Kernel modules: amdgpu

The host is:

Host: valkyrie Kernel: 5.18.7-arch1-1 arch: x86_64 bits: 64 compiler: gcc
     v: 12.1.0 parameters: BOOT_IMAGE=/vmlinuz-linux
     root=UUID=515ef9dc-769f-4548-9a08-3a92fa83d86b rw iommu=soft
     amd_iommu_dump= quiet audit=0
   Console: pty pts/0 DM: LightDM v: 1.30.0 Distro: Arch Linux

Machine:
   Type: Desktop Mobo: Gigabyte model: 990FXA-UD3 v: x.x serial: N/A
     BIOS: American Megatrends v: F3 date: 05/28/2015

Memory:
   RAM: total: 31.31 GiB used: 1012.9 MiB (3.2%)

CPU:
   Info: model: AMD FX-8350 socket: AM3 bits: 64 type: MT MCP arch: Piledriver
     built: 2012-13 process: GF 32nm family: 0x15 (21) model-id: 2 stepping: 0
     microcode: 0x6000852

Graphics:
   Device-1: AMD RV370 [Radeon X300] driver: radeon v: kernel
     alternate: amdgpu arch: Rage 9 code: R360-R400 process: TSMC 110nm
     built: 2003-08 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 ports:
     active: DVI-I-1 empty: SVIDEO-1 bus-ID: 01:00.0 chip-ID: 1002:5b60
     class-ID: 0300

The NULL pointer dereference occurs during GPU init of the card. These cards
are fanless and were specifically chosen for that reason. They are used in
server installs and have been flawless for years. If it were just one card
acting up, I could see it being a card problem, but I have two identical
servers set up with this card and both show the exact same "BUG: kernel NULL
pointer dereference":

[    9.660937] [drm] amdgpu kernel modesetting enabled.
[    9.661025] amdgpu: CRAT table not found
[    9.661028] amdgpu: Virtual CRAT table created for CPU
[    9.661040] amdgpu: Topology: Add CPU node
[    9.661296] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x5B70 0x1002:0x0F03 0x00).
[    9.661302] amdgpu 0000:01:00.1: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[    9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[    9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000

Full dmesg output for this with backtrace is attached.
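
For anyone comparing their own logs against the excerpt above, here is a minimal sketch (not part of the attached dmesg) that pulls the driver probe failure out of dmesg-style text and maps the error number to its errno name. The function name find_probe_failures and the regular expression are illustrative assumptions; the sample lines are copied from the excerpt, and the only library calls used are Python's standard re and errno modules.

```python
import errno
import re

# Sample lines copied verbatim from the dmesg excerpt in this report.
LOG = """\
[    9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[    9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[    9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000
"""

# Hypothetical pattern for "<driver>: probe of <device> failed with error <n>".
PROBE_RE = re.compile(
    r"^\[\s*\d+\.\d+\] (?P<drv>\S+): probe of (?P<dev>\S+) "
    r"failed with error (?P<err>-?\d+)$"
)


def find_probe_failures(text):
    """Return (driver, device, errno number, errno name) for each probe failure."""
    failures = []
    for line in text.splitlines():
        match = PROBE_RE.match(line)
        if match is None:
            continue
        # The kernel reports errors as negative errno values, so drop the sign.
        code = abs(int(match.group("err")))
        name = errno.errorcode.get(code, "UNKNOWN")
        failures.append((match.group("drv"), match.group("dev"), code, name))
    return failures


if __name__ == "__main__":
    for drv, dev, code, name in find_probe_failures(LOG):
        print(f"{drv} failed to probe {dev}: error {code} ({name})")
```

On Linux, error 12 is ENOMEM, so the probe appears to fail with an out-of-memory style error just before the NULL pointer dereference is logged.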

Bugs related to this problem are open with freedesktop.org and with Arch Linux:

https://gitlab.freedesktop.org/drm/amd/-/issues/2070

and

https://bugs.archlinux.org/task/74346#comment209209

Are those the proper locations for the bug report or does a kernel bug also 
need to be opened to track the issue? Let me know there and let me know if you 
need any further information from the machines and I'm happy to get it.

-- 
David C. Rankin, J.D.,P.E.
View attachment "amdgpu_dmesg_NULL-pointer.txt" of type "text/plain" (6346 bytes)