Message-Id: <1603684905.h43s1t0y05.none@localhost>
Date:   Mon, 26 Oct 2020 00:29:00 -0400
From:   "Alex Xu (Hello71)" <alex_y_xu@...oo.ca>
To:     Nicholas Kazlauskas <nicholas.kazlauskas@....com>,
        alexander.deucher@....com, Harry Wentland <harry.wentland@....com>,
        Leo Li <sunpeng.li@....com>, amd-gfx@...ts.freedesktop.org
Cc:     linux-kernel@...r.kernel.org
Subject: amdgpu crashes on OOM

Hi,

I frequently hit OOM on my system, mostly through my own fault. 
Recently, I noticed that in addition to the swap storm and the OOM 
killer being invoked, the graphics output freezes permanently. 
Checking the kernel messages, I see:

kworker/u24:4: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: G        W         5.9.0-14732-g20b1adb60cf6 #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P4.20 06/18/2020
Workqueue: events_unbound commit_work
Call Trace:
 ? dump_stack+0x57/0x6a
 ? warn_alloc.cold+0x69/0xcd
 ? __alloc_pages_direct_compact+0xfb/0x116
 ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
 ? __alloc_pages_nodemask+0x143/0x167
 ? kmalloc_order+0x24/0x64
 ? dc_create_state+0x1a/0x4d
 ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d

followed by:

WARNING: CPU: 6 PID: 279469 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7511 amdgpu_dm_atomic_commit_tail+0x217c/0x227d

followed by:

BUG: unable to handle page fault for address: 0000000000012480
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
[ ... ]
RIP: 0010:dc_resource_state_copy_construct+0x10/0x455
[ ... ]
Call Trace:
 ? amdgpu_dm_atomic_commit_tail+0x2193/0x227

This area of code is quite odd:

dc_state_temp = dc_create_state(dm->dc);
ASSERT(dc_state_temp);
dc_state = dc_state_temp;
dc_resource_state_copy_construct_current(dm->dc, dc_state);

This ASSERT macro is misleading: unless CONFIG_DEBUG_KERNEL_DC is set, 
it is actually WARN_ON_ONCE(!(expr)). Therefore, this code fails to 
allocate memory (causing a warning to be printed), prints another 
warning that the allocation failed, then goes on to dereference the 
resulting NULL pointer, crashing the thread (and the kernel if 
panic_on_oops is set).
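
To illustrate the pattern, here is a minimal userspace sketch, not the 
actual driver code. The names create_state(), commit_checked() and the 
simulate_oom flag are hypothetical, and the ASSERT below is only a 
stand-in that mimics the warn-and-continue behaviour described above. 
It shows that an explicit NULL check after the warning is what keeps 
the code from touching the failed allocation:

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a warn-only ASSERT: reports the failure, then keeps going. */
static int warn_count;
#define ASSERT(expr)                                        \
	do {                                                \
		if (!(expr)) {                              \
			warn_count++;                       \
			fprintf(stderr, "WARNING: %s\n", #expr); \
		}                                           \
	} while (0)

/* Hypothetical state object; the real dc_state is far larger. */
struct state {
	int stream_count;
};

/* Hypothetical allocator that can be forced to fail, simulating OOM. */
static struct state *create_state(int simulate_oom)
{
	if (simulate_oom)
		return NULL;
	return calloc(1, sizeof(struct state));
}

/* Returns 0 on success, -1 if the allocation failed. */
static int commit_checked(int simulate_oom)
{
	struct state *st = create_state(simulate_oom);

	ASSERT(st);        /* warns only; execution continues past this line */
	if (!st)
		return -1; /* explicit check: never touch the NULL pointer */

	st->stream_count = 1; /* safe: st is known to be non-NULL here */
	free(st);
	return 0;
}
```

Calling commit_checked(1) prints one warning and returns -1 instead of 
faulting. Since the real function sits in the commit path and cannot 
simply return an error, how the driver should actually recover is a 
separate question that this sketch does not answer.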

While I am not by any means a graphics or kernel expert, it seems to me 
like there should be a better solution than crashing. If nothing else, 
the OOM killer should be invoked and the operation retried. We may lose 
some frames or see some corruption, but that's far better than totally 
breaking.

Thanks,
Alex.
