linux-kernel - Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABXGCsNrnYZO6NfF624j0xrBkdF9vjZhcyF8iZrEr4eGcjpSCA@mail.gmail.com>
Date:   Sat, 9 Jul 2022 17:10:55 +0500
From:   Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To:     Christian König <ckoenig.leichtzumerken@...il.com>
Cc:     amd-gfx list <amd-gfx@...ts.freedesktop.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        tzimmermann@...e.de
Subject: Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU
 stopped entering in graphic mode.

On Thu, Jul 7, 2022 at 2:50 PM Christian König
<ckoenig.leichtzumerken@...il.com> wrote:
>
> Am 07.07.22 um 02:20 schrieb Mikhail Gavrilov:
> > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> > <mikhail.v.gavrilov@...il.com> wrote:
> > Christian can you look why
> > drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug
> > on my machine?
>
> That looks like a problem outside of the amdgpu driver.
>
> What happens is that during load amdgpu requests whatever driver
> (vesafb,vgafb or efifb) is currently handling the framebuffer to unload.
> This unload in turn now crashes for some reason.
>
> My best suggestion is to try to bisect this.

Hi Christian,
if you read my initial post. You should see that I tried to bisect the issue.
But it is very problematic because on each step I see different symptomes.
And if mark different symptoms with skip step we got at end lot of
possible commits:
Here is my bisect from initial post: https://pastebin.com/AhLMNfyv

If you want that I ended bisection successfully please help how to fix
this oops:
[    8.291177] page:00000000af2b6334 refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x102a000
[    8.291202] head:00000000af2b6334 order:0 compound_mapcount:-1226
compound_pincount:0
[    8.291221] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[    8.291239] raw: 0017ffffc0010000 fffffb35c0a80008 fffffb35c0a80008
0000000000000000
[    8.291257] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[    8.291275] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[    8.291298] ------------[ cut here ]------------
[    8.291309] kernel BUG at mm/page_alloc.c:1329!
[    8.291324] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
[    8.291333] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0
[    8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89
df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3
b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f
1f 44
[    8.291351] RSP: 0018:ffffb07c023ab9d8 EFLAGS: 00010296
[    8.291354] RAX: 000000000000004e RBX: fffffb35c0a80000 RCX: 0000000000000000
[    8.291358] RDX: 0000000000000001 RSI: ffffffffa789dbaf RDI: 00000000ffffffff
[    8.291361] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffb07c023ab7c0
[    8.291365] R10: 0000000000000003 R11: ffff92ee2e2fffe8 R12: 0000000000000000
[    8.291368] R13: ffff92ee2a55d180 R14: 00000000fffffe00 R15: fffffb35c0a80000
[    8.291371] FS:  00007f80aa398680(0000) GS:ffff92edda200000(0000)
knlGS:0000000000000000
[    8.291376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.291379] CR2: 00007f80aa38e616 CR3: 000000017d726000 CR4: 0000000000350ee0
[    8.291382] Call Trace:
[    8.291384]  <TASK>
[    8.291386]  ? find_held_lock+0x32/0x80
[    8.291391]  free_unref_page+0x25/0x2a0
[    8.291395]  __vunmap+0x261/0x3d0
[    8.291399]  drm_fbdev_cleanup+0x6b/0xc0
[    8.291403]  drm_fbdev_fb_destroy+0x15/0x30
[    8.291407]  unregister_framebuffer+0x2e/0x40
[    8.291411]  drm_client_dev_unregister+0x6e/0xe0
[    8.291416]  drm_dev_unregister+0x34/0x90
[    8.291419]  drm_dev_unplug+0x24/0x40
[    8.291422]  simpledrm_remove+0x11/0x20
[    8.291426]  platform_remove+0x1f/0x40
[    8.291429]  device_release_driver_internal+0x1b8/0x220
[    8.291433]  bus_remove_device+0xef/0x160
[    8.291437]  device_del+0x18c/0x3f0
[    8.291440]  platform_device_del.part.0+0x13/0x70
[    8.291444]  platform_device_unregister+0x1c/0x30
[    8.291447]  drm_aperture_detach_drivers+0xa3/0xd0
[    8.291452]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[    8.291457]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[    8.291599]  local_pci_probe+0x41/0x80
[    8.291604]  pci_device_probe+0xaa/0x200
[    8.291607]  really_probe+0x1a0/0x370
[    8.291611]  __driver_probe_device+0xfb/0x170
[    8.291615]  driver_probe_device+0x1f/0x90
[    8.291618]  __driver_attach+0xbe/0x1a0
[    8.291622]  ? __device_attach_driver+0xe0/0xe0
[    8.291625]  bus_for_each_dev+0x65/0x90
[    8.291629]  bus_add_driver+0x150/0x1f0
[    8.291632]  driver_register+0x89/0xd0
[    8.291636]  ? 0xffffffffc067b000
[    8.291641]  do_one_initcall+0x69/0x350
[    8.291645]  ? do_init_module+0x22/0x260
[    8.291650]  ? rcu_read_lock_sched_held+0x3b/0x70
[    8.291654]  ? trace_kmalloc+0x3b/0x100
[    8.291658]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[    8.291662]  do_init_module+0x4a/0x260
[    8.291666]  __do_sys_finit_module+0x93/0xf0
[    8.291673]  do_syscall_64+0x3a/0x80
[    8.291677]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    8.291681] RIP: 0033:0x7f80aaf4507d
[    8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 dd 0c 00 f7 d8 64 89
01 48
[    8.291694] RSP: 002b:00007ffe43973ce8 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    8.291699] RAX: ffffffffffffffda RBX: 000055ce603a3fe0 RCX: 00007f80aaf4507d
[    8.291702] RDX: 0000000000000000 RSI: 000055ce60395ac0 RDI: 0000000000000011
[    8.291706] RBP: 000055ce60395ac0 R08: 0000000000000000 R09: 00007f80ab013c80
[    8.291709] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000
[    8.291713] R13: 000055ce60387c30 R14: 0000000000000000 R15: 000055ce6038ede0
[    8.291718]  </TASK>
[    8.291719] Modules linked in: amdgpu(+) drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg
typec_ucsi nvme drm_buddy igb ccp ghash_clmulni_intel typec
drm_dp_helper sp5100_tco nvme_core dca wmi ip6_tables ip_tables
ipmi_devintf ipmi_msghandler fuse
[    8.291740] ---[ end trace 0000000000000000 ]---


> Thanks for reporting. This bug has been fixed in
>
>
> https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec
>
> The patch should reach mainline next week or so.

Hi Thomas,
thanks for the patch, this patch fixes oops
But this patch does not fix the initial issue when a lot of processes
blocked by mutex which applied by amdgpu_ctx_mgr_entity_flush
[  249.491425] INFO: task (brt-dbus):1634 blocked for more than 122 seconds.
[  249.491520]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.491526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.491529] task:(brt-dbus)      state:D stack:14504 pid: 1634
ppid:     1 flags:0x00000002
[  249.491541] Call Trace:
[  249.491545]  <TASK>
[  249.491556]  __schedule+0x492/0x1620
[  249.491565]  ? lock_is_held_type+0xe8/0x140
[  249.491575]  ? find_held_lock+0x32/0x80
[  249.491590]  schedule+0x4e/0xb0
[  249.491597]  schedule_preempt_disabled+0x14/0x20
[  249.491603]  __mutex_lock+0x423/0x890
[  249.491609]  ? __lock_acquire+0x387/0x1ee0
[  249.491618]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.491849]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492040]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492237]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.492420]  filp_close+0x31/0x70
[  249.492429]  __close_range+0x1f3/0x490
[  249.492441]  __x64_sys_close_range+0x13/0x20
[  249.492446]  do_syscall_64+0x5b/0x80
[  249.492452]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492459]  ? do_syscall_64+0x67/0x80
[  249.492467]  ? asm_exc_page_fault+0x27/0x30
[  249.492473]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492480]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.492486] RIP: 0033:0x7f789d4c23cb
[  249.492512] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.492518] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.492523] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.492527] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.492531] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.492534] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.492555]  </TASK>
[  249.492559] INFO: task (time-dir):1640 blocked for more than 122 seconds.
[  249.492564]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.492568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.492571] task:(time-dir)      state:D stack:14504 pid: 1640
ppid:     1 flags:0x00000002
[  249.492580] Call Trace:
[  249.492584]  <TASK>
[  249.492592]  __schedule+0x492/0x1620
[  249.492597]  ? lock_is_held_type+0xe8/0x140
[  249.492605]  ? find_held_lock+0x32/0x80
[  249.492620]  schedule+0x4e/0xb0
[  249.492627]  schedule_preempt_disabled+0x14/0x20
[  249.492632]  __mutex_lock+0x423/0x890
[  249.492638]  ? __lock_acquire+0x387/0x1ee0
[  249.492646]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492859]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493049]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493245]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.493428]  filp_close+0x31/0x70
[  249.493436]  __close_range+0x1f3/0x490
[  249.493447]  __x64_sys_close_range+0x13/0x20
[  249.493452]  do_syscall_64+0x5b/0x80
[  249.493457]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.493465]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.493470] RIP: 0033:0x7f789d4c23cb
[  249.493478] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.493484] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.493488] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.493492] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.493496] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.493499] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.493519]  </TASK>
[  249.493528]
               Showing all locks held in the system:
[  249.493537] 1 lock held by khungtaskd/182:
[  249.493542]  #0: ffffffffba168e20 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
[  249.493565] 1 lock held by systemd-journal/879:
[  249.493575] 3 locks held by gnome-shell/1633:
[  249.493579]  #0: ffff9bd4be4f8c00
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.493593]  #1: ffff9bd4be4f8ca8
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xca0
[  249.493607]  #2: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493807] 1 lock held by (brt-dbus)/1634:
[  249.493811]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494025] 1 lock held by (time-dir)/1640:
[  249.494029]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494229] 1 lock held by (ostnamed)/1723:
[  249.494233]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494432] 1 lock held by (pcscd)/1748:
[  249.494436]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]

[  249.494639] =============================================

Here is pastebin from initial post: https://pastebin.com/0YHs6wyB