Message-ID: <e62c6c4a-b8d2-4f7e-09c2-b5151fe12046@amd.com>
Date: Wed, 3 Feb 2021 15:37:32 +0100
From: Christian König <christian.koenig@....com>
To: Daniel Gomez <daniel@...c.com>, amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org
Cc: linux-kernel@...r.kernel.org, alexander.deucher@....com
Subject: Re: [amdgpu] deadlock
Hi Daniel,
I've talked a bit with our internal team.
The problem is that the 20.20 release still uses the older OpenCL stack,
which obviously has a bug here and causes a hang.
The best advice I can give you is to switch to the ROCm stack instead.
Regards,
Christian.
On 03.02.21 at 09:33, Daniel Gomez wrote:
> Hi all,
>
> I have a deadlock with the mainline amdgpu driver when running two
> OpenCL applications in parallel. So far, we've been able to replicate it
> easily by executing clinfo and MatrixMultiplication (from the AMD
> opencl-samples) at the same time. The opencl-samples are quite old, so if
> you have any other suggestion for testing I'd be happy to try it as well.
>
> How to replicate the issue:
>
> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> # while true; do clinfo; done
>
> Output:
>
> After a minute or less (sometimes more), MatrixMultiplication and
> clinfo hang. In addition, radeontop shows the graphics pipe going from
> ~50% to 100% and the shader clock going up from ~35% to ~96%.
>
> clinfo keeps printing:
> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
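>
> (For reference, that ioctl is the DRM syncobj wait. Below is a minimal
> sketch of how a client typically issues it through libdrm; the helper
> name and the fd/handle values are illustrative, not taken from the
> trace:)
>
>     #include <stdint.h>
>     #include <xf86drm.h>
>
>     /* drmSyncobjWait() wraps DRM_IOCTL_SYNCOBJ_WAIT and returns a
>      * negative errno such as -ETIME when the fence is not signaled
>      * before the timeout -- the error clinfo keeps hitting above. */
>     static int wait_syncobj(int fd, uint32_t handle, int64_t timeout_ns)
>     {
>             return drmSyncobjWait(fd, &handle, 1, timeout_ns,
>                                   DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
>     }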
>
> And MatrixMultiplication shows the following (under strace) if you try
> to kill the process:
>
> sched_yield() = 0
> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> <detached ...>
>
> After this, the GPU is not functional at all and a power cycle is
> needed to restore the system.
>
> Hardware info:
> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
>
> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> (rev 83)
> DeviceName: Broadcom 5762
> Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> [Radeon Vega Series / Radeon Vega Mobile Series]
> Kernel driver in use: amdgpu
> Kernel modules: amdgpu
>
> Linux kernel info:
>
> root@...222:~# uname -a
> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> 2021 x86_64 x86_64 x86_64 GNU/Linux
>
> By enabling the kernel lock stats I could see that MatrixMultiplication
> is hung in the amdgpu_mn_invalidate_gfx function:
>
> [ 738.359202] 1 lock held by MatrixMultiplic/653:
> [ 738.359206] #0: ffff88810e364fe0
> (&adev->notifier_lock){+.+.}-{3:3}, at:
> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>
> Looking at the amdgpu_mn_invalidate_gfx function, I can see that
> dma_resv_wait_timeout_rcu is called with wait_all (all fences) and
> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting
> forever. According to the documentation: "When somebody tries to
> invalidate the page tables we block the update until all operations on
> the pages in question are completed, then those pages are marked as
> accessed and also dirty if it wasn’t a read only access." It looks like
> the fences are deadlocked and the wait therefore never returns. Could
> that be the case? Any hint as to where I can look to fix this?
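>
> For context, here is the path as I read it in the 5.11 tree
> (amdgpu_mn.c); abridged and with my own comments, so treat it as a
> sketch rather than the exact source:
>
>     static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
>                                          const struct mmu_notifier_range *range,
>                                          unsigned long cur_seq)
>     {
>             struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
>             struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>             long r;
>
>             if (!mmu_notifier_range_blockable(range))
>                     return false;
>
>             /* the lock reported as held in the stats above */
>             mutex_lock(&adev->notifier_lock);
>             mmu_interval_set_seq(mni, cur_seq);
>
>             /* Waits for *all* fences on the BO with MAX_SCHEDULE_TIMEOUT,
>              * i.e. indefinitely -- this is where the task sits in D state
>              * if one of those fences never signals. */
>             r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
>                                           MAX_SCHEDULE_TIMEOUT);
>             mutex_unlock(&adev->notifier_lock);
>             if (r <= 0)
>                     DRM_ERROR("(%ld) failed to wait for user bo\n", r);
>             return true;
>     }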
>
> Thank you in advance.
>
> Here is the full dmesg output:
>
> [ 738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> [ 738.344937] Not tainted 5.11.0-rc6-qtec-standard #2
> [ 738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 738.358240] task:MatrixMultiplic state:D stack: 0 pid: 653
> ppid: 1 flags:0x00004000
> [ 738.358254] Call Trace:
> [ 738.358261] ? dma_fence_default_wait+0x1eb/0x230
> [ 738.358276] __schedule+0x370/0x960
> [ 738.358291] ? dma_fence_default_wait+0x117/0x230
> [ 738.358297] ? dma_fence_default_wait+0x1eb/0x230
> [ 738.358305] schedule+0x51/0xc0
> [ 738.358312] schedule_timeout+0x275/0x380
> [ 738.358324] ? dma_fence_default_wait+0x1eb/0x230
> [ 738.358332] ? mark_held_locks+0x4f/0x70
> [ 738.358341] ? dma_fence_default_wait+0x117/0x230
> [ 738.358347] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> [ 738.358353] ? _raw_spin_unlock_irqrestore+0x39/0x40
> [ 738.358362] ? dma_fence_default_wait+0x117/0x230
> [ 738.358370] ? dma_fence_default_wait+0x1eb/0x230
> [ 738.358375] dma_fence_default_wait+0x214/0x230
> [ 738.358384] ? dma_fence_release+0x1a0/0x1a0
> [ 738.358396] dma_fence_wait_timeout+0x105/0x200
> [ 738.358405] dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> [ 738.358421] amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> [ 738.358688] __mmu_notifier_release+0x1bb/0x210
> [ 738.358710] exit_mmap+0x2f/0x1e0
> [ 738.358723] ? find_held_lock+0x34/0xa0
> [ 738.358746] mmput+0x39/0xe0
> [ 738.358756] do_exit+0x5c3/0xc00
> [ 738.358763] ? find_held_lock+0x34/0xa0
> [ 738.358780] do_group_exit+0x47/0xb0
> [ 738.358791] get_signal+0x15b/0xc50
> [ 738.358807] arch_do_signal_or_restart+0xaf/0x710
> [ 738.358816] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> [ 738.358822] ? _raw_spin_unlock_irqrestore+0x39/0x40
> [ 738.358831] ? ktime_get_mono_fast_ns+0x50/0xa0
> [ 738.358844] ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> [ 738.359044] exit_to_user_mode_prepare+0xf2/0x1b0
> [ 738.359054] syscall_exit_to_user_mode+0x19/0x60
> [ 738.359062] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 738.359069] RIP: 0033:0x7f6b89a51887
> [ 738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> [ 738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> [ 738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> [ 738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> [ 738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> [ 738.359129]
> Showing all locks held in the system:
> [ 738.359141] 1 lock held by khungtaskd/54:
> [ 738.359148] #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> debug_show_all_locks+0x15/0x183
> [ 738.359187] 1 lock held by systemd-journal/174:
> [ 738.359202] 1 lock held by MatrixMultiplic/653:
> [ 738.359206] #0: ffff88810e364fe0
> (&adev->notifier_lock){+.+.}-{3:3}, at:
> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>
> Daniel