Message-ID: <CAHzEqDkFAiGkTFF3C--2NKt+ALjtfiNpWYca-Y-p=sekjQXGpw@mail.gmail.com>
Date: Mon, 14 Nov 2022 13:16:05 +1100
From: Mani Milani <mani@...omium.org>
To: Thomas Hellström <thomas.hellstrom@...ux.intel.com>
Cc: Matthew Auld <matthew.auld@...el.com>,
LKML <linux-kernel@...r.kernel.org>,
Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Chris Wilson <chris@...is-wilson.co.uk>,
Christian König <christian.koenig@....com>,
Daniel Vetter <daniel@...ll.ch>,
David Airlie <airlied@...il.com>,
Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
Niranjana Vishwanathapura <niranjana.vishwanathapura@...el.com>,
Nirmoy Das <nirmoy.das@...el.com>,
Rodrigo Vivi <rodrigo.vivi@...el.com>,
dri-devel@...ts.freedesktop.org, intel-gfx@...ts.freedesktop.org
Subject: Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
Thank you for your comments.
To Thomas's point, the crash always seems to happen when the following
sequence of events occurs:
1. Inside "i915_gem_evict_vm()", the call to
"i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
eviction of that vma is skipped as a result. In other words, the code
reaches this line:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
And here is the stack dump for this scenario:
Call Trace:
<TASK>
dump_stack_lvl+0x68/0x95
i915_gem_evict_vm+0x1d2/0x369
eb_validate_vmas+0x54a/0x6ae
eb_relocate_parse+0x4b/0xdb
i915_gem_execbuffer2_ioctl+0x6f5/0xab6
? i915_gem_object_prepare_write+0xfb/0xfb
drm_ioctl_kernel+0xda/0x14d
drm_ioctl+0x27f/0x3b7
? i915_gem_object_prepare_write+0xfb/0xfb
__se_sys_ioctl+0x7a/0xbc
do_syscall_64+0x56/0xa1
? exit_to_user_mode_prepare+0x3d/0x8c
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x78302de5fae7
Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
</TASK>
It is worth noting that "i915_gem_evict_vm()" still returns success in
this case.
2. After step 1, the next call to "grab_vma()" always fails (its
"i915_gem_object_trylock(vma->obj, ww)" call also fails due to
deadlock), which then results in the crash.
Here is the stack dump for this scenario:
Call Trace:
<TASK>
dump_stack_lvl+0x68/0x95
grab_vma+0x6c/0xd0
i915_gem_evict_for_node+0x178/0x23b
i915_gem_gtt_reserve+0x5a/0x82
i915_vma_insert+0x295/0x29e
i915_vma_pin_ww+0x41e/0x5c7
eb_validate_vmas+0x5f5/0x6ae
eb_relocate_parse+0x4b/0xdb
i915_gem_execbuffer2_ioctl+0x6f5/0xab6
? i915_gem_object_prepare_write+0xfb/0xfb
drm_ioctl_kernel+0xda/0x14d
drm_ioctl+0x27f/0x3b7
? i915_gem_object_prepare_write+0xfb/0xfb
__se_sys_ioctl+0x7a/0xbc
do_syscall_64+0x56/0xa1
? exit_to_user_mode_prepare+0x3d/0x8c
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x78302de5fae7
Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
</TASK>
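To make the two-step sequence above easier to follow, here is a
minimal userspace sketch of it. It is only a toy model, not the i915
code: "toy_obj", "toy_trylock", "toy_evict_vm" and "toy_reserve" are
hypothetical names, and reporting the second failure as -ENOSPC is an
assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an object bound into the address space.
 * "held_by_other" models a lock owned by a competing context. */
struct toy_obj {
	bool held_by_other;
	bool bound;
};

/* Models a trylock: it fails without saying why. */
bool toy_trylock(const struct toy_obj *obj)
{
	return !obj->held_by_other;
}

/* Step 1: evict every object that can be locked, silently skip the
 * rest, and report success regardless. */
int toy_evict_vm(struct toy_obj *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!toy_trylock(&objs[i]))
			continue;	/* eviction skipped */
		objs[i].bound = false;
	}
	return 0;
}

/* Step 2: reserving the range needs the blocking object gone. The
 * same trylock fails again, and the caller only sees a generic error
 * (assumed to be -ENOSPC in this model). */
int toy_reserve(struct toy_obj *blocker)
{
	assert(blocker != NULL);
	if (!blocker->bound)
		return 0;
	if (!toy_trylock(blocker))
		return -ENOSPC;
	blocker->bound = false;
	return 0;
}
```

With one object held by another context, "toy_evict_vm()" returns 0
while leaving that object bound, and the following "toy_reserve()" on
it fails, which mirrors the order of the two stack dumps.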
My Notes:
- I verified that the two "i915_gem_object_trylock()" failures
mentioned above are due to deadlock by temporarily modifying the code
to call "i915_gem_object_lock()" right after the trylock failure in
those two cases, purely to inspect the returned error code.
- The two cases mentioned above are the only cases where
"i915_gem_object_trylock(obj, ww)" is called with the second argument
not being forced to NULL.
- If I replace "i915_gem_object_trylock" with "i915_gem_object_lock"
in either of the two cases above (i.e. inside "grab_vma()" or
"i915_gem_evict_vm"), the issue is resolved (because the deadlock is
detected and resolved).
So, if this matches the design better, another solution could be for
"grab_vma" to continue to call "i915_gem_object_trylock", but for
"i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
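The difference between the two calls that the notes above rely on can
be sketched with a small userspace model. This is a simplification
and not the kernel's ww_mutex implementation: all "demo_*" names are
hypothetical, a lower "stamp" is assumed to mean an older context, and
-EBUSY stands in for the case where a real lock would sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Acquire context; a lower stamp means an older context. */
struct demo_ctx {
	unsigned long stamp;
};

struct demo_lock {
	const struct demo_ctx *owner;	/* NULL when unlocked */
};

/* Trylock: succeeds only if the lock is free. On failure the caller
 * learns nothing about the reason. */
bool demo_trylock(struct demo_lock *l, const struct demo_ctx *ctx)
{
	if (l->owner != NULL)
		return false;
	l->owner = ctx;
	return true;
}

/* Full lock: returns an error code the caller can act on. */
int demo_lock_ww(struct demo_lock *l, const struct demo_ctx *ctx)
{
	assert(l != NULL && ctx != NULL);
	if (l->owner == NULL) {
		l->owner = ctx;
		return 0;
	}
	if (l->owner == ctx)
		return -EALREADY;	/* already held by this context */
	if (l->owner->stamp < ctx->stamp)
		return -EDEADLK;	/* older owner: back off and retry */
	return -EBUSY;			/* model only: a real lock would sleep */
}

/* Diagnostic pattern from the notes: trylock first and, only if it
 * fails, call the full lock to see the error code. */
int demo_grab(struct demo_lock *l, const struct demo_ctx *ctx)
{
	if (demo_trylock(l, ctx))
		return 0;
	return demo_lock_ww(l, ctx);
}
```

In this model, a younger context calling "demo_grab()" on a lock held
by an older one gets -EDEADLK back, which a caller can use to back off
and retry, whereas "demo_trylock()" alone only reports a failure.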
Further info:
- Would you like any further info on the crash? If so, could you
please advise 1) what exactly you need and 2) how I can share it with
you, especially if the dumps are large?
Thanks.