Message-ID: <688b5665-496d-470d-9835-0c6eadfa5569@gmail.com>
Date: Tue, 4 Mar 2025 10:05:05 +0100
From: Christian König <ckoenig.leichtzumerken@...il.com>
To: Matthew Brost <matthew.brost@...el.com>,
Danilo Krummrich <dakr@...nel.org>
Cc: Maíra Canal <mcanal@...lia.com>, phasta@...nel.org,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Tvrtko Ursulin <tvrtko.ursulin@...lia.com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()
On 24.02.25 at 17:25, Matthew Brost wrote:
> On Mon, Feb 24, 2025 at 03:43:49PM +0100, Danilo Krummrich wrote:
>> On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
>>> On 20/02/25 12:28, Philipp Stanner wrote:
>>>> On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
>>>>> Would it be possible to add a comment that `run_job()` must check
>>>>> whether `s_fence->finished.error` is non-zero? If you increase the
>>>>> karma of a job and don't check `s_fence->finished.error`, you might
>>>>> run a cancelled job.
>>>> s_fence->finished is only signaled, and its error only set, once the
>>>> hardware fence has signaled; or when the entity is killed.
>>> If you have a timeout, increase the karma of that job with
>>> `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
>>> latter will flag an error in the dma fence. If you don't check for it in
>>> `run_job()`, you will run the guilty job again.
>> Considering that drm_sched_resubmit_jobs() is deprecated I don't think we need
>> to add this hint to the documentation; the drivers that are still using the API
>> hopefully got it right.
>>
>>> I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
>>> currently fixing an issue in V3D with the GPU reset and we still use
>>> `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
>>> `timeout_job()` and the information I commented here (which was crucial
>>> to fix the bug) wasn't available there.
>> Well, hopefully... :-)
>>
>>> `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
>>> new use in 2023
>> Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a
>> 1:1 scheduler–entity setup. I'm a bit surprised Xe uses this function.
>>
> To clarify Xe's usage: we use this function to resubmit jobs after a
> device reset for queues which had nothing to do with the device reset.
> In practice, a device reset should never occur, as we have per-queue
> resets in our hardware. If a per-queue reset occurs, we ban the queue
> rather than doing a resubmit.
That's still invalid usage. Re-submitting jobs from inside the scheduler is a conceptually broken approach in general.
What you can do is to re-create the queue content after device reset inside your driver, but *never* use drm_sched_resubmit_jobs() for that.
>
> Matt
>
>>> for example. The commit that deprecated it just
>>> mentions AMD's case, but do we know if the function works as expected
>>> for the other users?
>> I read the comment [1] you're referring to differently. It says that
>> "Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
>> after a job timeout".
>>
>> It further explains that "there are many problem with the dma_fence
>> implementation and requirements. Either the implementation is risking deadlocks
>> with core memory management or violating documented implementation details of
>> the dma_fence object", which doesn't give any hint to me that the conceptual
>> issues are limited to amdgpu.
>>
>>> For V3D, it does. Also, we need to make it clear which
>>> dma_fence requirements the function violates.
>> This I fully agree with, unfortunately the comment does not explain what's the
>> issue at all.
>>
>> While I do think I have a vague idea of what's the potential issue with this
>> approach, I think it would be way better to get Christian, as the expert for DMA
>> fence rules to comment on this.
>>
>> @Christian: Can you please shed some light on this?
>>
>>> If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
>>> provide a common interface for job resubmission?
>> I wonder why this question did not come up when drm_sched_resubmit_jobs() was
>> deprecated two years ago, did it?
That's exactly why drm_sched_resubmit_jobs() was deprecated.
It is not possible to provide a common interface to re-submit jobs (with switching of hardware dma_fences) without breaking dma_fence rules.
The idea behind the scheduler is that you pack your submission state into a job object which as soon as it is picked up is converted into a hardware dma_fence for execution. This hardware dma_fence is then the object which represents execution of the submission on the hardware.
So on re-submission you either reuse the same dma_fence multiple times, which results in a *horrible* kref_init() on an already initialized reference (it's a wonder that this doesn't crash all the time in amdgpu), or you do things like allocating memory while the memory management potentially waits for the reset to complete.
What we could do is provide a helper for the device drivers, in the form of an iterator which gives you all the hardware fences the scheduler is waiting for, but in general device drivers should have this information themselves.
>>
>> Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
>> and then we can figure out how we can do better.
>>
>> I think it would also be interesting to know how amdgpu handles jobs from
>> unrelated entities being discarded by not re-submitting them when a job from
>> another entity hangs the HW ring.
Quite simple: this case never happens in the first place.
When you have individual queues for each process (e.g. like Xe and upcoming amdgpu HW generations) you should always be able to reset the device without losing everything.
Otherwise things like userspace queues wouldn't work at all either, because then neither the kernel nor the DRM scheduler is involved in the submission any more.
Regards,
Christian.
>>
>> [1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com