[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9c3dc2ad-11e4-6004-7230-8ca752e3d9f7@asahilina.net>
Date: Thu, 9 Mar 2023 04:45:58 +0900
From: Asahi Lina <lina@...hilina.net>
To: Christian König <christian.koenig@....com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>,
Daniel Vetter <daniel@...ll.ch>,
Miguel Ojeda <ojeda@...nel.org>,
Alex Gaynor <alex.gaynor@...il.com>,
Wedson Almeida Filho <wedsonaf@...il.com>,
Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
Björn Roy Baron <bjorn3_gh@...tonmail.com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Luben Tuikov <luben.tuikov@....com>,
Jarkko Sakkinen <jarkko@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Alyssa Rosenzweig <alyssa@...enzweig.io>,
Karol Herbst <kherbst@...hat.com>,
Ella Stanforth <ella@...unix.org>,
Faith Ekstrand <faith.ekstrand@...labora.com>,
Mary <mary@...y.zone>, linux-kernel@...r.kernel.org,
dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org,
linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
linux-sgx@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback
On 09/03/2023 04.12, Christian König wrote:
> Am 08.03.23 um 20:05 schrieb Asahi Lina:
>> [SNIP]
>>> Well it's not the better way, it's the only way that works.
>>>
>>> I have to admit that my bet on your intentions was wrong, but even that
>>> use case doesn't work correctly.
>>>
>>> See when your callback returns false it is perfectly possible that all
>>> hw fences are signaled between returning that information and processing it.
>>>
>>> The result would be that the scheduler goes to sleep and never wakes up
>>> again.
>> That can't happen, because it will just go into another iteration of the
>> drm_sched main loop since there is an entity available still.
>>
>> Rather there is probably the opposite bug in this patch: the can_run_job
>> logic should be moved into the wait_event_interruptible() condition
>> check, otherwise I think it can end up busy-looping since the condition
>> itself can be true even when the can_run_job check blocks it.
>>
>> But there is no risk of it going to sleep and never waking up because
>> job completions will wake up the waitqueue by definition, and that
>> happens after the driver-side queues are popped. If this problem could
>> happen, then the existing hw_submission_limit logic would be broken in
>> the same way. It is logically equivalent in how it works.
>>
>> Basically, if properly done in wait_event_interruptible, it is exactly
>> the logic of that macro that prevents this race condition and makes
>> everything work at all. Without it, drm_sched would be completely broken.
>>
>>> As I said we exercised those ideas before and yes this approach here
>>> came up before as well and no it doesn't work.
>> It can never deadlock with this patch as it stands (though it could busy
>> loop), and if properly moved into the wait_event_interruptible(), it
>> would also never busy loop and work entirely as intended. The actual API
>> change is sound.
>>
>> I don't know why you're trying so hard to convince everyone that this
>> approach is fundamentally broken... It might be a bad idea for other
>> reasons, it might encourage incorrect usage, it might not be the best
>> option, there are plenty of arguments you can make... but you just keep
>> trying to make an argument that it just can't work at all for some
>> reason. Why? I already said I'm happy dropping it in favor of the fences...
>
> Well because it is broken.
>
> When you move the check into the wait_event_interruptible condition then
> who is going to call wait_event_interruptible when the condition changes?
I think you mean wake_up_interruptible(). That would be
drm_sched_job_done(), on the fence callback when a job completes, which
as I keep saying is the same logic used for
hw_rq_count/hw_submission_limit tracking.
Please think about it for a second, it's really not that complicated to
see why it works:
- Driver pops off completed commands <-- can_run_job condition satisfied
- Driver signals fence
- drm_sched_job_done_cb()
- drm_sched_job_done()
- atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
- ...
- wake_up_interruptible(&sched->wake_up_worker);
^- happens after both conditions are potentially satisfied
It really is completely equivalent to just making the hw_rq_count logic
customizable by the driver. The actual flow is the same. As long as the
driver guarantees it satisfies the can_run_job() condition before
signaling the completion fence that triggered that change, it works fine.
> As I said this idea came up before and was rejected multiple times.
Maybe it was a different idea, or maybe it was rejected for other
reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
~~ Lina
Powered by blists - more mailing lists