linux-kernel - Re: [PATCH RFC 10/18] drm/scheduler: Add can_run

Message-ID: <20502c4c-c987-3117-119a-2fd38ae5f607@amd.com>
Date:   Wed, 5 Apr 2023 16:14:11 +0200
From:   Christian König <christian.koenig@....com>
To:     Asahi Lina <lina@...hilina.net>,
        Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
        Maxime Ripard <mripard@...nel.org>,
        Thomas Zimmermann <tzimmermann@...e.de>,
        David Airlie <airlied@...il.com>,
        Miguel Ojeda <ojeda@...nel.org>,
        Alex Gaynor <alex.gaynor@...il.com>,
        Wedson Almeida Filho <wedsonaf@...il.com>,
        Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
        Björn Roy Baron <bjorn3_gh@...tonmail.com>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Luben Tuikov <luben.tuikov@....com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Alyssa Rosenzweig <alyssa@...enzweig.io>,
        Karol Herbst <kherbst@...hat.com>,
        Ella Stanforth <ella@...unix.org>,
        Faith Ekstrand <faith.ekstrand@...labora.com>,
        Mary <mary@...y.zone>, linux-kernel@...r.kernel.org,
        dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org,
        linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
        linux-sgx@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback

On 05.04.23 at 15:40, Daniel Vetter wrote:
> On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
>> Some hardware may require more complex resource utilization accounting
>> than the simple job count supported by drm_sched internally. Add a
>> can_run_job callback to allow drivers to implement more logic before
>> deciding whether to run a GPU job.
>>
>> Signed-off-by: Asahi Lina <lina@...hilina.net>
> Ok scheduler rules, or trying to summarize the entire discussion:
>
> dma_fence rules are very tricky. The two main chapters in the docs are
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
>
> Unfortunately I don't think it's possible to check this at compile time,
> thus far all we can do is validate at runtime. I've posted two patches for
> this:
>
> https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ffwll.ch/
> https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ffwll.ch/
>
> Unfortunately most drivers are buggy and get this completely wrong, so
> realistically we'd need to make this a per-driver opt-out and annotate all
> current drivers. Well except amdgpu is correct by now I think (they'd
> still need to test that).

There is still one potential memory allocation in the run_job callback 
in amdgpu which I wasn't able to fix yet.

But that one is purely academic and could trivially be switched to
GFP_ATOMIC if we ever have to.

Christian.

>   And Rob Clark is working on patches to fix up
> msm.
>
> I think best here is if you work together with Rob to make sure these
> annotations are mandatory for any rust drivers (I don't want new buggy
> drivers at least). Would also be great to improve the kerneldoc for all
> the driver hooks to explain these restrictions and link to the relevant
> kerneldocs (there's also one for the dma_fence signalling annotations
> which might be worth linking too).
>
> I don't see any way to make this explicit in rust types, it's really only
> something runtime tests (using lockdep) can catch. Somewhat disappointing.
>
> For the other things discussed here:
>
> - Option<Dma_Fence> as the return value for ->prepare_job makes sense to
>    me.
>
> - I don't see any way a driver can use ->can_run_job without breaking the
>    above rules, that really doesn't sound like a good idea to me.
>
> Cheers, Daniel
>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>   include/drm/gpu_scheduler.h            |  8 ++++++++
>>   2 files changed, 18 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 4e6ad6e122bc..5c0add2c7546 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>   		if (!entity)
>>   			continue;
>>   
>> +		if (sched->ops->can_run_job) {
>> +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> +			if (!sched_job) {
>> +				complete_all(&entity->entity_idle);
>> +				continue;
>> +			}
>> +			if (!sched->ops->can_run_job(sched_job))
>> +				continue;
>> +		}
>> +
>>   		sched_job = drm_sched_entity_pop_job(entity);
>>   
>>   		if (!sched_job) {
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 9db9e5e504ee..bd89ea9507b9 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>   	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>   					 struct drm_sched_entity *s_entity);
>>   
>> +	/**
>> +	 * @can_run_job: Called before job execution to check whether the
>> +	 * hardware is free enough to run the job.  This can be used to
>> +	 * implement more complex hardware resource policies than the
>> +	 * hw_submission limit.
>> +	 */
>> +	bool (*can_run_job)(struct drm_sched_job *sched_job);
>> +
>>   	/**
>>            * @run_job: Called to execute the job once all of the dependencies
>>            * have been resolved.  This may be called multiple times, if
>>
>> -- 
>> 2.35.1
>>
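
For readers following the control flow of the quoted hunk, here is a minimal userspace sketch of the peek-then-check step the patch adds to drm_sched_main(). It is only a model: every name in it (model_hw, model_job, model_entity, model_ops, model_sched_step, demo_can_run_job, demo_run_job) is hypothetical and not part of drm_sched, and the "free units" accounting is an assumed example of a resource policy. It does not model dma_fence signalling or the locking rules discussed above, so it says nothing about whether a real driver could implement the callback safely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for some hardware resource pool. */
struct model_hw {
	unsigned int free_units;
};

/* Hypothetical job: needs `cost` units of `hw` to run. */
struct model_job {
	struct model_hw *hw;
	unsigned int cost;
};

/* Hypothetical entity: a simple FIFO over a fixed array of jobs. */
struct model_entity {
	struct model_job *jobs;
	size_t head;
	size_t count;
};

struct model_ops {
	/* Optional, mirrors the shape of the proposed callback. */
	bool (*can_run_job)(const struct model_job *job);
	void (*run_job)(struct model_job *job);
};

/* Look at the next queued job without removing it. */
static struct model_job *entity_peek(struct model_entity *e)
{
	return e->head < e->count ? &e->jobs[e->head] : NULL;
}

/* Remove and return the next queued job, or NULL if the queue is empty. */
static struct model_job *entity_pop(struct model_entity *e)
{
	struct model_job *job = entity_peek(e);

	if (job)
		e->head++;
	return job;
}

/*
 * One pass of the selection step: peek first, ask the driver, and only
 * pop the job if the driver agrees. Returns the job that ran, or NULL
 * if nothing ran (empty queue, or the driver declined).
 */
static struct model_job *model_sched_step(const struct model_ops *ops,
					  struct model_entity *e)
{
	struct model_job *job;

	if (ops->can_run_job) {
		job = entity_peek(e);
		if (!job)
			return NULL;
		if (!ops->can_run_job(job))
			return NULL;	/* job stays at the head of the queue */
	}

	job = entity_pop(e);
	if (!job)
		return NULL;

	ops->run_job(job);
	return job;
}

/* Assumed example policy: run only if enough units are free. */
static bool demo_can_run_job(const struct model_job *job)
{
	return job->cost <= job->hw->free_units;
}

/* Assumed example: running a job consumes its units. */
static void demo_run_job(struct model_job *job)
{
	job->hw->free_units -= job->cost;
}
```

In this model a declined job is simply left at the head of the queue and is retried on the next call to model_sched_step(), which matches the `continue` in the hunk. What wakes the real scheduler thread again after a declined job is outside the scope of the sketch.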