Message-ID: <b1a8f459-93dd-4b6c-b29e-c68fba19f6fc@damsy.net>
Date: Mon, 1 Sep 2025 15:14:05 +0200
From: Pierre-Eric Pelloux-Prayer <pierre-eric@...sy.net>
To: phasta@...nel.org,
Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>,
Matthew Brost <matthew.brost@...el.com>, Danilo Krummrich <dakr@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 2/2] drm/sched: limit sched score update to jobs change
On 25/08/2025 at 15:13, Philipp Stanner wrote:
> On Fri, 2025-08-22 at 15:43 +0200, Pierre-Eric Pelloux-Prayer wrote:
>> Currently, the scheduler score is incremented when a job is pushed to an
>> entity and when an entity is attached to the scheduler.
>
> It's indeed awkward that attaching is treated equivalently to job
> submission.
>
> Can you expand the documentation for drm_sched_init_args a bit so that
> it gets clearer what the score is supposed to do?
drm_sched_init_args.score is the feature allowing multiple schedulers to share a
score, so I suppose you meant drm_gpu_scheduler.score?
The doc currently says "score to help loadbalancer pick a idle sched", which is a
bit vague. It could be modified to become:
@score: holds the number of yet-to-be-completed jobs pushed to each scheduler.
It's used when load balancing between different schedulers.
What do you think?
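For reference, in include/drm/gpu_scheduler.h the updated comment could look
roughly like this (a sketch of the wording only, the member itself stays the
same):

	/**
	 * @score: holds the number of yet-to-be-completed jobs pushed to
	 * each scheduler. It's used when load balancing between different
	 * schedulers.
	 */
	atomic_t			*score;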
>
>>
>> This leads to some bad scheduling decisions where the score value is
>> largely made up of idle entities.
>>
>> For instance, in a scenario with 2 schedulers where 10 entities each submit
>> a single job, then do nothing, each scheduler will probably end up with a
>> score of 5.
>> Now, 5 userspace apps exit, so their entities will be dropped.
>>
>
> "entities will be dropped" == "drm_sched_entity_kill() gets called",
> right?
Yes.
>> In
>> the worst case, these apps' entities where all attached to the same
>
> s/where/were
>
> or better yet: "could be"
Will fix, thanks.
>
>> scheduler and we end up with score=5 (the 5 remaining entities) and
>> score=0, despite the 2 schedulers being idle.
>
> Sounds indeed like a (small) problem to me.
>
>
>> When new entities show up, they will all select the second scheduler
>> based on its low score value, instead of alternating between the 2.
>>
>> Some amdgpu rings depended on this feature, but the previous commit
>> implemented the same thing in amdgpu directly so it can be safely
>> removed from drm/sched.
>
> Can we be that sure that other drivers don't depend on it, though? I
> suspect it's likely that it's just amdgpu, but…
>
Aside from the new "rocket" driver pointed out by Tvrtko, amdgpu is the only
driver passing more than one scheduler to entities, so it's the only one that
could be affected.
I verified amdgpu, and Tvrtko pinged the rocket maintainers in the other thread.
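For context, the load balancing decision in question boils down to picking the
scheduler with the lowest score. Roughly something like this (a simplified
sketch, not the exact drm_sched_pick_best() code):

	static struct drm_gpu_scheduler *
	pick_least_loaded(struct drm_gpu_scheduler **sched_list,
			  unsigned int num_sched_list)
	{
		struct drm_gpu_scheduler *best = NULL;
		unsigned int min_score = UINT_MAX, i;

		for (i = 0; i < num_sched_list; i++) {
			unsigned int score = atomic_read(sched_list[i]->score);

			/* Keep the scheduler with the fewest pending jobs. */
			if (score < min_score) {
				min_score = score;
				best = sched_list[i];
			}
		}
		return best;
	}

An entity with a single scheduler in its list trivially gets that scheduler, so
only entities built on top of several schedulers can end up making a different
choice because of a stale score.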
>
>
> BTW, since you're cleaning up related stuff currently: I saw that it
> seems that the only driver that sets &struct drm_sched_init_args.score
> is amdgpu. Would be cool if you can take a look whether that's still
> needed.
It cannot really be removed yet as it's useful when a single hardware block is
exposed through different schedulers (so pushing jobs to one of the schedulers
should increase the load of the underlying hw).
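To illustrate, the shared-score case looks something like this (hypothetical
driver code: my_device and my_sched_ops are made up, and the
drm_sched_init_args fields are from memory):

	/* Two schedulers feeding the same hardware block share one score. */
	static atomic_t hw_block_score = ATOMIC_INIT(0);

	static int init_paired_scheds(struct my_device *mdev)
	{
		struct drm_sched_init_args args = {
			.ops = &my_sched_ops,
			.num_rqs = DRM_SCHED_PRIORITY_COUNT,
			.credit_limit = 64,
			.timeout = MAX_SCHEDULE_TIMEOUT,
			.score = &hw_block_score,	/* shared by both scheds */
			.name = "ring0",
			.dev = mdev->dev,
		};
		int r;

		r = drm_sched_init(&mdev->sched[0], &args);
		if (r)
			return r;

		args.name = "ring1";
		return drm_sched_init(&mdev->sched[1], &args);
	}

Dropping the shared score would make the two schedulers look independently
loaded even though they compete for the same hardware.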
Thanks,
Pierre-Eric
>
>
> P.
>
>>
>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5a550fd76bf0..e6d232a8ec58 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -206,7 +206,6 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>> if (!list_empty(&entity->list))
>> return;
>>
>> - atomic_inc(rq->sched->score);
>> list_add_tail(&entity->list, &rq->entities);
>> }
>>
>> @@ -228,7 +227,6 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>
>> spin_lock(&rq->lock);
>>
>> - atomic_dec(rq->sched->score);
>> list_del_init(&entity->list);
>>
>> if (rq->current_entity == entity)