Message-ID: <782b5c27-8a1c-06a3-9fc2-6c5f9aef4a13@amd.com>
Date: Fri, 25 Aug 2023 16:48:08 +0530
From: "Yadav, Arvind" <arvyadav@....com>
To: "Lazar, Lijo" <lijo.lazar@....com>,
Arvind Yadav <Arvind.Yadav@....com>, Christian.Koenig@....com,
alexander.deucher@....com, shashank.sharma@....com,
Xinhui.Pan@....com, airlied@...il.com, daniel@...ll.ch,
Felix.Kuehling@....com, amd-gfx@...ts.freedesktop.org
Cc: linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org
Subject: Re: [PATCH v2 3/7] drm/amdgpu: Add new function to put GPU power
profile
On 8/22/2023 6:16 PM, Lazar, Lijo wrote:
>
>
> On 8/22/2023 5:41 PM, Yadav, Arvind wrote:
>> Hi Lijo,
>>
>> The *_set function will set the GPU power profile, and the *_put
>> function will schedule the smu_delayed_work task after a 100 ms
>> delay. This smu_delayed_work task will clear the GPU power profile
>> if no new jobs are scheduled within 100 ms. But if a new job arrives
>> within 100 ms, the *_workload_profile_set function will cancel this
>> work and set the GPU power profile based on preferences.
>>
>> Please see the below case.
>>
>> Case 1 - only jobs of the same profile run. It will take 100 ms to
>> clear the profile once all jobs complete.
>>
>> wl = VIDEO <100ms>
>> workload _________|`````````````````````````````````````|____
>>
>> Jobs (VIDEO) ________|```|__|```|___|````|___________
>>
>>
>> Case 2 - two jobs with two different profiles. The job1 profile will
>> be set, but when job2 arrives, the workload will be moved to the
>> higher profile.
>>
>> wl = VIDEO -> wl = COMPUTE                                  <100ms>
>>
>> workload ___|``````````````````````````````````````````````````````````````````|____
>>
>> Jobs (VIDEO) ___|```|__|```|___|````|___|````|_______
>>
>> Jobs (COMPUTE) ______________|```|___|````|___|````|_________
>>
>>
>>
>> Case 3 - two jobs with two different profiles. The job1 profile will
>> be set, but when job2 arrives, the workload will not be moved to the
>> lower profile. Only when compute job2 completes will it move to the
>> lower profile.
>>
>> wl = COMPUTE -> wl = VIDEO                                  <100ms>
>>
>> workload _________|``````````````````````````````````````````````````````````````````|____
>>
>>
>> Jobs (COMPUTE) ____|```|__|```|___|````|___|````|_______
>>
>> Jobs (VIDEO) ___________________|```|___|````|___|````|___|````|___________
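The timing in the three cases above can be sketched as a small userspace model. All names here (`wl_set`, `wl_put`, `wl_timeout`, `clear_pending`) are hypothetical illustrations of the described flow, not the driver's API:

```c
#include <stdbool.h>

/* Hypothetical userspace model of the flow above: *_put() arms a delayed
 * clear (the smu_delayed_work), and a *_set() arriving within the 100 ms
 * window cancels it.  "active" models the currently requested profile and
 * "clear_pending" models the queued delayed work. */
struct wl_state {
	int active;          /* -1 means no profile is requested */
	bool clear_pending;  /* a delayed clear has been scheduled */
};

static void wl_set(struct wl_state *s, int profile)
{
	s->clear_pending = false;  /* cancel any pending delayed clear */
	s->active = profile;
}

static void wl_put(struct wl_state *s)
{
	s->clear_pending = true;   /* schedule the clear for 100 ms later */
}

static void wl_timeout(struct wl_state *s)
{
	/* fires after 100 ms: clear only if no new job re-armed the profile */
	if (s->clear_pending) {
		s->active = -1;
		s->clear_pending = false;
	}
}
```

As in Case 1, a new job arriving inside the 100 ms window keeps the profile alive; only an uninterrupted timeout clears it.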
>>
>
> swsmu layer maintains a workload mask based on priority. So once you
> have set the mask, until you unset it (i.e when refcount = 0), the
> mask will be set in the lower layer. swsmu layer will take care of
> requesting FW the highest priority. I don't think that needs to be
> repeated at this level.
>
> At this layer, all you need is to refcount the requests and make the
> request.
>
> When refcount of a profile becomes non-zero (only one-time), place one
> request for that profile. As swsmu layer maintains the workload mask,
> it will take the new profile also into consideration while requesting
> for the one with the highest priority.
>
> When refcount of a profile becomes zero, place a request to clear it.
> This is controlled by your idle work. As I see, it keeps an additional
> 100ms tolerance before placing a clear request. In that way, there is
> no need to cancel that work.
>
> Inside idle work handler -
> Loop through the profiles that are set and clear those profiles whose
> refcount is zero.
>
> Thus if a job starts during the 100ms delay, idle work won't see the
> ref count as zero and then it won't place a request to clear out that
> profile.
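The clear pass described here can be sketched as a self-contained userspace model. `clear_idle_profiles`, `workload_status`, and `refcount` are stand-ins for the driver's `submit_workload_status` and `power_profile_ref[]`; this is not the actual driver code:

```c
#define PP_PROFILE_COUNT 8  /* stand-in for the number of power profiles */

/* Model of the suggested idle-work pass: walk the bits set in the
 * submitted-workload mask and clear only those profiles whose refcount
 * dropped to zero; profiles still referenced are left untouched, so a
 * job that started during the 100 ms window is never cleared. */
static unsigned int clear_idle_profiles(unsigned int workload_status,
					const int refcount[PP_PROFILE_COUNT])
{
	for (int bit = 0; bit < PP_PROFILE_COUNT; bit++) {
		if ((workload_status & (1u << bit)) && refcount[bit] == 0)
			workload_status &= ~(1u << bit);  /* clear request */
	}
	return workload_status;
}
```

In the kernel, this walk would typically use for_each_set_bit() from <linux/bitops.h> rather than an open-coded loop.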
>
Hi Lijo,
Thank you for your comment. We will take your comments into
consideration, but we would like to retain the same design.
~Arvind.
>> On 8/22/2023 10:21 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 8/21/2023 12:17 PM, Arvind Yadav wrote:
>>>> This patch adds a function which will clear the GPU
>>>> power profile after the job is finished.
>>>>
>>>> This is how it works:
>>>> - The scheduler will set the GPU power profile based on ring_type.
>>>> - The scheduler will clear the GPU power profile once the job is
>>>>    finished.
>>>> - Here, the *_workload_profile_set function will set the GPU
>>>>    power profile and the *_workload_profile_put function will
>>>>    schedule the smu_delayed_work task after a 100 ms delay. This
>>>>    smu_delayed_work task will clear the GPU power profile if no
>>>>    new jobs are scheduled within 100 ms. But if a new job arrives
>>>>    within 100 ms, the *_workload_profile_set function will cancel
>>>>    this work and set the GPU power profile based on preferences.
>>>>
>>>> v2:
>>>> - Splitting workload_profile_set and workload_profile_put
>>>> into two separate patches.
>>>> - Addressed review comment.
>>>>
>>>> Cc: Shashank Sharma <shashank.sharma@....com>
>>>> Cc: Christian Koenig <christian.koenig@....com>
>>>> Cc: Alex Deucher <alexander.deucher@....com>
>>>> Signed-off-by: Arvind Yadav <Arvind.Yadav@....com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c  | 97 +++++++++++++++++++
>>>>   drivers/gpu/drm/amd/include/amdgpu_workload.h |  3 +
>>>>   2 files changed, 100 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> index e661cc5b3d92..6367eb88a44d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> @@ -24,6 +24,9 @@
>>>> #include "amdgpu.h"
>>>> +/* 100 millisecond timeout */
>>>> +#define SMU_IDLE_TIMEOUT msecs_to_jiffies(100)
>>>> +
>>>> static enum PP_SMC_POWER_PROFILE
>>>> ring_to_power_profile(uint32_t ring_type)
>>>> {
>>>> @@ -59,6 +62,80 @@ amdgpu_power_profile_set(struct amdgpu_device *adev,
>>>> return ret;
>>>> }
>>>> +static int
>>>> +amdgpu_power_profile_clear(struct amdgpu_device *adev,
>>>> + enum PP_SMC_POWER_PROFILE profile)
>>>> +{
>>>> + int ret = amdgpu_dpm_switch_power_profile(adev, profile, false);
>>>> +
>>>> + if (!ret) {
>>>> + /* Clear the bit for the submitted workload profile */
>>>> + adev->smu_workload.submit_workload_status &= ~(1 << profile);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static void
>>>> +amdgpu_power_profile_idle_work_handler(struct work_struct *work)
>>>> +{
>>>> +
>>>> + struct amdgpu_smu_workload *workload = container_of(work,
>>>> + struct amdgpu_smu_workload,
>>>> + smu_delayed_work.work);
>>>> + struct amdgpu_device *adev = workload->adev;
>>>> + bool reschedule = false;
>>>> + int index = fls(workload->submit_workload_status);
>>>> + int ret;
>>>> +
>>>> + mutex_lock(&workload->workload_lock);
>>>> + for (; index > 0; index--) {
>>>
>>> Why not use for_each_set_bit?
>>
>> We only clear the profiles that we have set. We clear the higher
>> profile first, then the lower one.
>>
>
> You don't need to take care of this. The swsmu layer will take care
> of the priority; it is not the job of this layer. swsmu is the layer
> that could be altered for each specific SOC, and it can take care of
> any priority changes accordingly. This layer only needs to refcount
> the requests and place them accordingly.
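For reference, the kernel's for_each_set_bit() visits set bits from the lowest bit upward, while the posted loop walks downward from fls(). A minimal userspace sketch of the ascending order (`visit_set_bits` is a hypothetical helper, not a kernel API):

```c
/* Userspace sketch of the iteration order for_each_set_bit() would give:
 * set bits are visited from the least-significant bit upward.  The real
 * macro lives in <linux/bitops.h>; this helper just records the order. */
static int visit_set_bits(unsigned long mask, int out[], int max)
{
	int n = 0;

	for (int bit = 0; bit < (int)(8 * sizeof(mask)) && n < max; bit++) {
		if (mask & (1ul << bit))
			out[n++] = bit;
	}
	return n;
}
```

Since the swsmu layer resolves priority on its own, the visiting order is not significant at this layer, which is the reviewer's point.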
>
>>
>>>
>>>> + int val = atomic_read(&workload->power_profile_ref[index]);
>>>> +
>>>> + if (val) {
>>>> + reschedule = true;
>>>
>>> Why do you need to do reschedule? For each put(), a schedule is
>>> called. If refcount is not zero, that means some other job has
>>> already set the profile. It is supposed to call put() and at that
>>> time, this job will be run to clear it anyway, right?
>>>
>> Yes, I have received a comment on this and am going to remove it.
>> Noted.
>>
>>>> + } else {
>>>> + if (workload->submit_workload_status &
>>>> + (1 << index)) {
>>>> + ret = amdgpu_power_profile_clear(adev, index);
>>>> + if (ret) {
>>>> +                DRM_WARN("Failed to clear workload %s,error = %d\n",
>>>> +                         amdgpu_workload_mode_name[index], ret);
>>>> + goto exit;
>>>> + }
>>>> + }
>>>> + }
>>>> + }
>>>> + if (reschedule)
>>>> + schedule_delayed_work(&workload->smu_delayed_work,
>>>> + SMU_IDLE_TIMEOUT);
>>>> +exit:
>>>> + mutex_unlock(&workload->workload_lock);
>>>> +}
>>>> +
>>>> +void amdgpu_workload_profile_put(struct amdgpu_device *adev,
>>>> + uint32_t ring_type)
>>>> +{
>>>> + struct amdgpu_smu_workload *workload = &adev->smu_workload;
>>>> +    enum PP_SMC_POWER_PROFILE profile = ring_to_power_profile(ring_type);
>>>> +
>>>> + if (profile == PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT)
>>>> + return;
>>>> +
>>>> + mutex_lock(&workload->workload_lock);
>>>> +
>>>> + if (!atomic_read(&workload->power_profile_ref[profile])) {
>>>> + DRM_WARN("Power profile %s ref. count error\n",
>>>> + amdgpu_workload_mode_name[profile]);
>>>> + } else {
>>>> + atomic_dec(&workload->power_profile_ref[profile]);
>>>> + schedule_delayed_work(&workload->smu_delayed_work,
>>>> + SMU_IDLE_TIMEOUT);
>>>> + }
>>>> +
>>>> + mutex_unlock(&workload->workload_lock);
>>>> +}
>>>> +
>>>> void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> uint32_t ring_type)
>>>> {
>>>> @@ -70,13 +147,30 @@ void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> return;
>>>> mutex_lock(&workload->workload_lock);
>>>> + cancel_delayed_work_sync(&workload->smu_delayed_work);
>>>
>>> This is a potential deadlock. You already hold the mutex and then
>>> waiting for idle work to finish. Idle work could now be at the point
>>> where it is waiting for the same mutex. Suggest not to call cancel
>>> here and let the mutex take care of the sequence.
>> We cannot cancel the idle work if it is already running, so we have
>> to wait until the idle work is complete. If the *put function
>> arrives before the idle work has started, then we can cancel it;
>> but if the work thread is running, we should wait.
>
> No need to wait, because you already have a mutex. So you will
> naturally wait for the mutex lock to be released (if the idle work
> has already grabbed it). If a request comes in while the idle work
> is running, it is only a timing issue.
>
> Also, you have a deadlock here. Here you acquire the mutex first and
> then wait for the idle work to finish. The idle work function may
> have just started at that point and reached the place where it is
> about to grab the mutex. That is a deadlock: this function is
> waiting for the idle work to finish, and the idle work is waiting to
> get the mutex.
>
> Nevertheless, this function doesn't even need to take care of such
> things. It only grabs the mutex, increases the refcount, and places
> a request if the refcount becomes non-zero.
>
> At whatever point, idle work runs, it will see that the refcount is
> not zero and skips placing a request to clear that profile.
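The simplified set()/put() outlined here could look roughly like the following userspace model. `profile_set`, `profile_put`, `refcount`, and `workload_status` are hypothetical stand-ins for the driver's symbols; the point is that no cancel_delayed_work_sync() is needed while holding the lock:

```c
#define PP_PROFILE_COUNT 8  /* stand-in for the number of power profiles */

static int refcount[PP_PROFILE_COUNT];   /* models power_profile_ref[] */
static unsigned int workload_status;     /* models submit_workload_status */

/* set(): bump the refcount and place a request only on the 0 -> 1
 * transition.  In the driver this would run under workload_lock. */
static void profile_set(int profile)
{
	if (refcount[profile]++ == 0)
		workload_status |= 1u << profile;  /* one-time request */
}

/* put(): drop the refcount; the idle work (not modeled here) later
 * clears the status bit for profiles whose refcount is zero. */
static void profile_put(int profile)
{
	if (refcount[profile] > 0)
		refcount[profile]--;
}
```

Because set() never waits on the idle work, the mutex ordering alone serializes the two paths and the deadlock described above cannot occur.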
>
>>>
>>>> ret = amdgpu_power_profile_set(adev, profile);
>>>> if (ret) {
>>>>          DRM_WARN("Failed to set workload profile to %s, error = %d\n",
>>>> amdgpu_workload_mode_name[profile], ret);
>>>> + goto exit;
>>>> + }
>>>> +
>>>> + /* Clear the already finished jobs of higher power profile*/
>>>> + for (int index = fls(workload->submit_workload_status);
>>>> + index > profile; index--) {
>>>> + if (!atomic_read(&workload->power_profile_ref[index]) &&
>>>> + workload->submit_workload_status & (1 << index)) {
>>>> + ret = amdgpu_power_profile_clear(adev, index);
>>>> + if (ret) {
>>>> + DRM_WARN("Failed to clear workload %s, err = %d\n",
>>>> + amdgpu_workload_mode_name[profile], ret);
>>>> + goto exit;
>>>> + }
>>>> + }
>>>
>>> If you follow the earlier comment, that will keep this logic in
>>> only one place - i.e., the idle work handler. Basically, just let
>>> the idle work handle its duty. If some job starts running during
>>> the clear call, it's just unfortunate timing; let the next set()
>>> take the lock and request the profile again.
>>
>> So basically, every millisecond new jobs are arriving and
>> completing on the same or a different profile. Suppose we are
>> running higher-profile jobs, and before they complete a
>> lower-profile job arrives; this check helps move from the higher
>> profile to the lower profile once the higher-profile jobs finish.
>> If we do not check here, it will be stuck on the higher profile,
>> and until then the other jobs will also not complete. Please refer
>> to the Case 3 scenario.
>>
>
> As mentioned before, this is not the place to take care of
> SOC-specific power profile priorities. We already have the swsmu
> layer doing that job. This layer just needs to refcount and place
> requests accordingly.
>
> Thanks,
> Lijo
>
>>
>>> Thanks,
>>> Lijo
>>>
>>>> }
>>>> +exit:
>>>> mutex_unlock(&workload->workload_lock);
>>>> }
>>>> @@ -87,6 +181,8 @@ void amdgpu_workload_profile_init(struct amdgpu_device *adev)
>>>> adev->smu_workload.initialized = true;
>>>> mutex_init(&adev->smu_workload.workload_lock);
>>>> + INIT_DELAYED_WORK(&adev->smu_workload.smu_delayed_work,
>>>> + amdgpu_power_profile_idle_work_handler);
>>>> }
>>>> void amdgpu_workload_profile_fini(struct amdgpu_device *adev)
>>>> @@ -94,6 +190,7 @@ void amdgpu_workload_profile_fini(struct amdgpu_device *adev)
>>>> if (!adev->smu_workload.initialized)
>>>> return;
>>>> + cancel_delayed_work_sync(&adev->smu_workload.smu_delayed_work);
>>>> adev->smu_workload.submit_workload_status = 0;
>>>> adev->smu_workload.initialized = false;
>>>> mutex_destroy(&adev->smu_workload.workload_lock);
>>>> diff --git a/drivers/gpu/drm/amd/include/amdgpu_workload.h b/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> index 5022f28fc2f9..ee1f87257f2d 100644
>>>> --- a/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> +++ b/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> @@ -46,6 +46,9 @@ static const char * const amdgpu_workload_mode_name[] = {
>>>> "Window3D"
>>>> };
>>>> +void amdgpu_workload_profile_put(struct amdgpu_device *adev,
>>>> + uint32_t ring_type);
>>>> +
>>>> void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> uint32_t ring_type);