Message-ID: <782b5c27-8a1c-06a3-9fc2-6c5f9aef4a13@amd.com>
Date: Fri, 25 Aug 2023 16:48:08 +0530
From: "Yadav, Arvind" <arvyadav@....com>
To: "Lazar, Lijo" <lijo.lazar@....com>,
Arvind Yadav <Arvind.Yadav@....com>, Christian.Koenig@....com,
alexander.deucher@....com, shashank.sharma@....com,
Xinhui.Pan@....com, airlied@...il.com, daniel@...ll.ch,
Felix.Kuehling@....com, amd-gfx@...ts.freedesktop.org
Cc: linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org
Subject: Re: [PATCH v2 3/7] drm/amdgpu: Add new function to put GPU power
profile
On 8/22/2023 6:16 PM, Lazar, Lijo wrote:
>
>
> On 8/22/2023 5:41 PM, Yadav, Arvind wrote:
>> Hi Lijo,
>>
>> The *_set function will set the GPU power profile, and the *_put
>> function will schedule the smu_delayed_work task after a 100 ms
>> delay. This smu_delayed_work task will clear the GPU power profile
>> if no new jobs are scheduled within 100 ms. But if a new job arrives
>> within 100 ms, the *_workload_profile_set function will cancel this
>> work and set the GPU power profile based on preferences.
>>
>> Please see the below case.
>>
>> Case 1 - only jobs of the same profile run. It will take 100 ms to
>> clear the profile once all jobs complete.
>>
>> wl = VIDEO <100ms>
>> workload _________|`````````````````````````````````````|____
>>
>> Jobs (VIDEO) ________|```|__|```|___|````|___________
>>
>>
>> Case 2 - two jobs with two different profiles. The job1 profile will
>> be set, but when job2 arrives, the workload will be moved to the
>> higher profile.
>>
>> wl = VIDEO -> wl = COMPUTE                                  <100ms>
>>
>> workload ___|``````````````````````````````````````````````````````````````````|____
>>
>> Jobs (VIDEO) ___|```|__|```|___|````|___|````|_______
>>
>> Jobs (COMPUTE) ______________|```|___|````|___|````|_________
>>
>>
>>
>> Case 3 - two jobs with two different profiles. The job1 profile will
>> be set, but when job2 arrives, the workload will not be moved to the
>> lower profile. Only when compute job2 completes will it move to the
>> lower profile.
>>
>> wl = COMPUTE -> wl = VIDEO                                  <100ms>
>>
>> workload _________|``````````````````````````````````````````````````````````````````|____
>>
>>
>> Jobs (COMPUTE) ____|```|__|```|___|````|___|````|_______
>>
>> Jobs (VIDEO) ___________________|```|___|````|___|````|___|````|___________
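The timing in the three cases above can be sketched as a small userspace model. All names here (`wl_set`, `wl_put`, `wl_timeout`, `clear_pending`) are hypothetical illustrations of the described flow, not the driver's API:

```c
#include <stdbool.h>

/* Hypothetical userspace model of the flow above: *_put() arms a delayed
 * clear (the smu_delayed_work), and a *_set() arriving within the 100 ms
 * window cancels it.  "active" models the currently requested profile and
 * "clear_pending" models the queued delayed work. */
struct wl_state {
	int active;          /* -1 means no profile is requested */
	bool clear_pending;  /* a delayed clear has been scheduled */
};

static void wl_set(struct wl_state *s, int profile)
{
	s->clear_pending = false;  /* cancel any pending delayed clear */
	s->active = profile;
}

static void wl_put(struct wl_state *s)
{
	s->clear_pending = true;   /* schedule the clear for 100 ms later */
}

static void wl_timeout(struct wl_state *s)
{
	/* fires after 100 ms: clear only if no new job re-armed the profile */
	if (s->clear_pending) {
		s->active = -1;
		s->clear_pending = false;
	}
}
```

As in Case 1, a new job arriving inside the 100 ms window keeps the profile alive; only an uninterrupted timeout clears it.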
>>
>
> swsmu layer maintains a workload mask based on priority. So once you
> have set the mask, until you unset it (i.e when refcount = 0), the
> mask will be set in the lower layer. swsmu layer will take care of
> requesting FW the highest priority. I don't think that needs to be
> repeated at this level.
>
> At this layer, all you need is to refcount the requests and make the
> request.
>
> When refcount of a profile becomes non-zero (only one-time), place one
> request for that profile. As swsmu layer maintains the workload mask,
> it will take the new profile also into consideration while requesting
> for the one with the highest priority.
>
> When refcount of a profile becomes zero, place a request to clear it.
> This is controlled by your idle work. As I see, it keeps an additional
> 100ms tolerance before placing a clear request. In that way, there is
> no need to cancel that work.
>
> Inside idle work handler -
> Loop through the profiles that are set and clear those profiles whose
> refcount is zero.
>
> Thus if a job starts during the 100ms delay, idle work won't see the
> ref count as zero and then it won't place a request to clear out that
> profile.
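The clear pass described here can be sketched as a self-contained userspace model. `clear_idle_profiles`, `workload_status`, and `refcount` are stand-ins for the driver's `submit_workload_status` and `power_profile_ref[]`; this is not the actual driver code:

```c
#define PP_PROFILE_COUNT 8  /* stand-in for the number of power profiles */

/* Model of the suggested idle-work pass: walk the bits set in the
 * submitted-workload mask and clear only those profiles whose refcount
 * dropped to zero; profiles still referenced are left untouched, so a
 * job that started during the 100 ms window is never cleared. */
static unsigned int clear_idle_profiles(unsigned int workload_status,
					const int refcount[PP_PROFILE_COUNT])
{
	for (int bit = 0; bit < PP_PROFILE_COUNT; bit++) {
		if ((workload_status & (1u << bit)) && refcount[bit] == 0)
			workload_status &= ~(1u << bit);  /* clear request */
	}
	return workload_status;
}
```

In the kernel, this walk would typically use for_each_set_bit() from <linux/bitops.h> rather than an open-coded loop.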
>
Hi Lijo,
Thank you for your comment. We will take your comments into
consideration, but we would like to retain the same design.
~Arvind.
>> On 8/22/2023 10:21 AM, Lazar, Lijo wrote:
>>>
>>>
>>> On 8/21/2023 12:17 PM, Arvind Yadav wrote:
>>>> This patch adds a function which will clear the GPU
>>>> power profile after the job is finished.
>>>>
>>>> This is how it works:
>>>> - The scheduler will set the GPU power profile based on ring_type.
>>>> - The scheduler will clear the GPU power profile once the job is
>>>>    finished.
>>>> - Here, the *_workload_profile_set function will set the GPU
>>>>    power profile and the *_workload_profile_put function will
>>>>    schedule the smu_delayed_work task after a 100 ms delay. This
>>>>    smu_delayed_work task will clear the GPU power profile if no
>>>>    new jobs are scheduled within 100 ms. But if a new job arrives
>>>>    within 100 ms, the *_workload_profile_set function will cancel
>>>>    this work and set the GPU power profile based on preferences.
>>>>
>>>> v2:
>>>> - Splitting workload_profile_set and workload_profile_put
>>>> into two separate patches.
>>>> - Addressed review comment.
>>>>
>>>> Cc: Shashank Sharma <shashank.sharma@....com>
>>>> Cc: Christian Koenig <christian.koenig@....com>
>>>> Cc: Alex Deucher <alexander.deucher@....com>
>>>> Signed-off-by: Arvind Yadav <Arvind.Yadav@....com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c  | 97 +++++++++++++++++++
>>>>   drivers/gpu/drm/amd/include/amdgpu_workload.h |  3 +
>>>>   2 files changed, 100 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> index e661cc5b3d92..6367eb88a44d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_workload.c
>>>> @@ -24,6 +24,9 @@
>>>> #include "amdgpu.h"
>>>> +/* 100 millisecond timeout */
>>>> +#define SMU_IDLE_TIMEOUT msecs_to_jiffies(100)
>>>> +
>>>> static enum PP_SMC_POWER_PROFILE
>>>> ring_to_power_profile(uint32_t ring_type)
>>>> {
>>>> @@ -59,6 +62,80 @@ amdgpu_power_profile_set(struct amdgpu_device *adev,
>>>> return ret;
>>>> }
>>>> +static int
>>>> +amdgpu_power_profile_clear(struct amdgpu_device *adev,
>>>> + enum PP_SMC_POWER_PROFILE profile)
>>>> +{
>>>> + int ret = amdgpu_dpm_switch_power_profile(adev, profile, false);
>>>> +
>>>> + if (!ret) {
>>>> + /* Clear the bit for the submitted workload profile */
>>>> + adev->smu_workload.submit_workload_status &= ~(1 << profile);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static void
>>>> +amdgpu_power_profile_idle_work_handler(struct work_struct *work)
>>>> +{
>>>> +
>>>> + struct amdgpu_smu_workload *workload = container_of(work,
>>>> + struct amdgpu_smu_workload,
>>>> + smu_delayed_work.work);
>>>> + struct amdgpu_device *adev = workload->adev;
>>>> + bool reschedule = false;
>>>> + int index = fls(workload->submit_workload_status);
>>>> + int ret;
>>>> +
>>>> + mutex_lock(&workload->workload_lock);
>>>> + for (; index > 0; index--) {
>>>
>>> Why not use for_each_set_bit?
>>
>> We only clear the profiles that we have set. We clear the higher
>> profile first, then the lower one.
>>
>
> You don't need to take care of this. The swsmu layer will take care
> of the priority; it is not the job of this layer. swsmu is the layer
> that could be altered for each specific SOC, and it can take care of
> any priority changes accordingly. This layer only needs to refcount
> the requests and place them accordingly.
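For reference, the kernel's for_each_set_bit() visits set bits from the lowest bit upward, while the posted loop walks downward from fls(). A minimal userspace sketch of the ascending order (`visit_set_bits` is a hypothetical helper, not a kernel API):

```c
/* Userspace sketch of the iteration order for_each_set_bit() would give:
 * set bits are visited from the least-significant bit upward.  The real
 * macro lives in <linux/bitops.h>; this helper just records the order. */
static int visit_set_bits(unsigned long mask, int out[], int max)
{
	int n = 0;

	for (int bit = 0; bit < (int)(8 * sizeof(mask)) && n < max; bit++) {
		if (mask & (1ul << bit))
			out[n++] = bit;
	}
	return n;
}
```

Since the swsmu layer resolves priority on its own, the visiting order is not significant at this layer, which is the reviewer's point.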
>
>>
>>>
>>>> + int val = atomic_read(&workload->power_profile_ref[index]);
>>>> +
>>>> + if (val) {
>>>> + reschedule = true;
>>>
>>> Why do you need to do reschedule? For each put(), a schedule is
>>> called. If refcount is not zero, that means some other job has
>>> already set the profile. It is supposed to call put() and at that
>>> time, this job will be run to clear it anyway, right?
>>>
>> Yes, I have received a comment on this and am going to remove it.
>> Noted.
>>
>>>> + } else {
>>>> + if (workload->submit_workload_status &
>>>> + (1 << index)) {
>>>> + ret = amdgpu_power_profile_clear(adev, index);
>>>> + if (ret) {
>>>> +                DRM_WARN("Failed to clear workload %s,error = %d\n",
>>>> +                         amdgpu_workload_mode_name[index], ret);
>>>> + goto exit;
>>>> + }
>>>> + }
>>>> + }
>>>> + }
>>>> + if (reschedule)
>>>> + schedule_delayed_work(&workload->smu_delayed_work,
>>>> + SMU_IDLE_TIMEOUT);
>>>> +exit:
>>>> + mutex_unlock(&workload->workload_lock);
>>>> +}
>>>> +
>>>> +void amdgpu_workload_profile_put(struct amdgpu_device *adev,
>>>> + uint32_t ring_type)
>>>> +{
>>>> + struct amdgpu_smu_workload *workload = &adev->smu_workload;
>>>> +    enum PP_SMC_POWER_PROFILE profile = ring_to_power_profile(ring_type);
>>>> +
>>>> + if (profile == PP_SMC_POWER_PROFILE_BOOTUP_DEFAULT)
>>>> + return;
>>>> +
>>>> + mutex_lock(&workload->workload_lock);
>>>> +
>>>> + if (!atomic_read(&workload->power_profile_ref[profile])) {
>>>> + DRM_WARN("Power profile %s ref. count error\n",
>>>> + amdgpu_workload_mode_name[profile]);
>>>> + } else {
>>>> + atomic_dec(&workload->power_profile_ref[profile]);
>>>> + schedule_delayed_work(&workload->smu_delayed_work,
>>>> + SMU_IDLE_TIMEOUT);
>>>> + }
>>>> +
>>>> + mutex_unlock(&workload->workload_lock);
>>>> +}
>>>> +
>>>> void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> uint32_t ring_type)
>>>> {
>>>> @@ -70,13 +147,30 @@ void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> return;
>>>> mutex_lock(&workload->workload_lock);
>>>> + cancel_delayed_work_sync(&workload->smu_delayed_work);
>>>
>>> This is a potential deadlock. You already hold the mutex and then
>>> waiting for idle work to finish. Idle work could now be at the point
>>> where it is waiting for the same mutex. Suggest not to call cancel
>>> here and let the mutex take care of the sequence.
>> We cannot cancel the idle work if it is already running, so we have
>> to wait until the idle work is complete. If the *put function
>> arrives before the idle work has started, then we can cancel it;
>> but if the work thread is running, we should wait.
>
> No need to wait, because you already have a mutex. So you will
> naturally wait for the mutex lock to be released (if the idle work
> has already grabbed it). If a request comes in while the idle work
> is running, it is only a timing issue.
>
> Also, you have a deadlock here. Here you acquire the mutex first and
> then wait for the idle work to finish. The idle work function may
> have just started at that point and reached the place where it is
> about to grab the mutex. That is a deadlock: this function is
> waiting for the idle work to finish, and the idle work is waiting to
> get the mutex.
>
> Nevertheless, this function doesn't even need to take care of such
> things. It only grabs the mutex, increases the refcount, and places
> a request if the refcount becomes non-zero.
>
> At whatever point, idle work runs, it will see that the refcount is
> not zero and skips placing a request to clear that profile.
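The simplified set()/put() outlined here could look roughly like the following userspace model. `profile_set`, `profile_put`, `refcount`, and `workload_status` are hypothetical stand-ins for the driver's symbols; the point is that no cancel_delayed_work_sync() is needed while holding the lock:

```c
#define PP_PROFILE_COUNT 8  /* stand-in for the number of power profiles */

static int refcount[PP_PROFILE_COUNT];   /* models power_profile_ref[] */
static unsigned int workload_status;     /* models submit_workload_status */

/* set(): bump the refcount and place a request only on the 0 -> 1
 * transition.  In the driver this would run under workload_lock. */
static void profile_set(int profile)
{
	if (refcount[profile]++ == 0)
		workload_status |= 1u << profile;  /* one-time request */
}

/* put(): drop the refcount; the idle work (not modeled here) later
 * clears the status bit for profiles whose refcount is zero. */
static void profile_put(int profile)
{
	if (refcount[profile] > 0)
		refcount[profile]--;
}
```

Because set() never waits on the idle work, the mutex ordering alone serializes the two paths and the deadlock described above cannot occur.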
>
>>>
>>>> ret = amdgpu_power_profile_set(adev, profile);
>>>> if (ret) {
>>>>          DRM_WARN("Failed to set workload profile to %s, error = %d\n",
>>>> amdgpu_workload_mode_name[profile], ret);
>>>> + goto exit;
>>>> + }
>>>> +
>>>> + /* Clear the already finished jobs of higher power profile*/
>>>> + for (int index = fls(workload->submit_workload_status);
>>>> + index > profile; index--) {
>>>> + if (!atomic_read(&workload->power_profile_ref[index]) &&
>>>> + workload->submit_workload_status & (1 << index)) {
>>>> + ret = amdgpu_power_profile_clear(adev, index);
>>>> + if (ret) {
>>>> + DRM_WARN("Failed to clear workload %s, err = %d\n",
>>>> + amdgpu_workload_mode_name[profile], ret);
>>>> + goto exit;
>>>> + }
>>>> + }
>>>
>>> If you follow the earlier comment, that will keep this logic in
>>> only one place - i.e., the idle work handler. Basically, just let
>>> the idle work handle its duty. If some job starts running during
>>> the clear call, it's just unfortunate timing; let the next set()
>>> take the lock and request the profile again.
>>
>> So basically, every millisecond new jobs are arriving and
>> completing on the same or a different profile. Suppose we are
>> running higher-profile jobs, and before they complete a
>> lower-profile job arrives; this check helps move from the higher
>> profile to the lower profile once the higher-profile jobs finish.
>> If we do not check here, it will be stuck on the higher profile,
>> and until then the other jobs will also not complete. Please refer
>> to the Case 3 scenario.
>>
>
> As mentioned before, this is not the place to take care of
> SOC-specific power profile priorities. We already have the swsmu
> layer doing that job. This layer just needs to refcount and place
> requests accordingly.
>
> Thanks,
> Lijo
>
>>
>>> Thanks,
>>> Lijo
>>>
>>>> }
>>>> +exit:
>>>> mutex_unlock(&workload->workload_lock);
>>>> }
>>>> @@ -87,6 +181,8 @@ void amdgpu_workload_profile_init(struct amdgpu_device *adev)
>>>> adev->smu_workload.initialized = true;
>>>> mutex_init(&adev->smu_workload.workload_lock);
>>>> + INIT_DELAYED_WORK(&adev->smu_workload.smu_delayed_work,
>>>> + amdgpu_power_profile_idle_work_handler);
>>>> }
>>>> void amdgpu_workload_profile_fini(struct amdgpu_device *adev)
>>>> @@ -94,6 +190,7 @@ void amdgpu_workload_profile_fini(struct amdgpu_device *adev)
>>>> if (!adev->smu_workload.initialized)
>>>> return;
>>>> + cancel_delayed_work_sync(&adev->smu_workload.smu_delayed_work);
>>>> adev->smu_workload.submit_workload_status = 0;
>>>> adev->smu_workload.initialized = false;
>>>> mutex_destroy(&adev->smu_workload.workload_lock);
>>>> diff --git a/drivers/gpu/drm/amd/include/amdgpu_workload.h b/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> index 5022f28fc2f9..ee1f87257f2d 100644
>>>> --- a/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> +++ b/drivers/gpu/drm/amd/include/amdgpu_workload.h
>>>> @@ -46,6 +46,9 @@ static const char * const amdgpu_workload_mode_name[] = {
>>>> "Window3D"
>>>> };
>>>> +void amdgpu_workload_profile_put(struct amdgpu_device *adev,
>>>> + uint32_t ring_type);
>>>> +
>>>> void amdgpu_workload_profile_set(struct amdgpu_device *adev,
>>>> uint32_t ring_type);