linux-kernel - Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system suspend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <284c0e65-6204-7747-b294-1f9fa23b1f91@amd.com>
Date:   Fri, 18 Mar 2022 16:14:47 -0400
From:   Andrey Grodzovsky <andrey.grodzovsky@....com>
To:     Rob Clark <robdclark@...il.com>
Cc:     Christian König <christian.koenig@....com>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        freedreno <freedreno@...ts.freedesktop.org>,
        linux-arm-msm <linux-arm-msm@...r.kernel.org>,
        Rob Clark <robdclark@...omium.org>,
        Sean Paul <sean@...rly.run>,
        Abhinav Kumar <quic_abhinavk@...cinc.com>,
        David Airlie <airlied@...ux.ie>,
        Akhil P Oommen <quic_akhilpo@...cinc.com>,
        Jonathan Marek <jonathan@...ek.ca>,
        AngeloGioacchino Del Regno 
        <angelogioacchino.delregno@...labora.com>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Vladimir Lypak <vladimir.lypak@...il.com>,
        open list <linux-kernel@...r.kernel.org>,
        "Alexander.Deucher@....com" <Alexander.Deucher@....com>
Subject: Re: [PATCH 2/3] drm/msm/gpu: Park scheduler threads for system
 suspend


On 2022-03-18 13:22, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky
> <andrey.grodzovsky@....com> wrote:
>>
>> On 2022-03-18 12:20, Rob Clark wrote:
>>> On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
>>> <andrey.grodzovsky@....com> wrote:
>>>> On 2022-03-17 16:35, Rob Clark wrote:
>>>>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
>>>>> <andrey.grodzovsky@....com> wrote:
>>>>>> On 2022-03-17 14:25, Rob Clark wrote:
>>>>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
>>>>>>> <andrey.grodzovsky@....com> wrote:
>>>>>>>> On 2022-03-17 13:35, Rob Clark wrote:
>>>>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
>>>>>>>>> <christian.koenig@....com> wrote:
>>>>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
>>>>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
>>>>>>>>>>> <christian.koenig@....com> wrote:
>>>>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
>>>>>>>>>>>>> [SNIP]
>>>>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
>>>>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
>>>>>>>>>>>>> and the scheduler shoveling more jobs at us.
>>>>>>>>>>>> Well exactly that's the problem. The scheduler is supposed to shoveling
>>>>>>>>>>>> more jobs at us until it is empty.
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking more about it we will then keep some dma_fence instance
>>>>>>>>>>>> unsignaled and that is and extremely bad idea since it can lead to
>>>>>>>>>>>> deadlocks during suspend.
>>>>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
>>>>>>>>>>> It is at least not a problem when vram is not involved.
>>>>>>>>>> No, it's much wider than that.
>>>>>>>>>>
>>>>>>>>>> See what can happen is that the memory management shrinkers want to wait
>>>>>>>>>> for a dma_fence during suspend.
>>>>>>>>> we don't wait on fences in shrinker, only purging or evicting things
>>>>>>>>> that are already ready.  Actually, waiting on fences in shrinker path
>>>>>>>>> sounds like a pretty bad idea.
>>>>>>>>>
>>>>>>>>>> And if you stop the scheduler they will just wait forever.
>>>>>>>>>>
>>>>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
>>>>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you have
>>>>>>>>>> created.
>>>>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
>>>>>>>>> more portable approach as far as generic solution for suspend.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> -R
>>>>>>>> I am not sure how this drains the scheduler ? Suppose we done the
>>>>>>>> waiting in drm_sched_entity_flush,
>>>>>>>> what prevents someone to push right away another job into the same
>>>>>>>> entity's queue  right after that ?
>>>>>>>> Shouldn't we first disable further pushing of jobs into entity before we
>>>>>>>> wait for  sched->job_scheduled ?
>>>>>>>>
>>>>>>> In the system suspend path, userspace processes will have already been
>>>>>>> frozen, so there should be no way to push more jobs to the scheduler,
>>>>>>> unless they are pushed from the kernel itself.
>>>>>>> amdgpu_device_suspend
>>>>>> It was my suspicion but I wasn't sure about it.
>>>>>>
>>>>>>
>>>>>>> We don't do that in
>>>>>>> drm/msm, but maybe you need to to move things btwn vram and system
>>>>>>> memory?
>>>>>> Exactly, that was my main concern - if we use this method we have to use
>>>>>> it in a point in
>>>>>> suspend sequence when all the in kernel job submissions activity already
>>>>>> suspended
>>>>>>
>>>>>>> But even in that case, if the # of jobs you push is bounded I
>>>>>>> guess that is ok?
>>>>>> Submissions to scheduler entities are using unbounded queue, the bounded
>>>>>> part is when
>>>>>> you extract next job from entity to submit to HW ring and it rejects if
>>>>>> submission limit reached (drm_sched_ready)
>>>>>>
>>>>>> In general - It looks to me at least that what we what we want her is
>>>>>> more of a drain operation then flush (i.e.
>>>>>> we first want to disable any further job submission to entity's queue
>>>>>> and then flush all in flight ones). As example
>>>>>> for this i was looking at  flush_workqueue vs. drain_workqueue
>>>>> Would it be possible for amdgpu to, in the system suspend task,
>>>>>
>>>>> 1) first queue up all the jobs needed to migrate bos out of vram, and
>>>>> whatever other housekeeping jobs are needed
>>>>> 2) then drain gpu scheduler's queues
>>>>> 3) and then finally wait for jobs executing on GPU to complete
>>>> We already do most of it in amdgpu_device_suspend,
>>>> amdgpu_device_ip_suspend_phase1
>>>> followed by amdgpu_device_evict_resources followed by
>>>> amdgpu_fence_driver_hw_fini is
>>>> exactly steps 1 + 3. What we are missing is step 2). For this step I
>>>> suggest adding a function
>>>> called drm_sched_entity_drain which basically sets entity->stopped =
>>>> true and then calls drm_sched_entity_flush.
>>>> This will both reject any new insertions into entity's job queue and
>>>> will flush all pending job submissions to HW from that entity.
>>>> One point is we need to make make drm_sched_entity_push_job return value
>>>> so the caller knows about job enqueue
>>>> rejection.
>>> Hmm, seems like job enqueue that is rejected because we are in the
>>> process of suspending should be more of a WARN_ON() sort of thing?
>>> Not sure if there is something sensible to do for the caller at that
>>> point?
>>
>> What about the job's fence the caller is waiting on ? If we rejected
>> job submission the caller must know about it to not get stuck waiting
>> on that fence.
>>
> Hmm, perhaps I'm not being imaginative enough, but this sort of
> scenario seems like it should only arise from a bug in the driver's
> suspend path, Ie. not doing all the job submission before shutting
> down the scheduler.  I don't think anything good is going to result
> either way, which is why I was thinking you'd want a WARN_ON() to help
> debug/fix that case.


Yes, I just wanted the code to not allow such bugs to go through 
unnoticed. I guess
WARN_ON should give laud enough warning anyway

Andrey


>>>> What about runtime suspend ? I guess same issue with scheduler racing
>>>> against HW susppend is relevant there ?
>>> Runtime suspend should be ok, as long as the driver holds a runpm
>>> reference whenever the hw needs to be awake.  The problem with system
>>> suspend (at least if you are using pm_runtime_force_suspend() or doing
>>> something equivalent) is that it bypasses the runpm reference.
>>> (Which, IMO, seems like a bad design..)
>>
>> I am not totally clear  yet - can you expand a bit why one case is ok
>> but the other
>> problematic ?
>>
> Sure, normally pm_runtime_get/put increment a reference count, as long
> as there have been more get's than puts, the device won't runtime
> suspend.  So, for ex, msm's run_job fxn does a pm_runtime_get_sync().
> And retire_submit() which runs after job completes on GPU does a
> pm_runtime_put_autosuspend().
>
> System suspend, OTOH, bypasses this reference counting.  Which is why
> extra care is needed.
>
> BR,
> -R
>
>
>> Andrey
>>
>>
>>>> Also, could you point to a particular buggy scenario where the race
>>>> between SW shceduler and suspend is causing a problem ?
>>> I wrote a piglit test[1] to try to trigger this scenario.. it isn't
>>> really that easy to hit
>>>
>>> BR,
>>> -R
>>>
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fmesa%2Fpiglit%2F-%2Fmerge_requests%2F643&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C35f0d7d9282044651c9708da0903d4f4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832209324217553%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=dwjPEVAYgCI%2BtEyzBirfAQkJjZax2NdiLQfNeFfImtU%3D&amp;reserved=0
>>>
>>>> Andrey
>>>>
>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> BR,
>>>>>>> -R