Message-ID: <BL0PR12MB2465DC4AC9C28B4B950CDB42F1519@BL0PR12MB2465.namprd12.prod.outlook.com>
Date: Tue, 11 Jan 2022 03:20:22 +0000
From: "Chen, Guchun" <Guchun.Chen@....com>
To: "Deucher, Alexander" <Alexander.Deucher@....com>,
"Koenig, Christian" <Christian.Koenig@....com>,
Len Brown <lenb@...nel.org>,
"torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"Grodzovsky, Andrey" <Andrey.Grodzovsky@....com>
CC: "linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Len Brown <len.brown@...el.com>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: RE: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when
calling hw_fini (v2)"
[Public]
Hi Alex/Christian,
This patch calls drm_sched_stop to stop the scheduler before amdgpu_fence_wait_empty. Otherwise there is a possible race in which the drm scheduler keeps submitting commands to the hardware during suspend, so amdgpu_fence_wait_empty never gets a chance to see the ring empty. This is based on an earlier discussion with Andrey.
In Brown's case, without this patch, his test runs fine for 10 hours. With the patch applied, however, the issue occurs in under an hour. I suspect this patch exposes another underlying problem: if the patch itself were completely wrong, the test would break on the first suspend/resume cycle instead of failing after several cycles.
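To illustrate the ordering argument, here is a small toy model in plain C. It is only a sketch under assumptions: struct toy_ring, toy_tick, toy_wait_empty and toy_suspend are hypothetical names made up for this example, not the drm_sched or amdgpu APIs, and the one-job-per-tick timing is invented. It only shows why waiting for outstanding fences can time out while a scheduler is still feeding the ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model only. struct toy_ring and the toy_* helpers are hypothetical
 * names for illustration; they are not the drm_sched or amdgpu APIs.
 */
struct toy_ring {
	int queued;         /* jobs still sitting in the scheduler */
	int in_flight;      /* fences emitted but not yet signaled */
	bool sched_running; /* cleared once the scheduler is stopped */
};

/* One time step: a running scheduler pushes one queued job to the
 * hardware, then the hardware signals one outstanding fence. */
void toy_tick(struct toy_ring *ring)
{
	if (ring->sched_running && ring->queued > 0) {
		ring->queued--;
		ring->in_flight++;
	}
	if (ring->in_flight > 0)
		ring->in_flight--;
}

/* Wait until no fences are outstanding, giving up after max_ticks steps.
 * Returns 0 when the ring drained and -1 on timeout. */
int toy_wait_empty(struct toy_ring *ring, int max_ticks)
{
	assert(ring != NULL);
	for (int t = 0; t < max_ticks && ring->in_flight > 0; t++)
		toy_tick(ring);
	return ring->in_flight == 0 ? 0 : -1;
}

/* Suspend path: optionally stop the scheduler before waiting. */
int toy_suspend(struct toy_ring *ring, bool stop_sched_first, int max_ticks)
{
	if (stop_sched_first)
		ring->sched_running = false;
	return toy_wait_empty(ring, max_ticks);
}
```

With queued = 10, in_flight = 2 and a budget of 5 ticks, toy_suspend(&ring, false, 5) times out with -1 because every retired fence is replaced by a new submission, while toy_suspend(&ring, true, 5) on a fresh ring drains in two ticks and returns 0.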
https://bugzilla.kernel.org/show_bug.cgi?id=215315
Anyway, we can revert it for now, and I will continue investigating the root cause.
Regards,
Guchun
-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@....com>
Sent: Tuesday, January 11, 2022 12:26 AM
To: Koenig, Christian <Christian.Koenig@....com>; Len Brown <lenb@...nel.org>; torvalds@...ux-foundation.org; Chen, Guchun <Guchun.Chen@....com>; Grodzovsky, Andrey <Andrey.Grodzovsky@....com>
Cc: linux-pm@...r.kernel.org; linux-kernel@...r.kernel.org; Len Brown <len.brown@...el.com>; stable@...r.kernel.org
Subject: RE: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"
[Public]
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@....com>
> Sent: Monday, January 10, 2022 11:16 AM
> To: Deucher, Alexander <Alexander.Deucher@....com>; Len Brown
> <lenb@...nel.org>; torvalds@...ux-foundation.org; Chen, Guchun
> <Guchun.Chen@....com>; Grodzovsky, Andrey <Andrey.Grodzovsky@....com>
> Cc: linux-pm@...r.kernel.org; linux-kernel@...r.kernel.org; Len Brown
> <len.brown@...el.com>; stable@...r.kernel.org
> Subject: Re: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler
> when calling hw_fini (v2)"
>
> Am 10.01.22 um 17:08 schrieb Deucher, Alexander:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Len Brown <lenb417@...il.com> On Behalf Of Len Brown
> >> Sent: Sunday, January 9, 2022 1:12 PM
> >> To: torvalds@...ux-foundation.org
> >> Cc: linux-pm@...r.kernel.org; linux-kernel@...r.kernel.org; Len
> >> Brown <len.brown@...el.com>; Chen, Guchun <Guchun.Chen@....com>;
> >> Grodzovsky, Andrey <Andrey.Grodzovsky@....com>; Koenig, Christian
> >> <Christian.Koenig@....com>; Deucher, Alexander
> >> <Alexander.Deucher@....com>; stable@...r.kernel.org
> >> Subject: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when
> >> calling hw_fini (v2)"
> >>
> >> From: Len Brown <len.brown@...el.com>
> >>
> >> This reverts commit f7d6779df642720e22bffd449e683bb8690bd3bf.
> >>
> >> This bisected regression has impacted suspend-resume stability since
> >> 5.15-rc1. It regressed -stable via 5.14.10.
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=215315
> >>
> >> Fixes: f7d6779df64 ("drm/amdgpu: stop scheduler when calling hw_fini (v2)")
> >> Cc: Guchun Chen <guchun.chen@....com>
> >> Cc: Andrey Grodzovsky <andrey.grodzovsky@....com>
> >> Cc: Christian Koenig <christian.koenig@....com>
> >> Cc: Alex Deucher <alexander.deucher@....com>
> >> Cc: <stable@...r.kernel.org> # 5.14+
> >> Signed-off-by: Len Brown <len.brown@...el.com>
> > @Chen, Guchun, @Grodzovsky, Andrey, @Koenig, Christian
> >
> > Any ideas? What's the consequence of reverting this patch? Didn't this
> > patch fix another suspend/resume issue?
>
> I think Guchun was just trying to adapt to our removal of the scheduler
> stop from the fence driver hw fini path.
>
> Not sure if that actually fixed something or was just a precaution.
Thanks. I'll wait for feedback from Guchun and Andrey and if they are ok with it, I'll apply the revert.
Alex
>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 --------
> >> 1 file changed, 8 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> index 9afd11ca2709..45977a72b5dd 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >> @@ -547,9 +547,6 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
> >> if (!ring || !ring->fence_drv.initialized)
> >> continue;
> >>
> >> - if (!ring->no_scheduler)
> >> - drm_sched_stop(&ring->sched, NULL);
> >> -
> >> /* You can't wait for HW to signal if it's gone */
> >> if (!drm_dev_is_unplugged(adev_to_drm(adev)))
> >> r = amdgpu_fence_wait_empty(ring);
> >> @@ -609,11 +606,6 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
> >> if (!ring || !ring->fence_drv.initialized)
> >> continue;
> >>
> >> - if (!ring->no_scheduler) {
> >> - drm_sched_resubmit_jobs(&ring->sched);
> >> - drm_sched_start(&ring->sched, true);
> >> - }
> >> -
> >> /* enable the interrupt */
> >> if (ring->fence_drv.irq_src)
> >> amdgpu_irq_get(adev, ring->fence_drv.irq_src,
> >> --
> >> 2.25.1