Message-ID: <CABXGCsNYEa33cvm_rfuORYvKAbUFk6ZNOhmMAsQk-7xQVy4xDw@mail.gmail.com>
Date: Wed, 3 May 2023 00:28:58 +0500
From: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To: "Chen, Guchun" <Guchun.Chen@....com>
Cc: "Koenig, Christian" <Christian.Koenig@....com>,
Daniel Vetter <daniel.vetter@...ll.ch>,
dri-devel <dri-devel@...ts.freedesktop.org>,
amd-gfx list <amd-gfx@...ts.freedesktop.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Wed, Apr 26, 2023 at 7:00 AM Chen, Guchun <Guchun.Chen@....com> wrote:
>
> After reviewing this whole history, maybe attached patch is able to fix your problem. Can you have a try please?
>
> Regards,
> Guchun
>
Thanks, I tested this patch for 6 days.
The error "BUG: KASAN: null-ptr-deref in
drm_sched_job_cleanup+0x96" no longer appears.
However, I have started to see GPU hangs that occur randomly after a
"[gfxhub] page fault" message.
I am not sure whether there is anything useful in the page fault messages:
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:1 pasid:32779, for process steamwebhelper pid 15552 thread
steamwebhe:cs0 pid 15832)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x00008001012c3000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00141051
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:2 pasid:32794, for process EvilDead-Win64- pid 12883 thread
EvilDead-W:cs0 pid 13035)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x00008001e62a5000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201030
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:1 pasid:32770, for process Xwayland pid 3706 thread Xwayland:cs0
pid 3713)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x0000800100c04000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:2 pasid:32784, for process thedivision.exe pid 168608 thread
thedivision.exe pid 168733)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x0000800000372000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00240C51
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1
amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:5 pasid:32797, for process thedivision.exe pid 9902 thread
thedivision.exe pid 9962)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x000080013b3cc000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00500830
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0
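For anyone reading the raw GCVM_L2_PROTECTION_FAULT_STATUS values, here is a minimal Python sketch that splits the register into the fields the driver prints. The bit layout in FIELDS is an assumption: it reproduces the decoded values of all five faults quoted above (including the vmid from the header line), but it should be checked against the register headers for the ASIC in question. The names FIELDS and decode_fault_status are illustrative only.

```python
# Assumed bit layout: (shift, width) per field. It matches the five faults
# quoted above, but it is not taken from any authoritative register header.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),      # printed by the driver as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Return a dict mapping each field name to its value in `status`."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The five status values from the log excerpt above.
    for status in (0x00141051, 0x00201030, 0x00101031, 0x00240C51, 0x00500830):
        fields = decode_fault_status(status)
        decoded = " ".join(f"{k}={v:#x}" for k, v in fields.items())
        print(f"{status:#010x}: {decoded}")
```

Comparing the decoded fields across faults makes it easier to see what differs between them, for example the client ID (TCP, CPG, CPF) or whether the access was a read or a write.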
Since the hangs occur randomly, it is very difficult to tie
them to any particular change.
I would really like to add Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>,
but I am not sure I have the right to do so while the GPU is still
unstable for some unknown reason.
The full kernel logs are attached below.
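To help look for a pattern across several dmesg captures, a rough Python sketch like the one below can tally "[gfxhub] page fault" headers per process. It only assumes the header format seen in the excerpts above; the names FAULT_RE, parse_faults and tally_by_process are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Matches the fault header as it appears in the excerpts above, after
# whitespace has been collapsed to single spaces.
FAULT_RE = re.compile(
    r"\[gfxhub\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process (?P<process>.+?) pid (?P<pid>\d+) "
    r"thread (?P<thread>.+?) pid (?P<tid>\d+)\)"
)


def parse_faults(text):
    """Return one dict per fault header found in `text`."""
    # Collapse newlines and runs of spaces so wrapped lines still match.
    flat = " ".join(text.split())
    return [m.groupdict() for m in FAULT_RE.finditer(flat)]


def tally_by_process(text):
    """Count fault headers per process name."""
    return Counter(fault["process"] for fault in parse_faults(text))


if __name__ == "__main__":
    sample = """
    amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
    vmid:1 pasid:32779, for process steamwebhelper pid 15552 thread
    steamwebhe:cs0 pid 15832)
    amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
    vmid:5 pasid:32797, for process thedivision.exe pid 9902 thread
    thedivision.exe pid 9962)
    """
    for process, count in tally_by_process(sample).most_common():
        print(process, count)
```

In practice the input would be the text of an unpacked dmesg file rather than the inline sample.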
On Wed, Apr 26, 2023 at 4:50 PM Christian König
<ckoenig.leichtzumerken@...il.com> wrote:
>
> Sending that once more from my mailing list address since AMD internal
> servers are blocking the mail.
>
> Regards,
> Christian.
>
> Am 26.04.23 um 13:48 schrieb Christian König:
> > WTF? I owe you a beer!
> >
> > I've fixed exactly that problem during the review process of the
> > cleanup patch and because of this didn't consider that the code is
> > still there.
> >
> > It also explains why we don't see that in our testing.
> >
> > @Mikhail can you test that patch with drm-misc-next?
Christian, on drm-misc-next, should I test Guchun's patch or
something else?
I have already tested Guchun's patch on top of 6.4-git58390c8ce1bd and
shared my results above.
--
Best Regards,
Mike Gavrilov.
Download attachment "dmesg-gfxhub-page-fault-7.tar.xz" of type "application/octet-stream" (42600 bytes)
Download attachment "dmesg-gfxhub-page-fault-6.tar.xz" of type "application/octet-stream" (49012 bytes)
Download attachment "dmesg-gfxhub-page-fault-5.tar.xz" of type "application/octet-stream" (44816 bytes)
Download attachment "dmesg-gfxhub-page-fault-4.tar.xz" of type "application/octet-stream" (48236 bytes)
Download attachment "dmesg-gfxhub-page-fault-3.tar.xz" of type "application/octet-stream" (54836 bytes)