linux-kernel - Re: [git pull] drm for 6.1-rc1

Message-ID: <CAPM=9tyjMUxAQnJJBVnXXc0tQTjywiK8PLxbJ_Jz4T_pcEospA@mail.gmail.com>
Date:   Fri, 7 Oct 2022 12:54:02 +1000
From:   Dave Airlie <airlied@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Arvind.Yadav@....com
Cc:     Alex Deucher <alexdeucher@...il.com>,
        Alex Deucher <alexander.deucher@....com>,
        Christian König <christian.koenig@....com>,
        Daniel Vetter <daniel.vetter@...ll.ch>,
        LKML <linux-kernel@...r.kernel.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>
Subject: Re: [git pull] drm for 6.1-rc1

On Fri, 7 Oct 2022 at 12:45, Dave Airlie <airlied@...il.com> wrote:
>
> On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@...il.com> wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> >         struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> >    5: 41 54                push   %r12
> >    7: 55                    push   %rbp
> >    8: 53                    push   %rbx
> >    9: 48 89 fb              mov    %rdi,%rbx
> >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
> >   13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
> >   1a: 48 8b 85 80 01 00 00 mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> >         atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> >         if (kref_read(&job->s_fence->finished.refcount)) {
> >                 /* drm_sched_job_arm() has been called */
> >                 dma_fence_put(&job->s_fence->finished);
> >         ...
> >
> > but then it does
> >
> >         job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
>
> I chased the culprit down to a drm sched patch; I'll send you a pull
> with a revert in it.
>
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav <Arvind.Yadav@....com>
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
>     drm/sched: Use parent fence instead of finished
>
>     Using the parent fence instead of the finished fence
>     to get the job status. This change is to avoid GPU
>     scheduler timeout error which can cause GPU reset.
>
>     Signed-off-by: Arvind Yadav <Arvind.Yadav@....com>
>     Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@....com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
>     Signed-off-by: Christian König <christian.koenig@....com>
>
> I'll let Arvind and Christian maybe work out what is going wrong there.

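The quoted oops reports a fault address of 0x88, which matches the displacement in the trapping `mov 0x88(%rdi),%rbp`: a member load through a NULL base pointer faults at the member's offset. Here is a minimal sketch of that arithmetic. The `fake_*` structs and `member_load_address()` are hypothetical stand-ins, padded only so the offsets line up with the disassembly; they are not the real drm_sched_fence or drm_gpu_scheduler layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real kernel layouts: the padding is
 * chosen only so the member offsets match the quoted disassembly. */
struct fake_scheduler {
	char pad[0xf0];
	int hw_rq_count;		/* cf. "lock decl 0xf0(%rbp)" */
};

struct fake_sched_fence {
	char pad[0x88];
	struct fake_scheduler *sched;	/* cf. "mov 0x88(%rdi),%rbp" */
};

/* Compile-time checks that the assumed offsets really are 0x88 and 0xf0. */
static_assert(offsetof(struct fake_sched_fence, sched) == 0x88,
	      "sched is assumed to sit at offset 0x88");
static_assert(offsetof(struct fake_scheduler, hw_rq_count) == 0xf0,
	      "hw_rq_count is assumed to sit at offset 0xf0");

/* Address the CPU would access when loading a member through 'base'.
 * This is pure arithmetic; nothing is dereferenced. */
static inline uintptr_t member_load_address(uintptr_t base, size_t member_offset)
{
	return base + member_offset;
}
```

With a base of 0 (a NULL s_fence) and the offset of `sched`, the helper returns 0x88, the address in the BUG line.
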
I do spy two changes queued for -next that might be relevant, so I
might try just pulling those instead.

I'll send a PR in the next hour once I test it.

Dave.
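
The refcount concern quoted earlier in the thread (a count read as if it were a flag, followed by an unconditional `job->s_fence = NULL`) can be modelled in userspace. This is a toy sketch only: the `toy_*` types and functions are hypothetical, it copies just the shape of the quoted snippet, and it does not claim to reproduce the actual bug, which was traced to the commit named above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical toy types, not the kernel's dma_fence / drm_sched_job. */
struct toy_fence {
	atomic_int refcount;
};

struct toy_job {
	struct toy_fence *s_fence;
};

/* Take one reference. */
static inline void toy_get(struct toy_fence *f)
{
	atomic_fetch_add(&f->refcount, 1);
}

/* Drop one reference and return the remaining count. */
static inline int toy_put(struct toy_fence *f)
{
	return atomic_fetch_sub(&f->refcount, 1) - 1;
}

/* Same shape as the quoted cleanup: the count is tested like an
 * "armed?" flag, one reference is dropped, and the pointer is then
 * cleared no matter how many references remain. */
static inline void toy_job_cleanup(struct toy_job *job)
{
	if (atomic_load(&job->s_fence->refcount))
		toy_put(job->s_fence);
	job->s_fence = NULL;
}

/* Stand-in for a completion path that reaches the fence via the job.
 * Returns -1 where a real dereference of job->s_fence would fault. */
static inline int toy_job_done(struct toy_job *job)
{
	if (!job->s_fence)
		return -1;
	return 0;
}
```

If a second holder takes a reference before cleanup runs, the fence is still alive afterwards (its count is 1), yet `job->s_fence` is already NULL, so any later path going through the job sees a NULL pointer.
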