Date:   Thu, 6 Oct 2022 16:45:15 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Dave Airlie <airlied@...il.com>
Cc:     Alex Deucher <alexdeucher@...il.com>,
        Alex Deucher <alexander.deucher@....com>,
        Christian König <christian.koenig@....com>,
        Daniel Vetter <daniel.vetter@...ll.ch>,
        LKML <linux-kernel@...r.kernel.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>
Subject: Re: [git pull] drm for 6.1-rc1

On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@...il.com> wrote:
>
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]

As far as I can tell, that's the line

        struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
   5: 41 54                push   %r12
   7: 55                    push   %rbp
   8: 53                    push   %rbx
   9: 48 89 fb              mov    %rdi,%rbx
   c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
  13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
  1a: 48 8b 85 80 01 00 00 mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

        atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().
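
To illustrate why a NULL 's_fence' shows up as a fault at address 0x88
rather than 0, here is a minimal sketch. The struct below is a
hypothetical stand-in, not the real drm_sched_fence layout: its padding
is chosen only so that 'sched' sits at offset 0x88, which is what the
trapping 'mov 0x88(%rdi),%rbp' implies. The helper name is likewise made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fence structure. The padding exists only
 * to place 'sched' at offset 0x88, matching the displacement in the oops;
 * the real kernel struct layout is not reproduced here. */
struct fake_sched_fence {
        unsigned char other_members[0x88];
        void *sched;
};

/* Compile-time check that the stand-in has the offset we claim. */
static_assert(offsetof(struct fake_sched_fence, sched) == 0x88,
              "sched must sit at offset 0x88 in this sketch");

/* Address the CPU would touch when loading 'sched' through 'base'.
 * This only does the arithmetic; nothing is dereferenced. */
static uintptr_t sched_load_address(uintptr_t base)
{
        return base + offsetof(struct fake_sched_fence, sched);
}
```

With a base of 0, sched_load_address(0) evaluates to 0x88, the same
value the oops reports as the faulting address.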

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

        if (kref_read(&job->s_fence->finished.refcount)) {
                /* drm_sched_job_arm() has been called */
                dma_fence_put(&job->s_fence->finished);
        ...

but then it does

        job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong: a refcount is a _count_, not a flag, so there
could be multiple references to it. What says that you can just
decrement one of them and say "I'm done"?
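
The "count, not a flag" point can be shown with a small user-space
model. This is only a sketch under stated assumptions: the toy_* types
and functions are hypothetical, use C11 atomics instead of kref, and
mirror just the shape of the quoted cleanup (check for nonzero, drop one
reference, clear the pointer). It runs single-threaded, so it shows the
leftover reference but not the race itself.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins, not kernel types. */
struct toy_fence { atomic_int refcount; };
struct toy_job   { struct toy_fence *s_fence; };

static void toy_fence_get(struct toy_fence *f)
{
        atomic_fetch_add(&f->refcount, 1);
}

/* Drops one reference and returns how many remain. */
static int toy_fence_put(struct toy_fence *f)
{
        return atomic_fetch_sub(&f->refcount, 1) - 1;
}

/* Same shape as the quoted cleanup: treat "nonzero" as a flag, drop a
 * single reference, then clear the pointer regardless of the result. */
static int toy_job_cleanup(struct toy_job *job)
{
        int remaining = 0;

        if (atomic_load(&job->s_fence->refcount))
                remaining = toy_fence_put(job->s_fence);
        job->s_fence = NULL;
        return remaining;
}

/* One fence, two holders. Returns the references still outstanding after
 * cleanup, or -1 if the job pointer was unexpectedly left set. */
static int demo_remaining_after_cleanup(void)
{
        struct toy_fence fence;
        struct toy_job job = { .s_fence = &fence };

        atomic_init(&fence.refcount, 1);  /* the job's own reference */
        toy_fence_get(&fence);            /* a second, still-active holder */

        int remaining = toy_job_cleanup(&job);
        return job.s_fence == NULL ? remaining : -1;
}
```

In this model demo_remaining_after_cleanup() returns 1: one holder still
has a live reference, yet job->s_fence is already NULL, so any later
path that goes through the job to reach the fence would dereference
NULL.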

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.

                 Linus
