linux-kernel - Re: [RFC PATCH] drm/amdgpu: Avoid unnecessary Call Traces in amdgpu_irq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADnq5_MZ9Bk+nMEuA0sfLofHy1+0SmRHTutX-yDR9OzFxbqYmQ@mail.gmail.com>
Date: Fri, 16 Jan 2026 11:34:28 -0500
From: Alex Deucher <alexdeucher@...il.com>
To: Christian König <christian.koenig@....com>
Cc: Tiezhu Yang <yangtiezhu@...ngson.cn>, Alex Deucher <alexander.deucher@....com>, 
	Alan Liu <haoping.liu@....com>, amd-gfx@...ts.freedesktop.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] drm/amdgpu: Avoid unnecessary Call Traces in amdgpu_irq_put()

On Fri, Jan 16, 2026 at 5:42 AM Christian König
<christian.koenig@....com> wrote:
>
> On 1/16/26 02:20, Tiezhu Yang wrote:
> > On 2026/1/15 下午9:47, Christian König wrote:
> >> On 1/15/26 02:28, Tiezhu Yang wrote:
> >>> Currently, there are many Call Traces when booting kernel on LoongArch,
> >>> here are the trimmed kernel log messages:
> >>>
> >>>    amdgpu 0000:07:00.0: amdgpu: hw_init of IP block <gfx_v6_0> failed -110
> >>>    amdgpu 0000:07:00.0: amdgpu: amdgpu_device_ip_init failed
> >>>    amdgpu 0000:07:00.0: amdgpu: Fatal error during GPU init
> >>>    amdgpu 0000:07:00.0: amdgpu: amdgpu: finishing device.
> >>>    ------------[ cut here ]------------
> >>>    WARNING: drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:639 at amdgpu_irq_put+0xb0/0x140 [amdgpu], CPU#0: kworker/0:0/9
> >>>    ...
> >>>    Call Trace:
> >>>    [<90000000047a8524>] show_stack+0x64/0x190
> >>>    [<90000000047a0614>] dump_stack_lvl+0x6c/0x9c
> >>>    [<90000000047cef34>] __warn+0xa4/0x1b0
> >>>    [<90000000060a4884>] __report_bug+0xa4/0x1d0
> >>>    [<90000000060a4a88>] report_bug+0x38/0xd0
> >>>    [<90000000060df330>] do_bp+0x260/0x410
> >>>    [<90000000047a6bc0>] handle_bp+0x120/0x1c0
> >>>    [<ffff8000028bff40>] amdgpu_irq_put+0xb0/0x140 [amdgpu]
> >>>    [<ffff8000027b1a8c>] amdgpu_fence_driver_hw_fini+0x12c/0x180 [amdgpu]
> >>>    [<ffff800002f2c04c>] amdgpu_device_fini_hw+0xf0/0x3fc [amdgpu]
> >>>    [<ffff80000279e2ac>] amdgpu_driver_load_kms+0x7c/0xa0 [amdgpu]
> >>>    [<ffff800002791128>] amdgpu_pci_probe+0x298/0x810 [amdgpu]
> >>>    [<90000000054d04a4>] local_pci_probe+0x44/0xc0
> >>>    [<90000000047f4ab0>] work_for_cpu_fn+0x20/0x40
> >>>    [<90000000047f93e0>] process_one_work+0x170/0x4e0
> >>>    [<90000000047fa14c>] worker_thread+0x3ac/0x4e0
> >>>    [<9000000004806824>] kthread+0x154/0x170
> >>>    [<90000000060df5b4>] ret_from_kernel_thread+0x24/0xd0
> >>>    [<90000000047a62a4>] ret_from_kernel_thread_asm+0xc/0x88
> >>>
> >>>    ---[ end trace 0000000000000000 ]---
> >>>    amdgpu 0000:07:00.0: probe with driver amdgpu failed with error -110
> >>>    amdgpu 0000:07:00.0: amdgpu: amdgpu: ttm finalized
> >>>
> >>> This is because amdgpu_irq_enabled() is false in amdgpu_irq_put(), then
> >>> the condition of WARN_ON() is true.
> >>>
> >>> In order to avoid the unnecessary Call Traces, it can remove the check of
> >>> amdgpu_irq_enabled() and only check atomic_read(&src->enabled_types[type]
> >>> for three reasons:
> >>>
> >>> (1) The aim is to prevent refcount from being less than 0, it was added in
> >>> commit 1fa8d710573f ("drm/amdgpu: Fix desktop freezed after gpu-reset").
> >>> (2) There are already many useful failed log before the Call Trace, there
> >>> is no need to WARN_ON().
> >>
> >> Well completely disagree. The call trace here is absolutely intentional.
> >
> > If so, since the call trace is same, is it enough to use WARN_ON_ONCE()
> > here?
>
> I also don't see a justification for that.
>
> >> That you get a lot of other backtraces because the driver doesn't initialize at all isn't a good rational to remove this one here.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> (3) The following checks in amdgpu_irq_put() are same with the checks in
> >>> amdgpu_irq_enabled(), there is no need to do the redundant operations.
> >>>
> >>>     if (!adev->irq.installed)
> >>>         return -ENOENT;
> >>>
> >>>     if (type >= src->num_types)
> >>>         return -EINVAL;
> >>>
> >>>     if (!src->enabled_types || !src->funcs->set)
> >>>         return -EINVAL;
> >
> > Is this reasonable? Only check atomic_read(&src->enabled_types[type]?
>
> No, absolutely not. That are two completely different things.
>
> > That is to say, does it make sense to do the following change?
>
> The warning can basically only be triggered by two conditions:
> 1. A fatal problem while loading the driver and the error handling is not 100% clean.
> 2. A driver coding error.
>
> And we really need to catch all of those, so there is no real rational to limit the warning.
>
> I mean when you run into any of those they should potentially be fixed at some point.

The problem is there are a lot of error paths which don't get this
right that no one has fixed for years.  I'm not sure if there is much
value anymore; most people just ignore them.

Alex

>
> Regards,
> Christian.
>
> >
> > ----->8-----
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> > index 8112ffc85995..d10d6fcc525e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> > @@ -636,7 +636,7 @@ int amdgpu_irq_put(struct amdgpu_device *adev, struct amdgpu_irq_src *src,
> >         if (!src->enabled_types || !src->funcs->set)
> >                 return -EINVAL;
> >
> > -       if (WARN_ON(!amdgpu_irq_enabled(adev, src, type)))
> > +       if (WARN_ON_ONCE(!atomic_read(&src->enabled_types[type])))
> >                 return -EINVAL;
> >
> >         if (atomic_dec_and_test(&src->enabled_types[type]))
> >
> > Thanks,
> > Tiezhu
> >
>