linux-kernel - Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAK4VdL3Tp33Wi5fmsh92XFWP8GE3TZMa473a=FeZajgnHn2mbA@mail.gmail.com>
Date: Wed, 24 Jan 2024 00:19:16 +0100
From: Erico Nunes <nunes.erico@...il.com>
To: Qiang Yu <yuq825@...il.com>
Cc: dri-devel@...ts.freedesktop.org, lima@...ts.freedesktop.org, 
	anarsoul@...il.com, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, 
	David Airlie <airlied@...il.com>, Daniel Vetter <daniel@...ll.ch>, 
	Sumit Semwal <sumit.semwal@...aro.org>, christian.koenig@....com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts

On Fri, Jan 19, 2024 at 2:50 AM Qiang Yu <yuq825@...il.com> wrote:
>
> On Thu, Jan 18, 2024 at 7:14 PM Erico Nunes <nunes.erico@...il.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 2:36 AM Qiang Yu <yuq825@...il.com> wrote:
> > >
> > > So this is caused by same job trigger both done and timeout handling?
> > > I think a better way to solve this is to make sure only one handler
> > > (done or timeout) process the job instead of just making lima_pm_idle()
> > > unique.
> >
> > It's not very clear to me how to best ensure that, with the drm_sched
> > software timeout and the irq happening potentially at the same time.
> This could be done by stopping scheduler run more job and disable
> GP/PP interrupt. Then after sync irq, there should be no more new
> irq gets in when we handling timeout.
>
> > I think patch 4 in this series describes and covers the most common
> > case that this would be hit. So maybe now this patch could be dropped
> > in favour of just that one.
> Yes.

After dropping the patch while testing to send v2, I managed to
reproduce this issue again.
Looking at a trace it seems that this actually happened with the test workload:
lima_sched_timedout_job -> (fence by is not signaled and new fence
check is passed) -> irq happens preempting lima_sched_timedout_job,
fence is signaled and lima_pm_idle called once at
lima_sched_pipe_task_done -> lima_sched_timedout_job continues and
calls lima_pm_idle again

So I think we still need this patch to at least prevent the bug with
the current software timeout. If we move to the hardware watchdog
timeout later we might be able to remove it anyway, but that will
still require separate work and testing.

Erico