lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK4VdL2oRsvCMAWsMyvT+MpPz5e=7=mVtMWAbJKopLEOM0_A+w@mail.gmail.com>
Date: Sun, 21 Jan 2024 16:11:41 +0100
From: Erico Nunes <nunes.erico@...il.com>
To: Qiang Yu <yuq825@...il.com>
Cc: Hillf Danton <hdanton@...a.com>, dri-devel@...ts.freedesktop.org, 
	lima@...ts.freedesktop.org, David Airlie <airlied@...il.com>, 
	Daniel Vetter <daniel@...ll.ch>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency

On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@...il.com> wrote:
>
> On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@...a.com> wrote:
> >
> > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@...il.com>
> > >
> > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > >       struct lima_sched_task *task = to_lima_task(job);
> > >       struct lima_device *ldev = pipe->ldev;
> > > +     struct lima_ip *ip = pipe->processor[0];
> > > +
> > > +     /*
> > > +      * If the GPU managed to complete this jobs fence, the timeout is
> > > +      * spurious. Bail out.
> > > +      */
> > > +     if (dma_fence_is_signaled(task->done_fence)) {
> > > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > > +     }
> >
> > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> > and stop selling bandaid like this because you have options like locating
> > the reasons behind timeout.
>
> This chang do look like to aim for 2FPS apps. Maybe 500ms is too short
> for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses
> 10s timeout. So increasing the timeout seems to be an equivalent and better
> way?

Indeed 500ms might be too optimistic for the sort of applications that
users expect to run on this hardware currently. For a more similar
reference though, other embedded drivers like v3d and panfrost do
still set it to 500ms. Note that this patch is just exactly the same
as exists in Panfrost today and was already discussed with some common
arguments in the patches of this series:
https://patchwork.freedesktop.org/series/120820/

But I would agree to bump the timeout to a higher value for lima. Some
distributions are already doing this with module parameters anyway to
even be able to run some more demanding application stacks on a Mali
400.

Another thing we might consider (probably in a followup patchset to
not delay these fixes forever for the people hitting this issue) is to
configure the Mali hardware watchdog to the value that we would like
as a timeout. That way we would get timeout jobs going through the
same error irq path as other hardware error jobs and might be able to
delete(?)/simplify this software timeout code.

In the meantime for v2 of this series I'll make the change to account
for the multiple pp irqs. So do you agree it is ok to leave
drm_sched_stop() as it is at least for this series?

Thanks all for the reviews

Erico

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ