Message-ID: <CAKGbVbttU7Ru7-AuB-+sLCUZRvb8nmGZAE-mwq0WU7-3p=VA=w@mail.gmail.com>
Date: Wed, 16 Apr 2025 09:53:19 +0800
From: Qiang Yu <yuq825@...il.com>
To: Erico Nunes <nunes.erico@...il.com>
Cc: Christian König <christian.koenig@....com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
dri-devel@...ts.freedesktop.org, lima@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] drm/lima: implement the file flush callback

On Mon, Apr 14, 2025 at 9:19 PM Erico Nunes <nunes.erico@...il.com> wrote:
>
> On Thu, Apr 10, 2025 at 4:04 PM Christian König
> <christian.koenig@....com> wrote:
> >
> > On 10.04.25 15:56, Qiang Yu wrote:
> > >>>> This prevents applications with multiple contexts from running into a
> > >>>> race condition between running tasks and context destroy when
> > >>>> terminating.
> > >>>>
> > > If implementing the flush callback fixes the problem, it must not be the SIGKILL case.
> >
> > SIGKILL also calls flush (again, eventually), but we can detect that case in the scheduler code.
> >
> > See the code and comment in drm_sched_entity_flush(). We could potentially improve the comment because it's not 100% correct, but it should work in all cases.
> >
> > > Could you describe the problem and how this fix solves it? I don't understand
> > > how the above difference could fix a race bug.
> >
> > That was the point I wasn't sure about either. It should *not* fix any race; it just gives a nicer SIGKILL experience.
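
For reference, the mechanism being discussed looks roughly like the
sketch below, modeled on how other drivers wire drm_sched_entity_flush()
into the flush file operation. drm_sched_entity_flush() and
MAX_WAIT_SCHED_ENTITY_Q_EMPTY are the real scheduler API; the
lima_drm_priv/lima_ctx_mgr layout and the idr walk are my assumptions
for illustration, not the actual patch:

static int lima_drm_driver_flush(struct file *filp, fl_owner_t id)
{
	struct drm_file *file = filp->private_data;
	struct lima_drm_priv *priv = file->driver_priv;
	struct lima_ctx_mgr *mgr = &priv->ctx_mgr;	/* assumed layout */
	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
	struct lima_ctx *ctx;
	int cid, i;

	mutex_lock(&mgr->lock);
	idr_for_each_entry(&mgr->handles, ctx, cid) {
		/* consume queued jobs; drm_sched_entity_flush() bails
		 * out early when the calling process is being killed */
		for (i = 0; i < lima_pipe_num && timeout > 0; i++)
			timeout = drm_sched_entity_flush(&ctx->context[i].base,
							 timeout);
	}
	mutex_unlock(&mgr->lock);

	return timeout >= 0 ? 0 : timeout;
}

IIRC lima's fops come from DEFINE_DRM_GEM_FOPS(), so hooking .flush
means spelling the file_operations struct out by hand.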
>
> Thanks for this feedback. So as mentioned in the initial message, I'm
> also trying to understand whether this is the correct fix.
>
> This problem is unfortunately not trivial to reproduce; so far there
> is only one application which can reproduce it, and it is a
> complicated one with multiple contexts and relatively lengthy jobs.
> What I have observed so far is that, as it is right now, a context
> might be destroyed while a job is running (e.g. by killing the
> application at the right time), and the context destroy appears to
> release buffers that are still in use by the running job, which is
> what causes the read faults.
Freeing buffers in context destroy while a job is running is the bug;
we should dig into how it happens, and whether multiple contexts play
a key role.
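
To illustrate the kind of lifetime rule that seems to be missing (all
names below are hypothetical, this is not lima code): if every queued
job held a reference on its context, and through it on the buffers it
uses, the destroy path could never free them under a running job:

#include <linux/kref.h>
#include <linux/slab.h>

struct hypo_ctx {
	struct kref refcnt;
	/* buffers, scheduler entities, ... */
};

static void hypo_ctx_release(struct kref *kref)
{
	struct hypo_ctx *ctx = container_of(kref, struct hypo_ctx, refcnt);

	/* no job can still be using the buffers at this point */
	kfree(ctx);
}

/* submit: the job takes a context reference ... */
static void hypo_job_submit(struct hypo_ctx *ctx)
{
	kref_get(&ctx->refcnt);
	/* ... queue the job to the drm_sched entity ... */
}

/* ... which is dropped only from the scheduler's free_job callback,
 * i.e. after the hardware is done with the job */
static void hypo_job_free(struct hypo_ctx *ctx)
{
	kref_put(&ctx->refcnt, hypo_ctx_release);
}

The context-destroy ioctl would then only drop the file's own
reference, and the buffers would stay alive until the last in-flight
job completes.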
> This added flush made it so that the jobs always finish before the
> context destroy gets called, which prevents the issue for this
> application in my attempts. But I suppose it might also just be that
> the jobs had some more time to finish now. If this is not the correct
> fix, then is there perhaps some more missing synchronization between
> a running job and context destroy in lima?
>
I'm afraid this patch does not fix the root cause; we should find out
the root cause first. This patch could be added as a SIGKILL
improvement later.
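
If the root cause does turn out to be the scheduler entity being torn
down with jobs still in flight, the entity-level teardown is one place
to look. A sketch of that direction (hypothetical helper; whether
lima's context fini already does the equivalent is exactly what needs
checking):

static void ctx_entity_teardown(struct drm_sched_entity *entity)
{
	/*
	 * drm_sched_entity_destroy() is drm_sched_entity_flush() plus
	 * drm_sched_entity_fini(): it waits (bounded, and with the
	 * SIGKILL short-circuit discussed above) for the entity's
	 * queued jobs before finalizing it, instead of relying on a
	 * file-level flush having run first.
	 */
	drm_sched_entity_destroy(entity);
}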
> Thanks
>
> Erico