Message-ID: <CAK4VdL38fkraY2sujHCVnuhLJ8pQ4f-MejUcfsFTbdxusys3Kg@mail.gmail.com>
Date: Mon, 14 Apr 2025 15:19:21 +0200
From: Erico Nunes <nunes.erico@...il.com>
To: Christian König <christian.koenig@....com>
Cc: Qiang Yu <yuq825@...il.com>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, 
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, dri-devel@...ts.freedesktop.org, 
	lima@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] drm/lima: implement the file flush callback

On Thu, Apr 10, 2025 at 4:04 PM Christian König
<christian.koenig@....com> wrote:
>
> > On 10.04.25 at 15:56, Qiang Yu wrote:
> >>>> This prevents applications with multiple contexts from running into a
> >>>> race condition between running tasks and context destroy when
> >>>> terminating.
> >>>>
> > If implementing the flush callback fixes the problem, it must not be the SIGKILL case.
>
> SIGKILL also calls flush (again eventually), but we can detect that in the scheduler code.
>
> See the code and comment in drm_sched_entity_flush(). We could potentially improve the comment because it's not 100% correct, but it should work in all cases.
>
> > Could you describe the problem and how this fix solves it? As I don't understand
> > how the above difference could fix a race bug.
>
> That was the point I wasn't sure about either. It should *not* fix any race, it just gives a nicer SIGKILL experience.

Thanks for this feedback. As mentioned in my initial message, I'm
also trying to understand whether this is the correct fix.

This problem is unfortunately not trivial to reproduce: so far only
one application can trigger it, and it is a complicated one with
multiple contexts and relatively lengthy jobs.
What I have observed so far is that, as things stand, a context can be
destroyed while a job is still running (e.g. by killing the
application at the right moment), and the context destroy appears to
release buffers that are still in use by the running job, which is
what causes the read faults.
With the added flush, the jobs always finished before the context
destroy was called, which prevented the issue for this application in
my attempts. But I suppose it might also just be that the jobs had
more time to finish now. If this is not the correct fix, then perhaps
there is still some missing synchronization between a running job and
context destroy in lima?
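
For reference, a simplified sketch of the idea behind the flush
callback (the helper and field names below are illustrative
assumptions about the lima internals, not necessarily exactly what is
in the patch): flush every scheduler entity of every context on the fd
so queued jobs get a bounded chance to finish before the contexts are
destroyed.

/*
 * Illustrative sketch only. lima_ctx_mgr_flush(), mgr->handles and
 * ctx->context[i].base are assumptions about lima's structures.
 */
static void lima_ctx_mgr_flush(struct lima_ctx_mgr *mgr, long timeout)
{
	struct lima_ctx *ctx;
	unsigned long id;
	int i;

	mutex_lock(&mgr->lock);
	xa_for_each(&mgr->handles, id, ctx) {
		for (i = 0; i < lima_pipe_num; i++) {
			/* drm_sched_entity_flush() returns the remaining
			 * timeout, so the overall wait stays bounded */
			timeout = drm_sched_entity_flush(&ctx->context[i].base,
							 timeout);
		}
	}
	mutex_unlock(&mgr->lock);
}

static int lima_drm_driver_flush(struct file *filp, fl_owner_t id)
{
	struct drm_file *file = filp->private_data;
	struct lima_drm_priv *priv = file->driver_priv;

	/* give queued jobs a bounded chance to finish before release */
	lima_ctx_mgr_flush(&priv->ctx_mgr, msecs_to_jiffies(1000));
	return 0;
}

The key point is that drm_sched_entity_flush() bounds the wait and,
as Christian mentioned, detects the SIGKILL case in the scheduler
code.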

Thanks

Erico
