Date:	Tue, 5 Aug 2014 10:16:38 +0200
From:	Daniel Vetter <daniel@...ll.ch>
To:	Christian König <christian.koenig@....com>
Cc:	Maarten Lankhorst <maarten.lankhorst@...onical.com>,
	airlied@...ux.ie, thellstrom@...are.com,
	nouveau@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
	dri-devel@...ts.freedesktop.org, bskeggs@...hat.com,
	alexander.deucher@....com
Subject: Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

On Mon, Aug 04, 2014 at 07:04:46PM +0200, Christian König wrote:
> >On 04.08.2014 at 17:09, Maarten Lankhorst wrote:
> >>On 04-08-14 17:04, Christian König wrote:
> >>>On 04.08.2014 at 16:58, Maarten Lankhorst wrote:
> >>>>On 04-08-14 16:45, Christian König wrote:
> >>>>>On 04.08.2014 at 16:40, Maarten Lankhorst wrote:
> >>>>>>On 04-08-14 16:37, Christian König wrote:
> >>>>>>>It's a pain to deal with gpu reset.
> >>>>>>Yeah, well that's nothing new.
> >>>>>>
> >>>>>>>I've now tried other solutions, but those would mean reverting to the old style during gpu lockup recovery and only running the delayed work when there is no lockup.
> >>>>>>>That would have made the timeout useless to add, though. I think the cleanest option is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
> >>>>>>The lockup code itself should never call any waiting code, and v2 doesn't seem to handle a couple of cases correctly either.
> >>>>>>
> >>>>>>How about moving the fence waiting out of the reset code?
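Side note: the point about waiting code being reachable during recovery is
easiest to see in code. A minimal sketch (not the actual radeon code; the
in_reset flag is an assumption) of what any such wait path would need to do:

static long fence_wait_sketch(struct radeon_device *rdev,
			      struct radeon_fence *fence,
			      long timeout)
{
	long slice;

	while (timeout > 0) {
		if (radeon_fence_signaled(fence))
			return timeout;
		if (READ_ONCE(rdev->in_reset))	/* assumed flag */
			return -EAGAIN;		/* don't block recovery */
		/* sleep in short slices so the flag is noticed promptly
		 * (early wakeups aren't accounted for; it's a sketch) */
		slice = min(timeout, (long)HZ / 10);
		schedule_timeout_interruptible(slice);
		timeout -= slice;
	}
	return 0;
}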
> >>>>>What cases did I miss then?
> >>>>>
> >>>>>I'm curious how you want to move the fence waiting out of reset when there are so many places that could potentially wait: radeon_ib_get can call radeon_sa_bo_new, which can wait; radeon_ring_alloc can wait on radeon_fence_wait_next; etc.
> >>>>The IB test itself doesn't need to be protected by the exclusive lock, only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
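In other words, something like this? A sketch only: the radeon_* calls are
real, but the exact flow inside the lock is elided and assumed here:

int gpu_reset_sketch(struct radeon_device *rdev)
{
	int r;

	down_write(&rdev->exclusive_lock);
	radeon_save_bios_scratch_regs(rdev);
	/* ... actual asic reset and radeon_ring_restore() here ... */
	radeon_restore_bios_scratch_regs(rdev);
	up_write(&rdev->exclusive_lock);

	/* the IB tests can block on fences, so run them unlocked */
	r = radeon_ib_ring_tests(rdev);
	return r;
}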
> >>>I'm not sure about that: what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?
> >>Just set need_reset again and return -EAGAIN; that should have mostly the same effect as what we are doing right now.
> >Yeah, except for locking the ttm delayed workqueue, but that bool should be easy to save/restore.
> >I think this could work.
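For the failure case, the sketch above would then grow roughly this tail.
needs_reset exists in radeon, and ttm_bo_lock/unlock_delayed_workqueue are
the real TTM helpers; the resched value is exactly the bool that would need
saving across a retried reset:

	int r, resched;

	resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
	/* ... reset under the exclusive lock, as above ... */
	r = radeon_ib_ring_tests(rdev);
	if (r) {
		/* flag another reset instead of retaking the lock */
		rdev->needs_reset = true;
		r = -EAGAIN;
	}
	ttm_bo_unlock_delayed_workqueue(&rdev->mman.bdev, resched);
	return r;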
> 
> Actually you could activate the delayed workqueue much earlier as well.
> 
> Thinking more about it, that sounds like a bug in the current code, because
> we probably want the workqueue activated before waiting for the fence.
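The ordering fix would be small. A sketch; lockup_work and
RADEON_FENCE_JIFFIES_TIMEOUT follow the names in this series and the
existing fence code, and the details may differ:

	/* re-arm the lockup check *before* blocking on the fence */
	queue_delayed_work(system_wq,
			   &rdev->fence_drv[fence->ring].lockup_work,
			   RADEON_FENCE_JIFFIES_TIMEOUT);
	r = wait_event_interruptible_timeout(rdev->fence_queue,
					     radeon_fence_signaled(fence),
					     timeout);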

We've actually had a similar issue on i915: when userspace never waited
for rendering (some shitty userspace drivers did that way back), we never
noticed that the gpu died. So launching the hangcheck/stuck-wait worker
(we have both, too) right away is what we do now.
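For reference, this is roughly what the arming looks like on our side: the
hangcheck gets kicked on every request submission, not only on waits.
i915_queue_hangcheck() is the real helper; the call site here is a
simplified sketch:

static void add_request_sketch(struct drm_device *dev)
{
	/* ... emit the request to the ring ... */

	/* (re)arm the hangcheck timer at submission time, so a dead
	 * gpu is noticed even if nobody ever waits */
	i915_queue_hangcheck(dev);
}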
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch