linux-kernel - [ 57/79] drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20130611195323.607655857@linuxfoundation.org>
Date:	Tue, 11 Jun 2013 13:03:23 -0700
From:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To:	linux-kernel@...r.kernel.org
Cc:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	stable@...r.kernel.org, Chris Wilson <chris@...is-wilson.co.uk>,
	Daniel Vetter <daniel.vetter@...ll.ch>,
	Damien Lespiau <damien.lespiau@...el.com>
Subject: [ 57/79] drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus

3.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Daniel Vetter <daniel.vetter@...ll.ch>

commit 7abb690a0e095717420ba78dcab4309abbbec78a upstream.

Chris Wilson noticed that since

commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
Author: Daniel Vetter <daniel.vetter@...ll.ch>
Date:   Thu Nov 15 17:17:22 2012 +0100

    drm/i915: clear up wedged transitions

X can again get -EIO when it does not expect it. And even worse score
a SIGBUS when accessing gtt mmaps. The established ABI is that we
_only_ return an -EIO from execbuf - all other ioctls should just
work. And since the reset code moves all bos out of gpu domains and
clears out all the last_seqno/ring tracking there really shouldn't be
any reason for non-execbuf code to ever touch the hw and see an -EIO.

After some extensive discussions we've noticed that these spurios -EIO
are caused by i915_gem_wait_for_error:

http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg20540.html

That is easy to fix by returning 0 instead of -EIO, since grabbing the
dev->struct_mutex does not yet mean that we actually want to touch the
hw. And so there is no reason at all to fail with -EIO.

But that's not the entire since, since often (at least it's easily
googleable) dmesg indicates that the reset fails and we declare the
gpu wedged. Then, quite a bit later X wakes up with the "Timed out
waiting for the gpu reset to complete" DRM_ERROR message in
wait_for_errror and brings down the desktop with an -EIO/SIGBUS.

So clearly we're missing a wakeup somewhere, since the gpu reset just
doesn't take 10 seconds to complete. And indeed we're do handle the
terminally wedged state wrong.

Fix this all up.

Reviewed-by: Chris Wilson <chris@...is-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@...ll.ch>
Cc: Damien Lespiau <damien.lespiau@...el.com>
Signed-off-by: Daniel Vetter <daniel.vetter@...ll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>

---
 drivers/gpu/drm/i915/i915_gem.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -91,14 +91,11 @@ i915_gem_wait_for_error(struct i915_gpu_
 {
 	int ret;

-#define EXIT_COND (!i915_reset_in_progress(error))
+#define EXIT_COND (!i915_reset_in_progress(error) || \
+		   i915_terminally_wedged(error))
 	if (EXIT_COND)
 		return 0;

-	/* GPU is already declared terminally dead, give up. */
-	if (i915_terminally_wedged(error))
-		return -EIO;
-
 	/*
 	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
 	 * userspace. If it takes that long something really bad is going on and

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/