linux-kernel - Re: Re: [Intel-gfx] GPU hang with kernel 4.10rc3

Date:   Thu, 11 May 2017 23:08:17 +0200
From:   Pavel Machek <pavel@....cz>
To:     Juergen Gross <jgross@...e.com>
Cc:     Chris Wilson <chris@...is-wilson.co.uk>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        dri-devel@...ts.freedesktop.org,
        intel-gfx <intel-gfx@...ts.freedesktop.org>, airlied@...ux.ie,
        daniel.vetter@...el.com
Subject: Re: Re: [Intel-gfx] GPU hang with kernel 4.10rc3

On Mon 2017-01-23 10:39:27, Juergen Gross wrote:
> On 13/01/17 15:41, Juergen Gross wrote:
> > On 12/01/17 10:21, Chris Wilson wrote:
> >> On Thu, Jan 12, 2017 at 07:03:25AM +0100, Juergen Gross wrote:
> >>> On 11/01/17 18:08, Chris Wilson wrote:
> >>>> On Wed, Jan 11, 2017 at 05:33:34PM +0100, Juergen Gross wrote:
> >>>>> With kernel 4.10rc3 running as Xen dom0 I get at each boot:
> >>>>>
> >>>>> [   49.213697] [drm] GPU HANG: ecode 7:0:0x3d1d3d3d, in gnome-shell [1431], reason: Hang on render ring, action: reset
> >>>>> [   49.213699] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
> >>>>> [   49.213700] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
> >>>>> [   49.213700] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
> >>>>> [   49.213700] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
> >>>>> [   49.213701] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> >>>>> [   49.213755] drm/i915: Resetting chip after gpu hang
> >>>>> [   60.213769] drm/i915: Resetting chip after gpu hang
> >>>>> [   71.189737] drm/i915: Resetting chip after gpu hang
> >>>>> [   82.165747] drm/i915: Resetting chip after gpu hang
> >>>>> [   93.205727] drm/i915: Resetting chip after gpu hang
> >>>>>
> >>>>> The dump is attached.
> >>>>
> >>>> That's a nasty one. The first couple of pages of the batchbuffer appear
> >>>> to be overwritten. (Full of 0xc2c2c2c2, i.e. probably pixel data.) That
> >>>> may be a concurrent write by either the GPU or CPU, or we may have
> >>>> incorrectly mapped a set of pages. That it doesn't recover suggests
> >>>> that the corruption occurs frequently, probably on every request/batch.
> >>>
> >>> I hoped someone would have an idea already.
> >>
> >> Sorry, first report of something like this in a long time (that I can
> >> remember at least). And the problem is that it can be anything from a
> >> coherency to a concurrency issue, so no one patch springs to mind.
> >> Thankfully it appears to be kernel related.
> >> -Chris
> >>
> > 
> > Bisecting took longer than I thought, but I had to cherry pick some
> > patches and rebase one of them multiple times...
> > 
> > Finally I found the commit to blame: 920cf4194954ec ("drm/i915:
> > Introduce an internal allocator for disposable private objects")
> > 
> > In case you need me to produce some more data or test a patch
> > feel free to reach out.
> 
> Anything new for this severe regression?
> 
> Without a fix 4.10 will be unusable with Xen on a machine with i915
> graphics!

Did this get solved?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html