Message-ID: <1409771156.911.23.camel@jlt4.sipsolutions.net>
Date: Wed, 03 Sep 2014 21:05:56 +0200
From: Johannes Berg <johannes@...solutions.net>
To: Daniel Vetter <daniel.vetter@...ll.ch>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Seth Forshee <seth.forshee@...onical.com>,
Emmanuel Grumbach <emmanuel.grumbach@...el.com>,
luca@...lho.fi, kvalo@...rom.com
Subject: Re: [RFC] firmware coredump: add new firmware coredump class

On Wed, 2014-09-03 at 16:19 +0200, Daniel Vetter wrote:
> [super-embarrassing resend, the previous one contained html gunk.]
>
> If the idea is to also convert gpu crash dumps to this we should add
> dri-devel. And there the crashes are usually not due to firmware, but
> because the shaders and command batches userspace submitted have
> issues, so this should also be renamed to dev_coredump I think.
I don't know if the idea is to convert gpu crash dumps - I was just
wondering if you could and would want to use such a generic framework.
If the answer turns out to be no, that's perfectly reasonable I think.
However, renaming seems easy to do anyway :)
> On the overall design I wonder whether this shouldn't work more like a
> real core dump and dump to a real file. At least currently the dumps
> i915 creates are only useful as a general guide to where things went
> wrong, but if we actually want to submit them as traces to the
> hardware people we need to dump a _lot_ more. Otoh with the future of
> shared virtual address spaces between gpu/cpu we might just do a real
> core dump, so maybe this use case should be out of scope for your
> patch here.
I'm not really sure I'd want to actually sys_write() to a file here -
sounds like a big can of worms. If you have direct access (like shared
memory space) it seems we could still use the same mechanisms with the
coredumpm() method, no?
> On the logic itself I'm not sure whether the timeout is all that
> useful - at least in i915 our crash recovery works well enough that
> reporters often don't realize right away when it happened, but only
> later on when looking through logs to explain the tiny corruptions. If
> the crashdump has evaporated meanwhile that's not that useful.
Right. We might want to make it configurable, maybe even in Kconfig. I
was thinking that there would be userspace that would (automatically)
pick it up, and if such userspace doesn't exist or isn't running then
we'd want to free the memory eventually.
> Also, at least for gpus it's usually not interesting to grab
> subsequent dumps: Often the gpu is in a bad mood due to the first
> crash, or it's just a massive row of duplicated dumps. So in i915 we
> only record the first crash and keep it around forever. And tooling
> can still free it by writing to the file. This also ensures that we
> don't waste excessive amounts of memory with crash dumps.
Right, we discussed this but then I completely forgot. I think keeping
the first one is reasonable. If userspace has already picked it up
you'll still get multiple dumps, and may want a policy for that as well.
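
To make the combined policy concrete (keep only the first dump, let
userspace clear it, and optionally expire it after a timeout), here is a
rough user-space model. This is only a sketch: the struct and function
names (dump_slot, slot_store, slot_clear, slot_expire) are made up for
illustration and are not from the patch, and time is a plain tick
counter rather than jiffies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of one per-device dump slot (not the patch's API). */
struct dump_slot {
	void *data;            /* owned copy of the dump, NULL when empty */
	size_t len;            /* length of the stored dump in bytes */
	unsigned long stamp;   /* tick at which the dump was stored */
	unsigned long timeout; /* ticks until expiry, 0 means keep forever */
};

/* Store a dump only if the slot is empty, so the first crash wins. */
static bool slot_store(struct dump_slot *s, const void *buf, size_t len,
		       unsigned long now)
{
	if (s->data)
		return false;  /* a dump is already pending, drop this one */

	s->data = malloc(len ? len : 1);
	if (!s->data)
		return false;

	memcpy(s->data, buf, len);
	s->len = len;
	s->stamp = now;
	return true;
}

/* Models userspace writing to the file to free the dump. */
static void slot_clear(struct dump_slot *s)
{
	free(s->data);
	s->data = NULL;
	s->len = 0;
}

/* Models the timeout: free the dump if nobody picked it up in time. */
static void slot_expire(struct dump_slot *s, unsigned long now)
{
	if (s->data && s->timeout && now - s->stamp >= s->timeout)
		slot_clear(s);
}
```

With timeout set to 0 this behaves like what you describe for i915 (the
first dump stays until tooling clears it); a non-zero timeout gives the
"free it eventually" behaviour.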
> And if we want to use this for i915 we need some way for tools to go
> from the i915 drm class device node to the error state, not just from
> the error state back to the device.
Interesting. That's probably not all that difficult to do (maybe even
set up a child/parent relationship?) but I actually wanted to avoid a
hard dependency since there may be cases where the failing device
disappears, e.g. in the case of USB. I have to think about this case
more, I guess.
johannes