linux-kernel - Re: [RFC] firmware coredump: add new firmware coredump class

Message-ID: <1409771156.911.23.camel@jlt4.sipsolutions.net>
Date:	Wed, 03 Sep 2014 21:05:56 +0200
From:	Johannes Berg <johannes@...solutions.net>
To:	Daniel Vetter <daniel.vetter@...ll.ch>
Cc:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Seth Forshee <seth.forshee@...onical.com>,
	Emmanuel Grumbach <emmanuel.grumbach@...el.com>,
	luca@...lho.fi, kvalo@...rom.com
Subject: Re: [RFC] firmware coredump: add new firmware coredump class

On Wed, 2014-09-03 at 16:19 +0200, Daniel Vetter wrote:
> [super-embarrassing resend, the previous one contained html gunk.]
> 
> If the idea is to also convert gpu crash dumps to this we should add
> dri-devel. And there the crashes are usually not due to firmware, but
> because the shaders and command batches userspace submitted have
> issues, so this should also be renamed to dev_coredump I think.

I don't know if the idea is to convert gpu crash dumps - I was just
wondering if you could and would want to use such a generic framework.
If the answer turns out to be no, that's perfectly reasonable I think.

However, renaming seems easy to do anyway :)

> On the overall design I wonder whether this shouldn't work more like a
> real core dump and dump to a real file. At least currently the dumps
> i915 creates are only useful as a general guide to where things went
> wrong, but if we actually want to submit them as traces to the
> hardware people we need to dump a _lot_ more. Otoh with the future of
> shared virtual address spaces between gpu/cpu we might just do a real
> core dump, so maybe this use case should be out of scope for your
> patch here.

I'm not really sure I'd want to actually sys_write() to a file here -
sounds like a big can of worms. If you have direct access (like shared
memory space) it seems we could still use the same mechanisms with the
coredumpm() method, no?

> On the logic itself I'm not sure whether the timeout is all that
> useful - at least in i915 our crash recovery works well enough that
> reporters often don't realize right away when it happened, but only
> later on when looking through logs to explain the tiny corruptions. If
> the crashdump has evaporated meanwhile that's not that useful.

Right. We might want to make it configurable, maybe even in Kconfig. I
was thinking that there would be userspace that would (automatically)
pick it up, and if such userspace doesn't exist or isn't running then
we'd want to free the memory eventually.
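
To make the timeout idea concrete, here is a rough userspace sketch of a
dump that is freed if nobody collects it in time. It is not code from the
RFC: struct dump_slot, dump_store(), dump_expire() and the
DUMP_TIMEOUT_SECS value are all hypothetical names chosen for illustration,
and the constant only stands in for whatever Kconfig option or other knob
would make the timeout configurable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a configurable (e.g. Kconfig) timeout. */
#define DUMP_TIMEOUT_SECS 300

struct dump_slot {
	void *data;        /* owned copy of the dump, NULL when empty */
	size_t len;
	time_t stored_at;  /* when the dump was stored */
};

/* Store a private copy of the dump; returns false if allocation fails. */
static bool dump_store(struct dump_slot *slot, const void *data,
		       size_t len, time_t now)
{
	void *copy;

	assert(slot != NULL);
	copy = malloc(len ? len : 1);
	if (!copy)
		return false;
	memcpy(copy, data, len);

	free(slot->data);  /* replace any dump still held */
	slot->data = copy;
	slot->len = len;
	slot->stored_at = now;
	return true;
}

/*
 * Free the dump if it has been held for at least the timeout without
 * being collected. Returns true if memory was released.
 */
static bool dump_expire(struct dump_slot *slot, time_t now)
{
	assert(slot != NULL);
	if (!slot->data)
		return false;
	if (difftime(now, slot->stored_at) < DUMP_TIMEOUT_SECS)
		return false;

	free(slot->data);
	slot->data = NULL;
	slot->len = 0;
	return true;
}
```

In a driver the expiry check would presumably run from delayed work rather
than being polled, but the policy question is the same: how long to hold
the memory when no userspace is listening.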

> Also, at least for gpus it's usually not interesting to grab
> subsequent dumps: Often the gpu is in a bad mood due to the first
> crash, or it's just a massive row of duplicated dumps. So in i915 we
> only record the first crash and keep it around forever. And tooling
> can still free it by writing to the file. This also ensures that we
> don't waste excessive amounts of memory with crash dumps.

Right, we discussed this but then I completely forgot. I think keeping
the first one is reasonable. If userspace has already picked it up
you'll still get multiple and maybe want to have a policy there as well.
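
The "keep only the first dump until it is released" policy described in the
quoted text could be sketched as below. Again this is a userspace
illustration under assumptions, not i915 or RFC code: struct first_dump,
first_dump_record() and first_dump_release() are hypothetical names, and
first_dump_release() merely models tooling writing to the dump file to
free it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct first_dump {
	void *data;            /* NULL while no dump is held */
	size_t len;
	unsigned int dropped;  /* crashes ignored while a dump is held */
};

/*
 * Record a dump only if none is currently held. Later crashes are
 * counted but not stored, so memory use stays bounded to one dump.
 */
static bool first_dump_record(struct first_dump *fd, const void *data,
			      size_t len)
{
	void *copy;

	assert(fd != NULL);
	if (fd->data) {
		fd->dropped++;
		return false;
	}

	copy = malloc(len ? len : 1);
	if (!copy)
		return false;
	memcpy(copy, data, len);
	fd->data = copy;
	fd->len = len;
	return true;
}

/* Models userspace releasing the dump, which re-arms recording. */
static void first_dump_release(struct first_dump *fd)
{
	assert(fd != NULL);
	free(fd->data);
	fd->data = NULL;
	fd->len = 0;
	fd->dropped = 0;
}
```

Once userspace has released the first dump, the next crash is recorded
again, which is where an additional policy (rate limiting, a cap on the
number of dumps) might still be wanted.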

> And if we want to use this for i915 we need some way for tools to go
> from the i915 drm class device node to the error state, not just from
> the error state back to the device.

Interesting. That's probably not all that difficult to do (maybe even
set up a child/parent relationship?) but I actually wanted to avoid a
hard dependency since there may be cases where the failing device
disappears, e.g. in the case of USB. I have to think about this case
more, I guess.

johannes