Message-ID: <20111005124113.GB509@gere.osrc.amd.com>
Date:	Wed, 5 Oct 2011 14:41:14 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	"K.Prasad" <prasad@...ux.vnet.ibm.com>
Cc:	"Eric W. Biederman" <ebiederm@...ssion.com>,
	linux-kernel@...r.kernel.org, crash-utility@...hat.com,
	kexec@...ts.infradead.org, Vivek Goyal <vgoyal@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	"Luck, Tony" <tony.luck@...el.com>, anderson@...hat.com,
	tachibana@....nes.nec.co.jp, oomichi@....nes.nec.co.jp
Subject: Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type
 NT_NOCOREDUMP to capture slimdump

On Wed, Oct 05, 2011 at 03:17:27PM +0530, K.Prasad wrote:
> We don't want to capture memory dump when the machine crashes due to
> faulty cache, because the end-user derives no benefit by receiving a
> bulky vmcore and running crash analysis tools over them. Instead a
> 'slimdump' that contains a meaningful message about the origin of crash
> (and which can be understood by his analysis tools) would be better, or
> so I thought.

Ok, this makes sense: a meaningful message along with the MCE decoded
properly in user-friendly language, so that one can understand why the
system has not captured a vmcore.

> There are possibly several hardware errors which cause system crash and
> the kdump would capture full vmcore, although it doesn't make sense (I
> wouldn't have cared about the second example, you cited, if they did not
> generate MCE, but a different exception). In an ideal situation, each of
> these error paths would 'subscribe' to slimdump and add a meaningful
> message in the NT_NOCOREDUMP note instead of letting the user-space copy
> the old kernel memory.

Yep, I see.

> Fine with me. I see that the various IA32_MCi_Status registers will hold
> information about the error and use that to classify MCEs.
> 
> I think the best way to go about it is to retain NT_NOCOREDUMP for non-DRAM
> errors also, but use the note-name field in the elf-note and distinguish the
> various types of errors...say, by using names such as "PANIC_MCE_DRAM",
> "PANIC_MCE_CACHE", etc (similar to the error codes described in the Intel
> manual). The upstream tools like 'makedumpfile' and 'crash' will have to
> be taught to parse the elf-note name and act accordingly.
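
To make the note-name idea above concrete, here is a rough userspace sketch
of how a tool could build and then classify such a note. It is an assumption
about how this might look, not code from the patch: the numeric value
NT_NOCOREDUMP_EXAMPLE, the enum slim_reason, and the helpers build_note()
and classify_note() are all made up for illustration. Only the note layout
(namesz/descsz/type header, 4-byte padded name and descriptor) and the two
name strings come from ELF and from the proposal quoted above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical placeholder: the real value is whatever the patch assigns. */
#define NT_NOCOREDUMP_EXAMPLE 0x4e434400u

/* Name and descriptor are each padded to a 4-byte boundary in an ELF note. */
#define NOTE_ALIGN(x) (((x) + 3u) & ~3u)

struct note_hdr {          /* same field layout as Elf64_Nhdr */
	uint32_t n_namesz; /* length of name, including the trailing NUL */
	uint32_t n_descsz; /* length of descriptor (here: a text message) */
	uint32_t n_type;
};

enum slim_reason { SLIM_NONE, SLIM_MCE_DRAM, SLIM_MCE_CACHE, SLIM_UNKNOWN };

/* Write one note into buf; returns bytes used, or 0 if buf is too small. */
static size_t build_note(void *buf, size_t len, const char *name,
			 const char *msg)
{
	struct note_hdr h;
	size_t total;

	h.n_namesz = (uint32_t)strlen(name) + 1;
	h.n_descsz = (uint32_t)strlen(msg) + 1;
	h.n_type = NT_NOCOREDUMP_EXAMPLE;

	total = sizeof(h) + NOTE_ALIGN(h.n_namesz) + NOTE_ALIGN(h.n_descsz);
	if (total > len)
		return 0;

	memset(buf, 0, total);
	memcpy(buf, &h, sizeof(h));
	memcpy((char *)buf + sizeof(h), name, h.n_namesz);
	memcpy((char *)buf + sizeof(h) + NOTE_ALIGN(h.n_namesz), msg,
	       h.n_descsz);
	return total;
}

/* Decide what kind of slimdump a note describes, based on its name. */
static enum slim_reason classify_note(const void *buf, size_t len)
{
	struct note_hdr h;
	const char *name;

	if (len < sizeof(h))
		return SLIM_NONE;
	memcpy(&h, buf, sizeof(h));
	if (h.n_type != NT_NOCOREDUMP_EXAMPLE)
		return SLIM_NONE;
	if (h.n_namesz == 0 || h.n_namesz > len - sizeof(h))
		return SLIM_NONE;

	name = (const char *)buf + sizeof(h);
	if (name[h.n_namesz - 1] != '\0')
		return SLIM_NONE;

	if (strcmp(name, "PANIC_MCE_DRAM") == 0)
		return SLIM_MCE_DRAM;
	if (strcmp(name, "PANIC_MCE_CACHE") == 0)
		return SLIM_MCE_CACHE;
	return SLIM_UNKNOWN; /* right type, but a name we do not know */
}
```

Keeping one note type and switching on the name means that adding a new
error class later needs only a new string, not a new note number.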

Right, so Valdis had the right question in the other mail, let me
generalize it here: does it ever make sense to save vmcore on a hardware
error?

With DRAM errors, you could probably use the additional info coming with
the MCE to decode the physical address, map it back to the DIMM and
swap it. Any other use cases?
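
For illustration, a rough sketch of the first half of that (classify the
MCE and pull out the physical address) could look like the code below. The
bit positions and error-code patterns are written from my reading of the
Intel SDM machine-check chapter and should be checked against it; the names
enum mce_class, mce_classify() and mce_dram_addr() are invented for this
example. Mapping the address back to a DIMM is platform-specific and is not
attempted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS flag bits (assumed per the Intel SDM layout). */
#define MCI_STATUS_VAL   (1ULL << 63) /* register contents are valid */
#define MCI_STATUS_ADDRV (1ULL << 58) /* IA32_MCi_ADDR holds an address */

enum mce_class { MCE_INVALID, MCE_DRAM, MCE_CACHE, MCE_OTHER };

/* Classify a bank by the MCA error code in STATUS bits 15:0. */
static enum mce_class mce_classify(uint64_t status)
{
	uint16_t code;

	if (!(status & MCI_STATUS_VAL))
		return MCE_INVALID;

	code = (uint16_t)(status & 0xffff);
	code &= (uint16_t)~0x1000u; /* ignore bit 12 (corrected-error filtering) */

	if ((code & 0xff80) == 0x0080) /* 0000 0000 1MMM CCCC: memory controller */
		return MCE_DRAM;
	if ((code & 0xff00) == 0x0100) /* 0000 0001 RRRR TTLL: cache hierarchy */
		return MCE_CACHE;
	return MCE_OTHER;
}

/*
 * For a DRAM error with a valid address, hand back the value read from
 * IA32_MCi_ADDR (passed in as addr_reg). Returns false otherwise.
 */
static bool mce_dram_addr(uint64_t status, uint64_t addr_reg, uint64_t *phys)
{
	if (mce_classify(status) != MCE_DRAM)
		return false;
	if (!(status & MCI_STATUS_ADDRV))
		return false;
	*phys = addr_reg;
	return true;
}
```

Only the MCE_DRAM case yields something a user can act on directly, which
is the point of the question above.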

Thanks.

-- 
Regards/Gruss,
Boris.