linux-kernel - Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type NT

Date:	Wed, 5 Oct 2011 09:31:11 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	"K.Prasad" <prasad@...ux.vnet.ibm.com>
Cc:	"Eric W. Biederman" <ebiederm@...ssion.com>,
	linux-kernel@...r.kernel.org, crash-utility@...hat.com,
	kexec@...ts.infradead.org, Vivek Goyal <vgoyal@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	"Luck, Tony" <tony.luck@...el.com>, anderson@...hat.com,
	tachibana@....nes.nec.co.jp, oomichi@....nes.nec.co.jp
Subject: Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type
 NT_NOCOREDUMP to capture slimdump

On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
> > Well, there are MCE types for which we need to panic but we don't
> > necessarily corrupt memory. Your approach is to unconditionally avoid
> > dumping core whenever we panic while you should look at the MCE
> > signature and decide then whether to capture crashed kernel memory or
> > not.
> > 
> > For example, if the MCE signature says UC DRAM error, then you can
> > be pretty sure that there is a landmine somewhere in the DRAM region
> > mapping the crashed kernel. If it is, say, a UC when doing data fills
> > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
> > even in the first case, you can evaluate the MCi_ADDR reported with the
> > UC DRAM error and simply skip that particular cacheline when dumping the
> > core instead of not capturing anything at all.
> > 
> 
> True. As I stated earlier, there could be two possible outcomes
> from capturing a memory dump in such cases: it is either dangerous or
> it doesn't make sense.

Why? In the second example, the only corruption is in the L2 cache, so
your memory image is intact. Why wouldn't you want to capture a memory
dump then? It is business as usual in that case.

> It is best to avoid a normal kdump in both cases,
> although the elf-note doesn't distinguish between the two.
> 
> NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
> a framework where different code paths that lead to panic() can
> 'opt-out' from kdump by adding an elf-note.
> 
> We can modify this to add more fine-grained messages using different elf-note
> types (or use the elf-note name under the NT_NOCOREDUMP type) to
> indicate the cause/type of crash.
> 
> I'd like to hear further from you and the rest of the community to see if
> there's a need felt for such a change.

I'd make this conditional on whether you have had memory corruption or
not by evaluating MCE signatures and acting accordingly.
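
A minimal sketch of what such a check could look like, in plain C rather
than actual kernel code. The MCi_STATUS flag bits (VAL, UC, ADDRV) are
architectural, but everything else here is an assumption made for
illustration: the helper names, the dump_policy enum, the 64-byte
cacheline size, and the Intel-style compound error code pattern used to
recognize a memory controller error. Other vendors encode the signature
differently, so the classification would have to be adapted per platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS flag bits (architectural on x86). */
#define MCI_STATUS_VAL    (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC     (1ULL << 61)  /* error was uncorrected */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* MCi_ADDR holds a valid address */

/*
 * Assumption: Intel-style compound error codes, where a memory
 * controller error has the form 000F 0000 1MMM CCCC in bits 15:0.
 * Bit 12 (F) is ignored by the mask.
 */
#define MCACOD_MEMCTRL_MASK  0xEF80ULL
#define MCACOD_MEMCTRL       0x0080ULL

/* Assumption: 64-byte cachelines. */
#define CACHELINE_SIZE  64ULL

/* Hypothetical policy names, not an existing kernel interface. */
enum dump_policy {
	DUMP_FULL,       /* memory image is intact, dump as usual */
	DUMP_SKIP_LINE,  /* dump everything except the poisoned cacheline */
	DUMP_NONE        /* DRAM error at an unknown address, do not dump */
};

struct dump_decision {
	enum dump_policy policy;
	uint64_t skip_start;  /* physical address of the range to skip */
	uint64_t skip_len;    /* length of that range in bytes */
};

/* True if the signature says "valid, uncorrected, memory controller". */
static bool mce_is_uc_dram(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
		return false;
	return (status & MCACOD_MEMCTRL_MASK) == MCACOD_MEMCTRL;
}

/* Decide how to dump based on one bank's MCi_STATUS and MCi_ADDR. */
static struct dump_decision decide_dump(uint64_t status, uint64_t addr)
{
	struct dump_decision d = { DUMP_FULL, 0, 0 };

	/* Anything but a UC DRAM error leaves the memory image intact. */
	if (!mce_is_uc_dram(status))
		return d;

	/* Without a valid address we cannot tell where the landmine is. */
	if (!(status & MCI_STATUS_ADDRV)) {
		d.policy = DUMP_NONE;
		return d;
	}

	/* Skip only the cacheline that contains the reported address. */
	d.policy = DUMP_SKIP_LINE;
	d.skip_start = addr & ~(CACHELINE_SIZE - 1);
	d.skip_len = CACHELINE_SIZE;
	return d;
}
```

With this shape, a UC error during a cache fill yields DUMP_FULL, a UC
DRAM error with a valid MCi_ADDR yields DUMP_SKIP_LINE for that one
cacheline, and only a UC DRAM error without an address falls back to not
dumping at all. That last fallback is a conservative choice made for the
sketch, not something settled in this thread.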

> > Btw, the doublefault example you give above - is this something you
> > experience on real hardware or just a theoretical thing?
> >
> 
> Unfortunately, I still haven't been able to try injecting memory errors
> and study the behaviour (I'm trying to get access to a machine with
> appropriate firmware). I'll have a reply to this after some experiments
> with memory error injection.

Right, this might be much more helpful than theoretical discussions on
what to do. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.