Message-ID: <20111012104429.GA11983@gere.osrc.amd.com>
Date: Wed, 12 Oct 2011 12:44:29 +0200
From: Borislav Petkov <bp@...en8.de>
To: "K.Prasad" <prasad@...ux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@...hat.com>, linux-kernel@...r.kernel.org,
crash-utility@...hat.com, kexec@...ts.infradead.org,
Andi Kleen <andi@...stfloor.org>,
"Luck, Tony" <tony.luck@...el.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>, anderson@...hat.com,
tachibana@....nes.nec.co.jp, oomichi@....nes.nec.co.jp,
Valdis.Kletnieks@...edu, Nick Bowler <nbowler@...iptictech.com>
Subject: Re: [Patch 1/4][kernel][slimdump] Add new elf-note of type
NT_NOCOREDUMP to capture slimdump
On Wed, Oct 12, 2011 at 12:14:34AM +0530, K.Prasad wrote:
> The MC4_CTL_MASK doesn't appear to be defined in the kernel. Looking at
> http://support.amd.com/us/Processor_TechDocs/26094.PDF, Page 196, it
> states that "This register is typically programmed by BIOS and not by
> the Kernel software".
Oh, this is K8 BKDG, thus pretty old. For AMD docs, you could use
developer.amd.com, and more specifically
http://developer.amd.com/documentation/Pages/default.aspx
So if we look at the F10h manual:
http://support.amd.com/us/Processor_TechDocs/31116.pdf
there's this section "2.12.1.2.1 Machine Check Error Logging and
Reporting" on p. 167 which explains all the modalities around switching
MCE on/off.
And if you clear CR4.MCE, the machine would shut down on a fatal MCE as
an additional precaution when running software which doesn't support
MCE (fully) but you still don't want to corrupt your data: "If error
reporting is enabled but CR4.MCE is disabled, a reportable error will
cause the system to enter shutdown."
Thus clearing the MCi_CTL_MASK bit should help you.
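To make the rule quoted from the manual concrete, here is a small
userspace model of it. This is only a sketch, not kernel code: the enum
and classify_mce() are made-up names, and the "reporting disabled"
branch is my reading of the BKDG rather than something quoted above.
CR4.MCE being bit 6 of CR4 is the only hardware fact used.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4.MCE is bit 6 of CR4 on x86. */
#define CR4_MCE (UINT64_C(1) << 6)

/* Hypothetical names, for illustration only. */
enum mce_outcome {
    MCE_NOT_REPORTED,  /* reporting masked off: assumed no #MC, no shutdown */
    MCE_EXCEPTION,     /* #MC is delivered to the handler */
    MCE_SHUTDOWN       /* "the system enters shutdown" */
};

/*
 * Model of the quoted rule: a reportable error with CR4.MCE clear
 * leads to shutdown instead of a machine check exception.
 */
static enum mce_outcome classify_mce(bool reporting_enabled, uint64_t cr4)
{
    if (!reporting_enabled)
        return MCE_NOT_REPORTED;
    return (cr4 & CR4_MCE) ? MCE_EXCEPTION : MCE_SHUTDOWN;
}
```

So with reporting left enabled and CR4.MCE cleared, the model returns
MCE_SHUTDOWN, which is the "additional precaution" case.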
> So, in any case we may not be able to disable machine-check exceptions
> (MCEs) only within the context of kexec'ed kernel. Let me know if I've
> missed something here.
I'm not sure it is advisable to completely disable MCA for the whole
duration of the image dumping, especially on a system which has already
booted into the second kernel due to an MCE.
> > But, regardless, according to Vivek, the "makedumpfile" tool should be
> > able to jump over poisoned pages and you don't need all the hoopla above
> > at all, right?
> >
>
> In short, the answer is yes. We could add a new string, say
> "CRASH_REASON=PANIC_MCE" to VMCOREINFO elf-note which can be parsed by
> 'makedumpfile' and get away without adding the new NT_NOCOREDUMP
> elf-note. Parsing through the log_buf to lookout for panic string from
> inside 'makedumpfile' appears to be a clumsy solution though.
Why? 'makedumpfile' reportedly supports some dmesg parsing already, so
why would you need additional functionality when it can be done with
in-house means? Maybe Vivek should comment on whether this makes
sense, but I'm basically reiterating what he said.
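For illustration, the dmesg check could be as small as the sketch
below. It is not taken from makedumpfile; log_shows_panic_reason() is a
hypothetical helper, and the marker string you pass in (something like
"Machine check") is an assumption about what the MCE panic path prints.
The only thing it relies on is the usual "Kernel panic - not syncing: "
prefix that panic() emits.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if any "Kernel panic - not syncing: <reason>" line in the
 * log text has <reason> containing the given marker. The marker must sit
 * on the panic line itself, not somewhere else in the log.
 */
static bool log_shows_panic_reason(const char *log, const char *marker)
{
    static const char prefix[] = "Kernel panic - not syncing: ";
    const size_t mlen = strlen(marker);
    const char *p = log;

    while ((p = strstr(p, prefix)) != NULL) {
        const char *reason = p + sizeof(prefix) - 1;
        const char *eol = strchr(reason, '\n');
        size_t len = eol ? (size_t)(eol - reason) : strlen(reason);

        /* Bounded substring search within the panic line only. */
        for (size_t i = 0; mlen && i + mlen <= len; i++)
            if (memcmp(reason + i, marker, mlen) == 0)
                return true;

        p = reason; /* keep scanning after this panic line's prefix */
    }
    return false;
}
```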
> i) Scenario1: System crashes because of a fatal MCE
>
> Proposed Solution: Add a new string in the VMCOREINFO elf-note from
> within the MCE panic path to indicate cause of crash. 'makedumpfile'
> recognises this string to collect a slimdump instead of the normal dump.
See above.
> ii) Scenario2: System with PG_hwpoison (or landmine!) pages crashes because
> of a software bug. In this case, kexec kernel would normally reboot because
> of reading the PG_poison page. I'll soon get a new version of the patchset
> implementing this.
>
> Solution: Maintain a linked list of PFNs when the corresponding 'struct page'
> has been marked PG_hwpoison. We could export/put this list to use in
> quite a few ways.
Let me stop you right there: again, according to Vivek:
http://marc.info/?l=kexec&m=131805679405076&w=2
makedumpfile can iterate over the struct page arrays and skip over
PG_hwpoison pages. I think this should be enough functionality.
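Roughly, the filtering amounts to something like the sketch below. It
is a toy model under stated assumptions, not makedumpfile's code:
struct page_model and collect_dumpable_pfns() are made-up names, and
the bit position of PG_hwpoison is passed in as a parameter because it
depends on the kernel configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for 'struct page': only the flags word matters here. */
struct page_model {
    uint64_t flags;
};

/*
 * Walk the page array and record the PFNs that are safe to read,
 * skipping every page whose hwpoison bit is set. Returns the number
 * of PFNs written to out_pfns (which must hold at least npages entries).
 */
static size_t collect_dumpable_pfns(const struct page_model *pages,
                                    size_t npages,
                                    unsigned int hwpoison_bit,
                                    uint64_t *out_pfns)
{
    const uint64_t mask = UINT64_C(1) << hwpoison_bit;
    size_t n = 0;

    for (size_t pfn = 0; pfn < npages; pfn++) {
        if (pages[pfn].flags & mask)
            continue; /* poisoned: never touch the page contents */
        out_pfns[n++] = pfn;
    }
    return n;
}
```

The point is that the dump tool only ever looks at the flags word for a
poisoned page and never reads the page itself, so no extra list of PFNs
has to be maintained by the first kernel.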
--
Regards/Gruss,
Boris.