Message-id: <20130814210532.18fd280b@concha.lan>
Date: Wed, 14 Aug 2013 21:05:32 -0300
From: Mauro Carvalho Chehab <m.chehab@...sung.com>
To: Borislav Petkov <bp@...en8.de>
Cc: "Luck, Tony" <tony.luck@...el.com>,
"Naveen N. Rao" <naveen.n.rao@...ux.vnet.ibm.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"rjw@...k.pl" <rjw@...k.pl>,
"lance.ortiz@...com" <lance.ortiz@...com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/3] mce: acpi/apei: trace: Enable ghes memory error trace
event
On Wed, 14 Aug 2013 07:43:22 +0200
Borislav Petkov <bp@...en8.de> wrote:
> On Tue, Aug 13, 2013 at 08:13:56PM +0000, Luck, Tony wrote:
> > Generic tracepoints are architected to be able to fire at very high
> > rates and log huge amounts of information. So we'd need something
> > special to say just log these special tracepoints to network/serial.
> >
> > > Which reminds me, pstore could also be a good thing to use, in addition.
> > > Only put error info there as it is limited anyway.
> >
> > Yes - space is very limited. I don't know how to assign priority for logging
> > the dmesg data vs. some error logs.
>
> Didn't we say at some point, "log only the panic message which kills
> the machine"?
The EDAC core allows that kind of thing, and can even panic when errors arrive:
$ modinfo edac_core
filename: /lib/modules/3.10.5-201.fc19.x86_64/kernel/drivers/edac/edac_core.ko
...
parm: edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm: edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm: edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm: edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)
Those have 644 permission, so they can be changed at runtime.
Of course, there is room for improvement.
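
As a rough sketch of what "changed at runtime" means in practice: module
parameters with 644 permission are normally exposed as files under
/sys/module/<module>/parameters/, so they can be read and written without
reloading the module. The path below and the helper names (read_param,
write_param) are assumptions made for illustration, not something taken from
the EDAC code; writing needs root on a real system.

```python
import os

# Assumed location: the usual sysfs directory for edac_core parameters.
SYSFS_PARAMS = "/sys/module/edac_core/parameters"


def read_param(name, base=SYSFS_PARAMS):
    """Return the current integer value of a module parameter."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip())


def write_param(name, value, base=SYSFS_PARAMS):
    """Set an on/off module parameter (hypothetical helper, needs root)."""
    if value not in (0, 1):
        raise ValueError("these switches only accept 0 (off) or 1 (on)")
    with open(os.path.join(base, name), "w") as f:
        f.write(f"{value}\n")


if __name__ == "__main__":
    # Example: make the machine panic on uncorrected memory errors.
    write_param("edac_mc_panic_on_ue", 1)
    print(read_param("edac_mc_panic_on_ue"))
```

The base argument exists only so the helpers can be pointed at a scratch
directory for testing instead of the real sysfs tree.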
> However, we could probably make more use of the messages before that
> catastrophic event because they could give us hints about what led to
> the panic but in that case maybe a limited pstore is the wrong logging
> medium.
>
> Actually, I can imagine the full serial/network logs of "special"
> tracepoints + dmesg to be the optimal thing.
>
> > If we just "printk()" the most important parts - then that data will
> > automatically flow to the serial console and to pstore.
>
> Actually, does the pstore act like a circular buffer? Because if it
> contains the last N relevant messages (for an arbitrary definition of
> relevant) before the system dies, then that could be more helpful than only
> the error messages.
>
> And with the advent of UEFI, pretty much every system has a pstore. Too
> bad that we have to limit it to 50% of size so that the boxes don't
> brick. :-P
>
> > Then we have multiple paths for the critical bits of the error log
> > - and the tracepoints give us more details for the cases where the
> > machine doesn't spontaneously explode.
>
> Ok, let's sort:
>
> * First we have the not-so-critical hw error messages. We want to carry
> those out-of-band, i.e. not in dmesg so that people don't have to parse
> and collect dmesg but have a specialized solution which gives them
> structured logs and tools can analyze, collect and ... those errors.
>
> * When a critical error happens, the above usage is not necessarily
> advantageous anymore in the sense that, in order to debug what caused
> the machine to crash, we don't necessarily want only the crash
> message but also the whole system activity that led to it.
>
> In which case, we probably actually want to turn off/ignore the error
> logging tracepoints and write *only* to dmesg which goes out over serial
> and to pstore. Right?
>
> Because in such cases I want to have *all* *relevant* messages that led
> to the explosion + the explosion message itself.
>
> Makes sense? Yes, no? Aspects I've missed?
Makes sense to me.
>
> Thanks.
>
--
Cheers,
Mauro