Message-id: <20131016085558.19fe143a@samsung.com>
Date: Wed, 16 Oct 2013 08:55:58 -0300
From: Mauro Carvalho Chehab <m.chehab@...sung.com>
To: Borislav Petkov <bp@...en8.de>
Cc: "Naveen N. Rao" <naveen.n.rao@...ux.vnet.ibm.com>,
"Chen, Gong" <gong.chen@...ux.intel.com>, tony.luck@...el.com,
linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
Aristeu Rozanski Filho <arozansk@...hat.com>,
Steven Rostedt <srostedt@...hat.com>
Subject: Re: [PATCH 8/8] ACPI / trace: Add trace interface for eMCA driver
On Wed, 16 Oct 2013 12:42:21 +0200
Borislav Petkov <bp@...en8.de> wrote:
> On Wed, Oct 16, 2013 at 07:35:39AM -0300, Mauro Carvalho Chehab wrote:
> > Well, try to write some code on userspace to discover what's the error.
> >
> > An error threshold mechanism on userspace will only work if userspace
> > knows that the error belongs to the same DIMM.
>
> Just read the first mail again:
>
> <idle>-0 [000] d.h. 56068.488759: extlog_mem_event: 3 corrected errors:unknown on Memriser1 CHANNEL A DIMM 0(FRU: 00000000-0000
> -0000-0000-000000000000 physical addr: 0x0000000851fe0000 node: 0 card: 0 module: 0 rank: 0 bank: 0 row: 28927 column: 1296)

In that log, "physical addr: 0x0000000851fe0000 node: 0 card: 0 module: 0
rank: 0 bank: 0 row: 28927 column: 1296" is a plain string, not a
hierarchical position like the one EDAC provides.

Worse than that, not all of the data may be available, as CPER allows
some fields to be omitted.

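To illustrate what a userspace consumer has to do with such a string, here
is a minimal sketch. It assumes the location keeps the "key: value" form
seen in the trace line above; the function name parse_location and the
field list are hypothetical and not taken from any existing tool.

```python
import re

# Field names as they appear in the sample trace line. Treating this as the
# complete set is an assumption made only for this sketch.
FIELDS = ("node", "card", "module", "rank", "bank", "row", "column")


def parse_location(text):
    """Extract location fields from the free-form location string.

    Fields that are absent from the string are returned as None, so the
    caller can tell "not reported" apart from a real value of zero.
    """
    loc = {}
    for name in FIELDS:
        m = re.search(r"\b%s: (\d+)" % name, text)
        loc[name] = int(m.group(1)) if m else None
    m = re.search(r"physical addr: (0x[0-9a-fA-F]+)", text)
    loc["physical_addr"] = int(m.group(1), 16) if m else None
    return loc
```

Any change in the wording of the string breaks a parser like this, and a
field that comes back as None leaves the position only partly known.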
Also, I suspect that if an error affects more than one DIMM
(e.g. part of the location is not available for a given error),
the DIMM label will not be shown properly either.

Also, writing a userspace counterpart that works properly is
extremely hard if the memory layout is not known in advance. So, in
practice, if the above memory error is reported, all userspace will
likely be able to do is store it and require someone to manually
identify what's happening.

On the other hand, if the node, channel and DIMM number information is
properly filled in (as happens with EDAC), userspace can rely on that
data to apply per-DIMM, per-channel and per-node thresholds.
It may even use the physical address to identify whether the problem is
confined to a certain region of a physical DIMM and poison that region
for as long as the damaged component cannot be replaced.

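As a rough sketch of that kind of threshold logic, assuming each event
already carries numeric node, channel and DIMM values: the class name
DimmThresholds and the limits used are hypothetical, chosen only for
illustration.

```python
from collections import Counter


class DimmThresholds:
    """Count corrected errors per node, per channel and per DIMM."""

    def __init__(self, limits):
        # limits maps a level name ("node", "channel", "dimm") to the
        # error count at which that level is reported.
        self.limits = limits
        self.counts = Counter()

    def record(self, node, channel, dimm, count=1):
        """Add errors for one location.

        Returns the list of levels whose running count has reached its
        configured limit.
        """
        keys = {
            "node": (node,),
            "channel": (node, channel),
            "dimm": (node, channel, dimm),
        }
        reached = []
        for level, key in keys.items():
            self.counts[(level, key)] += count
            if self.counts[(level, key)] >= self.limits[level]:
                reached.append(level)
        return reached
```

None of this works unless the event identifies the DIMM unambiguously,
which is the point above: with only a free-form string, the keys used for
counting cannot be trusted.
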
Regards,
Mauro