Message-ID: <20130218122429.239584aa@redhat.com>
Date: Mon, 18 Feb 2013 12:24:29 -0300
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-acpi@...r.kernel.org, Huang Ying <ying.huang@...el.com>,
Tony Luck <tony.luck@...el.com>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH EDAC 07/13] edac: add support for raw error reports
On Mon, 18 Feb 2013 14:52:51 +0100
Borislav Petkov <bp@...en8.de> wrote:
> On Sun, Feb 17, 2013 at 07:44:04AM -0300, Mauro Carvalho Chehab wrote:
> > We could do it for the location. The space for label, however, depends on
> > how many DIMMs are in the system, as multiple dimm's may be present, and
> > the core will point to all possible affected DIMMs.
> >
> > Ok, perhaps we could just allocate one big area for it (like one page),
> > as this would very likely be enough for it, and change the logic to take
> > the buffer size into account when filling it.
>
> Or, in the case where ->label is all dimms on the mci, you simply put
> "All DIMMs on MCI%d" in there and done. Simple.
The core already does this when it has no clue at all about where the
error is.
The core is prepared for the case where the location is only partially
filled, as this is a common scenario in the drivers, and important enough
on some memory controllers.
As already discussed, on most memory controllers nowadays, the memory
controller can't point to a single DIMM, as the error correction code
takes 128 bits (2 DIMMs). It is impossible for the error correction
code to determine on which DIMM an uncorrected error happened[1].
With Nehalem memory controllers, depending on the memory configuration,
the minimal DIMM granularity for an uncorrected error can be even worse:
4 DIMMs, if both 128-bit error correction code and mirror mode are enabled.
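To make the arithmetic above explicit, here is a minimal sketch. It is
not code from the EDAC core: the function name is made up, and it assumes
that each DIMM contributes 64 data bits to the ECC word, which is what
makes a 128-bit code span 2 DIMMs.

```c
#include <assert.h>

/* Assumption: each DIMM contributes 64 data bits to the ECC word. */
#define DATA_BITS_PER_DIMM 64

/*
 * Hypothetical helper: smallest group of DIMMs that an uncorrected
 * error can be narrowed down to.
 */
unsigned int ue_dimm_granularity(unsigned int ecc_word_bits, int mirrored)
{
	unsigned int dimms;

	/* The ECC word must cover a whole number of DIMMs. */
	assert(ecc_word_bits >= DATA_BITS_PER_DIMM);
	assert(ecc_word_bits % DATA_BITS_PER_DIMM == 0);

	dimms = ecc_word_bits / DATA_BITS_PER_DIMM;

	/* Mirroring doubles the set of DIMMs that may hold the bad data. */
	return mirrored ? dimms * 2 : dimms;
}
```

With a 128-bit code this gives 2 DIMMs, and 4 DIMMs once mirror mode is
also enabled.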
There are some corner cases where the driver simply cannot discover on
which channel, or on which DIMM (or csrow) inside a channel, the error
happened. The error could be associated with some failure in the logic
or on the bus that communicates with the Advanced Memory Buffers on an
FB-DIMM memory controller, for example.
So, the core's real worst-case scenario is when the driver can't
determine on which DIMM inside a channel the error happened. As a channel
can have a large number of DIMMs[2] (16? I'm not sure what the worst case
is), the area allocated for the label should be sized conservatively.
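Something along these lines could size the label area and fill it while
taking the buffer size into account. This is only a sketch under stated
assumptions, not the EDAC core code: label_buf_size() and label_append()
are hypothetical names, the " or " separator is an illustrative choice,
and the per-label length is a parameter because the real limit is not
stated here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LABEL_SEP " or "	/* illustrative separator between labels */

/*
 * Hypothetical helper: bytes needed to hold n_labels labels of at most
 * label_len characters each, joined by LABEL_SEP, plus the final NUL.
 */
size_t label_buf_size(size_t n_labels, size_t label_len)
{
	if (n_labels == 0)
		return 1;
	return n_labels * label_len + (n_labels - 1) * strlen(LABEL_SEP) + 1;
}

/*
 * Hypothetical helper: append one DIMM label to the NUL-terminated
 * string in buf without ever writing past size bytes.
 * Returns 0 on success, -1 if the label does not fit; in that case
 * buf keeps its previous content.
 */
int label_append(char *buf, size_t size, const char *label)
{
	size_t used;
	int n;

	assert(buf != NULL && label != NULL && size > 0);

	used = strlen(buf);
	n = snprintf(buf + used, size - used, "%s%s",
		     used ? LABEL_SEP : "", label);
	if (n < 0 || (size_t)n >= size - used) {
		buf[used] = '\0';	/* drop the truncated tail */
		return -1;
	}
	return 0;
}
```

For example, with 8 labels of up to 31 characters each (31 is just an
example value), label_buf_size() asks for 8 * 31 + 7 * 4 + 1 = 277 bytes.
A caller that gets -1 from label_append() could fall back to a generic
label instead of a truncated list.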
[1] Such an error may not even be fatal, if that particular address is
unused.
[2] Currently, up to 8, according to:
$ for i in $(git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*([A-Z][^\s]+);/);'); do echo $i; git grep $i drivers/edac; done|grep define|perl -ne 'print "$1 " if (m/define\s+[^\s]+\s(\d+)/)'
8 8 2 2 4 2 3 3 3 8 4 4 3 3 1 1 4
and
$ git grep "layers.*size\s*=" drivers/edac|perl -ne 'print "$1 " if (m/\=\s*(\d+);/);'
1 1 1 1 2 2 8 4 1 1 1 1
Nothing prevents a future driver from having more than 8 DIMMs per
layer.
--
Cheers,
Mauro