Message-Id: <1327764771-28649-1-git-send-email-mchehab@redhat.com>
Date: Sat, 28 Jan 2012 13:32:35 -0200
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: unlisted-recipients:; (no To-header on input)
Cc: Mauro Carvalho Chehab <mchehab@...hat.com>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
lwang@...hat.com, bp@...64.org, tony.luck@...el.com
Subject: [PATCH RFCv2 00/16] This is the version 2 of the HERM patches
This patch series addresses some problems with the
EDAC subsystem.
There are two groups of changes in this series:
a) a trace-based class of events for hardware errors is
added (Hardware Events Report Mechanism - HERM);
The need to move to a tracepoint-based approach has
already been widely discussed on the ML. Basically, it offers
more flexibility than message dumps on the console, allowing
event filtering and other sorts of improvements.
The long-term target is that memory errors will generate
events like:
Corrected error: memory read error on DIMM_1A (row 1, channel 0, rank=5, cpu=0, Err=0001:0090, addr = 0x7a789f03e)
Uncorrected error: memory write error on DIMM_2B (row 2, channel 3, rank=4, cpu=1, Err=0001:0091, addr = 0xdeadbeef)
I.e., putting the user-relevant information first, while
keeping in parentheses the technical details that could help
hardware manufacturers and those who might want to replace
a DRAM chip.
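To make the target layout concrete, here is a minimal userspace sketch that builds a message in that shape. It is not the tracepoint code from this series: struct hw_mem_error, its field names and format_mem_error() are hypothetical names chosen for illustration, and the field widths are simply read off the two sample lines above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the fields seen in the sample events. */
struct hw_mem_error {
	bool corrected;          /* corrected vs. uncorrected error */
	bool is_write;           /* write vs. read access */
	const char *label;       /* DIMM label, e.g. "DIMM_1A" */
	unsigned int row, channel, rank, cpu;
	unsigned int err_hi, err_lo;   /* the two halves of "Err=" */
	unsigned long long addr;
};

/*
 * Format one event: user-relevant part first, technical details in
 * parentheses. Returns what snprintf() returns.
 */
int format_mem_error(char *buf, size_t len, const struct hw_mem_error *e)
{
	assert(buf != NULL && e != NULL);

	return snprintf(buf, len,
		"%s error: memory %s error on %s "
		"(row %u, channel %u, rank=%u, cpu=%u, "
		"Err=%04x:%04x, addr = 0x%llx)",
		e->corrected ? "Corrected" : "Uncorrected",
		e->is_write ? "write" : "read",
		e->label, e->row, e->channel, e->rank, e->cpu,
		e->err_hi, e->err_lo, e->addr);
}
```

Keeping the severity, the access type and the DIMM label ahead of the parenthesis means a log reader can act on the message without decoding the controller-specific fields.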
b) the edac core was changed to better support memory
controllers that aren't able to see csrows.
The EDAC subsystem was originally written to work with
memory controllers directly connected to the DIMM chips.
Not all memory architectures use this concept. For example,
FBDIMM memories are connected via a buffer, called AMB [1].
When an AMB is present, the memory controller only sees
its communication bus, called "channel". This has nothing
to do with the "csrow channel" concept, which is widely used in
the subsystem and is mandatory there. All drivers that work with
such architectures currently need to fake data, lying to
the edac core, in order to work.
Lying to the subsystem in general is not a good idea ;)
So, this series addresses it by splitting the DIMM information
from the EDAC csrow_info struct, and creating a new set of
DIMM-oriented sysfs nodes:
/sys/devices/system/edac/mc/mc0
├── dimm0
│ ├── dimm_dev_type
│ ├── dimm_edac_mode
│ ├── dimm_label
│ ├── dimm_location
│ ├── dimm_mem_type
│ └── dimm_size
...
└── dimm3
├── dimm_dev_type
├── dimm_edac_mode
├── dimm_label
├── dimm_location
├── dimm_mem_type
└── dimm_size
The DIMM description looks like:
dimm_dev_type:x8
dimm_edac_mode:S8ECD8ED
dimm_label:DIMM_3A
dimm_location:branch 1 channel 0 dimm 1
dimm_mem_type:Unbuffered-DDR3
dimm_size:1024
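As an illustration of how userspace might consume these nodes, here is a minimal sketch that reads one attribute file and parses a location string. The helper names read_dimm_attr() and parse_dimm_location() are hypothetical, and the parser assumes the FBDIMM-style "branch N channel N dimm N" layout shown above; other drivers may report a different location format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the attribute file "<dir>/<attr>" into buf and strip the
 * trailing newline. Returns 0 on success, -1 on failure.
 */
int read_dimm_attr(const char *dir, const char *attr, char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(dir && attr && buf && len > 0);

	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Parse a location such as "branch 1 channel 0 dimm 1".
 * Returns 0 on success, -1 if the string does not match.
 */
int parse_dimm_location(const char *s, unsigned int *branch,
			unsigned int *channel, unsigned int *dimm)
{
	assert(s && branch && channel && dimm);

	if (sscanf(s, "branch %u channel %u dimm %u",
		   branch, channel, dimm) != 3)
		return -1;
	return 0;
}
```

On a kernel carrying this series, the directory argument would be something like /sys/devices/system/edac/mc/mc0/dimm0 and the attribute dimm_label or dimm_location.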
For now, the existing struct is left untouched. The next step
(as indicated in the last patch of this series) is to
create the error counters.
This is still an RFC, as it is not complete, and some
changes will require more testing. Also, I have not yet tried to
compile it on non-x86 archs.
[1] http://www.interfacebus.com/Memory_Module_DDR2_FB_DIMM.html
Please review.
Thanks!
Mauro
-
Mauro Carvalho Chehab (16):
events/hw_event: Create a Hardware Events Report Mecanism (HERM)
events/hw_event: use __string() trace macros for events
hw_event: Consolidate uncorrected/corrected error msgs into one
drivers/edac: rename channel_info to csrow_channel_info
edac: Create a dimm struct and move the labels into it
edac_mc_sysfs: Fix error handling
edac: Add per dimm's sysfs nodes
edac: Prepare to push down to drivers the filling of the dimm_info
i5400_edac: Convert it to report memory with the new location
i7300_edac: Convert it to report memory with the new location
edac: move dimm properties to struct dimm_info
edac: Don't initialize csrow's first_page & friends when not needed
edac: move nr_pages to dimm struct
edac: Add per-dimm sysfs show nodes
edac: DIMM location cleanup
edac: Add an error scope logic
drivers/edac/amd64_edac.c | 72 +++-------
drivers/edac/amd76x_edac.c | 14 +-
drivers/edac/cell_edac.c | 18 ++-
drivers/edac/cpc925_edac.c | 70 +++++-----
drivers/edac/e752x_edac.c | 48 ++++---
drivers/edac/e7xxx_edac.c | 49 ++++---
drivers/edac/edac_mc.c | 168 ++++++++++++++++++-----
drivers/edac/edac_mc_sysfs.c | 283 ++++++++++++++++++++++++++++++++++++---
drivers/edac/i3000_edac.c | 24 ++--
drivers/edac/i3200_edac.c | 24 ++--
drivers/edac/i5000_edac.c | 31 ++---
drivers/edac/i5100_edac.c | 67 +++++-----
drivers/edac/i5400_edac.c | 46 +++----
drivers/edac/i7300_edac.c | 47 ++++---
drivers/edac/i7core_edac.c | 46 +++----
drivers/edac/i82443bxgx_edac.c | 15 ++-
drivers/edac/i82860_edac.c | 13 +-
drivers/edac/i82875p_edac.c | 22 ++-
drivers/edac/i82975x_edac.c | 28 +++--
drivers/edac/mpc85xx_edac.c | 16 ++-
drivers/edac/mv64x60_edac.c | 22 ++--
drivers/edac/pasemi_edac.c | 24 ++--
drivers/edac/ppc4xx_edac.c | 25 ++--
drivers/edac/r82600_edac.c | 13 +-
drivers/edac/sb_edac.c | 44 ++++---
drivers/edac/tile_edac.c | 17 +--
drivers/edac/x38_edac.c | 24 ++--
include/linux/edac.h | 90 +++++++++++--
include/trace/events/hw_event.h | 133 ++++++++++++++++++
29 files changed, 1018 insertions(+), 475 deletions(-)
create mode 100644 include/trace/events/hw_event.h
--
1.7.8