Message-ID: <4F9692AD.8090000@redhat.com>
Date: Tue, 24 Apr 2012 08:46:53 -0300
From: Mauro Carvalho Chehab <mchehab@...hat.com>
To: Borislav Petkov <bp@...64.org>
CC: Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Doug Thompson <norsk5@...oo.com>
Subject: Re: [EDAC PATCH v13 6/7] edac.h: Prepare to handle with generic layers
On 24-04-2012 07:40, Borislav Petkov wrote:
> On Mon, Apr 23, 2012 at 06:30:54PM +0000, Mauro Carvalho Chehab wrote:
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct edac_mc_layer - describes the memory controller hierarchy
>>>> + * @layer: layer type
>>>> + * @size:maximum size of the layer
>>>> + * @is_csrow: This layer is part of the "csrow" when old API
>>>> + * compatibility mode is enabled. Otherwise, it is
>>>> + * a channel
>>>> + */
>>>> +struct edac_mc_layer {
>>>> + enum edac_mc_layer_type type;
>>>> + unsigned size;
>>>> + bool is_csrow;
>>>> +};
>>>
>>> Huh, why do you need is_csrow? Can't do
>>>
>>> type = EDAC_MC_LAYER_CHIP_SELECT;
>>>
>>> ?
>>
>> No, that's different. For a csrow-based memory controller, is_csrow is equal to
>> type == EDAC_MC_LAYER_CHIP_SELECT, but, for the other memory controllers, this
>> is used to mark which layers will be used for the "fake csrow" exported by the
>> EDAC core through the legacy API.
>
> I don't understand this, do you mean: "this will be used to mark which
> layer will be used to fake a csrow"...?
I've already explained this dozens of times: on x86, except for amd64_edac and
the drivers for legacy hardware (+7 years old), the information filled in struct
csrow_info is FAKE. That's basically one of the main reasons for this patchset.
There are no csrow signals accessed by the memory controller on FB-DIMM/RAMBUS, and, on DDR3
Intel memory controllers, channels can be filled with memories of different sizes.
For example, this is how the 4 DIMM banks are filled on an HP Z400
with an Intel W3505 CPU:
$ ./edac-ctl --layout
       +-----------------------------------+
       |                mc0                |
       | channel0  | channel1  | channel2  |
-------+-----------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |
slot1: |  1024 MB  |     0 MB  |     0 MB  |
slot0: |  1024 MB  |  1024 MB  |  1024 MB  |
-------+-----------------------------------+
These are the logs that dump the memory controller registers:
[ 115.818947] EDAC DEBUG: get_dimm_config: Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
[ 115.818950] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
[ 115.818955] EDAC DEBUG: get_dimm_config: dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
[ 115.818982] EDAC DEBUG: get_dimm_config: Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
[ 115.818985] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
[ 115.819012] EDAC DEBUG: get_dimm_config: Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
[ 115.819016] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
The Nehalem memory controllers allow up to 3 DIMMs per channel and have 3 channels (so,
a total of 9 DIMMs). Most motherboards, however, expose either 4 or 8 DIMM sockets per CPU,
so it isn't possible to have all channels and dimms filled on them.
On this motherboard, DIMM1 to DIMM3 are mapped to the first dimm# at channels 0 to 2, and
DIMM4 goes to the second dimm# at channel 0.
See? On slot 1, only channel 0 is filled.
Even if this memory controller were rank-based[1], the channel information
can't be mapped using the legacy EDAC API, as, in the old API, all channels need to be
filled with memories of the same size. So, this driver uses both the slot layer and
the channel layer to compose the fake csrow.
[1] As you can see from the logs and from the source code, the MC registers aren't per rank,
they are per DIMM. The number of ranks is just one attribute of the register that describes
a DIMM. The MCA error registers, however, don't map the rank when reporting an error,
nor are the error counters per rank. So, while it is possible to enumerate information
per rank, error detection is always per DIMM.
the "is_csrow" property (or "is_virt_csrow" after the renaming patch) is there to indicate
what memory layer will compose the "csrow" when using the legacy API.
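To make this more concrete, here is a minimal sketch, not taken from the patchset, of how
a Nehalem-style driver could describe its hierarchy. NUM_CHANS and MAX_DIMM_PER_CHANNEL are
made-up constants, and I'm assuming EDAC_MC_LAYER_CHANNEL/EDAC_MC_LAYER_SLOT values in the
layer type enum; the point is just that both layers get flagged, so the EDAC core folds
them into the fake csrows of the legacy API:

	/*
	 * Illustrative only: a slot/channel organized controller, where
	 * no single layer maps to a real chip select signal.
	 */
	struct edac_mc_layer layers[2];

	layers[0].type = EDAC_MC_LAYER_CHANNEL;
	layers[0].size = NUM_CHANS;		/* 3 channels on Nehalem */
	layers[0].is_csrow = true;
	layers[1].type = EDAC_MC_LAYER_SLOT;
	layers[1].size = MAX_DIMM_PER_CHANNEL;	/* up to 3 DIMMs per channel */
	layers[1].is_csrow = true;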
> [..]
>
>> With regards to the changes at edac_mc_sysfs, it will likely affect all per-dimm
>> routines, plus the counters reset logic. The problem of pointing to a set of
>> routines that need changes is that this list can/will change with time.
>>
>> So, the intention behind this note is not to give an exhaustive list of what should
>> be changed if EDAC_MAX_LAYERS is incremented. Instead, it is meant to give a
>> clue that incrementing the number of layers is not as easy as just changing
>> it: the code would need to be changed as well.
>
> Then write that instead of adding a clueless note which only confuses readers.
>
>>
>>>
>>>> + */
>>>> +#define EDAC_MAX_LAYERS 3
>>>> +
>>>> +/*
>>>> + * A loop could be used here to make it more generic, but, as we only have
>>>> + * 3 layers, this is a little faster. By design, layers can never be 0 or
>>>> + * more than 3. If that ever happens, a NULL is returned, causing an OOPS
>>>> + * during the memory allocation routine, with would point to the developer
>>>> + * that he's doing something wrong.
>>>> + */
>>>> +#define GET_POS(layers, var, nlayers, lay0, lay1, lay2) ({ \
>>>
>>> This is returning size per layers so it cannot be GET_POS(), AFAICT.
>>> EDAC_GET_SIZE or similar maybe?
>>
>> This is not returning the size per layer. It is returning a pointer to the
>> structure that holds the dimm.
>>
>>>
>>>> + typeof(var) __p; \
>>>> + if ((nlayers) == 1) \
>>>> + __p = &var[lay0]; \
>>>> + else if ((nlayers) == 2) \
>>>> + __p = &var[(lay1) + ((layers[1]).size * (lay0))]; \
>>>> + else if ((nlayers) == 3) \
>>>> + __p = &var[(lay2) + ((layers[2]).size * ((lay1) + \
>>>> + ((layers[1]).size * (lay0))))]; \
>>>> + else \
>>>> + __p = NULL; \
>>>> + __p; \
>>>> +})
>
> Ok, I'm looking at your next patch trying to understand this thing:
(That's why these were merged with the big patch. The meaning of this
patch and the next one is explained by the changes that happened in the
drivers. Those two patches, plus the other 26 patches, are part of a single
logical change.)
>
> + /*
> + * Fills the dimm struct
> + */
> + memset(&pos, 0, sizeof(pos));
> + row = 0;
> + chn = 0;
> + debugf4("%s: initializing %d dimms\n", __func__, tot_dimms);
> + for (i = 0; i < tot_dimms; i++) {
> + chan = &csi[row].channels[chn];
> + dimm = GET_POS(lay, mci->dimms, n_layers,
> + pos[0], pos[1], pos[2]);
>
> pos is an unsigned[3] array with all its elements set to 0 in the memset
> above. Which means I need a run-variable like that all the time whenever
> I iterate over the layers.
Yes.
>
> Now, say nlayers == 3, then your macro does this:
>
> __p = &var[(lay2) + ((layers[2]).size * ((lay1) + ((layers[1]).size * (lay0))))];
>
> So I'm multiplying a loop variable with layers[i].size which is the
> maximum size of the layer. What does that mean, where is this size
> initialized?
The layers array is initialized by the drivers. For example, in amd64_edac (see
patch 01/26):
@@ -2520,7 +2561,13 @@ static int amd64_init_one_instance(struct pci_dev *F2)
 		goto err_siblings;
 	ret = -ENOMEM;
-	mci = edac_mc_alloc(0, pvt->csels[0].b_cnt, pvt->channel_count, nid);
+	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
+	layers[0].size = pvt->csels[0].b_cnt;
+	layers[0].is_csrow = true;
+	layers[1].type = EDAC_MC_LAYER_CHANNEL;
+	layers[1].size = pvt->channel_count;
+	layers[1].is_csrow = false;
+	mci = new_edac_mc_alloc(nid, ARRAY_SIZE(layers), layers, false, 0);
 	if (!mci)
 		goto err_siblings;
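Just to illustrate what the core is then expected to do with that array (a rough sketch of
the idea, not the actual allocator code from the patch): the total number of dimm entries
is the product of the per-layer sizes, and mci->dimms is a flat array of that length, which
GET_POS() indexes:

	/* sketch only: how many struct dimm_info entries the MC needs */
	unsigned tot_dimms = 1;
	int i;

	for (i = 0; i < n_layers; i++)
		tot_dimms *= layers[i].size;
	/* for the amd64_edac hunk above: csels[0].b_cnt * channel_count */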
>
> I can imagine that I'll get an element in the mci->dimms array in the
> end
Yes. In the case of a 2-layer MC, this would be similar to creating a
two-dimensional array and using:
dimm = mci->dimm[x][y]
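Or, spelling out just the arithmetic for the 2-layer case (same names as in the macro
definition above):

	/*
	 * GET_POS(layers, mci->dimms, 2, x, y, 0) reduces to the usual
	 * row-major indexing of a flat array:
	 */
	dimm = &mci->dimms[y + layers[1].size * x];
	/* i.e. element [x][y] of a layers[0].size by layers[1].size matrix */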
> but this is very confusing.
>
> So please explain what each argument of this macro exactly means.
>
I will.
Regards,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/