[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c9ae8b26-e254-47a7-8e2f-b5da90f50030@amd.com>
Date: Wed, 6 Aug 2025 16:08:07 -0500
From: "Naik, Avadhut" <avadnaik@....com>
To: Yazen Ghannam <yazen.ghannam@....com>, Avadhut Naik <avadhut.naik@....com>
Cc: linux-edac@...r.kernel.org, bp@...en8.de, john.allen@....com,
linux-kernel@...r.kernel.org
Subject: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
Hi,
On 7/28/2025 10:14, Yazen Ghannam wrote:
> On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
>> Currently, the amd64_edac module only provides UMC normalized and system
>
> The amd64_edac module provides data for the EDAC interface. This is only
> the system physical address (PFN + offset). The UMC normalized address
> comes from MCA error decoding.
>
Will reword this part.
>> physical address when a DRAM ECC error occurs. DRAM Address on which the
>> error has occurred is not provided since the support required to translate
>> the normalized address into DRAM address has not yet been implemented.
>
> I don't think this last sentence is necessary.
>
Noted.
>>
>> This support however, has now been implemented through an earlier patch
>> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
>> and DRAM address, which provides additional debugging information relating
>> to the error received, can now be logged by the module.
>>
>> Add the required support to log DRAM address on which the error has been
>> received in dmesg and through the RAS tracepoint.
>
> These last two paragraphs could be something like this:
>
> "Use the new PRM call in the AMD Address Translation Library to gather the
> DRAM address of an error. Include this data in the EDAC 'string' so it
> is available in the kernel messages and EDAC trace event."
>
Okay. Will change them.
>>
>> Signed-off-by: Avadhut Naik <avadhut.naik@....com>
>> ---
>> drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>> drivers/edac/amd64_edac.h | 1 +
>> 2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>> index 07f1e9dc1ca7..36b46cd81bb2 100644
>> --- a/drivers/edac/amd64_edac.c
>> +++ b/drivers/edac/amd64_edac.c
>> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>> switch (err->err_code) {
>> case DECODE_OK:
>> string = "";
>> + if (err->dram_addr) {
>> + char s[100];
>> +
>> + memset(s, 0, 100);
>> + sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>> + " Row: 0x%x Column: 0x%x"
>> + " RankMul: 0x%x SubChannel: 0x%x",
>
> There's a checkpatch warning here about splitting user-visible strings.
>
> Why not use scnprintf() or the like?
>
Had noticed the checkpatch warning initially.
IIRC, it was for splitting the quoted string across multiple lines.
Can use scnprintf here. But wont the warning still prevail?
One way I can think of for getting rid of the warning is to generate the string
through multiple scnprintf calls. Something like below:
memset(s, 0, len);
n = scnprintf(s + n, len - n, "Cs: 0x%x Bank Grp: 0x%x",
err->dram_addr->chip_select,
err->dram_addr->bank_group);
n += scnprintf(s + n, len - n, " Bank Addr: 0x%x",
err->dram_addr->bank_addr);
n += scnprintf(s + n, len - n, " Row: 0x%x Column: 0x%x",
err->dram_addr->row_addr,
err->dram_addr->col_addr);
n += scnprintf(s + n, len - n, " RankMul: 0x%x SubChannel: 0x%x",
err->dram_addr->rank_mul,
err->dram_addr->sub_ch);
pr_err("%s: s: %s\n", __func__, s);
string = s;
Is this acceptable?
>> + err->dram_addr->chip_select,
>> + err->dram_addr->bank_group,
>> + err->dram_addr->bank_addr,
>> + err->dram_addr->row_addr,
>> + err->dram_addr->col_addr,
>> + err->dram_addr->rank_mul,
>> + err->dram_addr->sub_ch);
>> + string = s;
>
> Looking at the description for 'edac_mc_handle_error()', it seems that
> "other_detail" would be more appropriate for this info.
>
> I do think we should consider updating the EDAC interface if multiple
> vendors are gathering this data.
>
Okay, will use "other_detail" parameter of edac_mc_handle_error() for this.
>> + }
>> break;
>> case ERR_NODE:
>> string = "Failed to map error addr to a node";
>> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>> static void decode_umc_error(int node_id, struct mce *m)
>> {
>> u8 ecc_type = (m->status >> 45) & 0x3;
>> + struct dram_addr dram_addr;
>> struct mem_ctl_info *mci;
>> unsigned long sys_addr;
>> struct amd64_pvt *pvt;
>> struct atl_err a_err;
>> struct err_info err;
>> + int ret;
>>
>> node_id = fixup_node_id(node_id, m);
>>
>> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>>
>> pvt = mci->pvt_info;
>>
>> + memset(&dram_addr, 0, sizeof(dram_addr));
>> memset(&err, 0, sizeof(err));
>>
>> if (m->status & MCI_STATUS_DEFERRED)
>> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>> goto log_error;
>> }
>>
>> + ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
>> + if (!ret)
>> + err.dram_addr = &dram_addr;
>
> I feel like it is not necessary to pass a second struct if it is already
> contained in another.
>
> You could just clear (or not set) that field if an error occurs.
>
Slightly confused here.
Do you mean we should avoid passing dram_addr as second parameter
for amd_convert_umc_mca_addr_to_dram_addr() and instead just pass
struct err_info instance err?
And, in case some error occurs, we should just do
err.dram_addr = 0x0;
?
>> +
>> error_address_to_page_and_offset(sys_addr, &err);
>>
>> log_error:
>> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
>> index 17228d07de4c..88b0b8425ab3 100644
>> --- a/drivers/edac/amd64_edac.h
>> +++ b/drivers/edac/amd64_edac.h
>> @@ -399,6 +399,7 @@ struct err_info {
>> u16 syndrome;
>> u32 page;
>> u32 offset;
>> + struct dram_addr *dram_addr;
>> };
>>
>> static inline u32 get_umc_base(u8 channel)
>> --
>
> Thanks,
> Yazen
--
Thanks,
Avadhut Naik
Powered by blists - more mailing lists