linux-kernel - [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Message-ID: <c9ae8b26-e254-47a7-8e2f-b5da90f50030@amd.com>
Date: Wed, 6 Aug 2025 16:08:07 -0500
From: "Naik, Avadhut" <avadnaik@....com>
To: Yazen Ghannam <yazen.ghannam@....com>, Avadhut Naik <avadhut.naik@....com>
Cc: linux-edac@...r.kernel.org, bp@...en8.de, john.allen@....com,
 linux-kernel@...r.kernel.org
Subject: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message

Hi,

On 7/28/2025 10:14, Yazen Ghannam wrote:
> On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
>> Currently, the amd64_edac module only provides UMC normalized and system
> 
> The amd64_edac module provides data for the EDAC interface. This is only
> the system physical address (PFN + offset). The UMC normalized address
> comes from MCA error decoding.
> 
Will reword this part.
>> physical address when a DRAM ECC error occurs. DRAM Address on which the
>> error has occurred is not provided since the support required to translate
>> the normalized address into DRAM address has not yet been implemented.
> 
> I don't think this last sentence is necessary.
> 
Noted.
>>
>> This support however, has now been implemented through an earlier patch
>> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
>> and DRAM address, which provides additional debugging information relating
>> to the error received, can now be logged by the module.
>>
>> Add the required support to log DRAM address on which the error has been
>> received in dmesg and through the RAS tracepoint.
> 
> These last two paragraphs could be something like this:
> 
> "Use the new PRM call in the AMD Address Translation Library to gather the
> DRAM address of an error. Include this data in the EDAC 'string' so it
> is available in the kernel messages and EDAC trace event." 
> 
Okay. Will change them.
>>
>> Signed-off-by: Avadhut Naik <avadhut.naik@....com>
>> ---
>>  drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>>  drivers/edac/amd64_edac.h |  1 +
>>  2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>> index 07f1e9dc1ca7..36b46cd81bb2 100644
>> --- a/drivers/edac/amd64_edac.c
>> +++ b/drivers/edac/amd64_edac.c
>> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>>  	switch (err->err_code) {
>>  	case DECODE_OK:
>>  		string = "";
>> +		if (err->dram_addr) {
>> +			char s[100];
>> +
>> +			memset(s, 0, 100);
>> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>> +					   " Row: 0x%x Column: 0x%x"
>> +					   " RankMul: 0x%x SubChannel: 0x%x",
> 
> There's a checkpatch warning here about splitting user-visible strings.
> 
> Why not use scnprintf() or the like?
> 
I had noticed the checkpatch warning initially.
IIRC, it was for splitting the quoted string across multiple lines.
We can use scnprintf() here, but won't the warning still be there?
One way I can think of to get rid of the warning is to build the string
through multiple scnprintf() calls, something like below:

            memset(s, 0, len);
            /* First call writes at the start of the buffer, so n is not read before it is set. */
            n = scnprintf(s, len, "Cs: 0x%x Bank Grp: 0x%x",
                      err->dram_addr->chip_select,
                      err->dram_addr->bank_group);
            n += scnprintf(s + n, len - n, " Bank Addr: 0x%x",
                      err->dram_addr->bank_addr);
            n += scnprintf(s + n, len - n, " Row: 0x%x Column: 0x%x",
                      err->dram_addr->row_addr,
                      err->dram_addr->col_addr);
            n += scnprintf(s + n, len - n, " RankMul: 0x%x SubChannel: 0x%x",
                      err->dram_addr->rank_mul,
                      err->dram_addr->sub_ch);

            pr_err("%s: s: %s\n", __func__, s);
            string = s;

Is this acceptable?
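
For anyone who wants to try the accumulation pattern outside the kernel, here
is a rough userspace sketch. It is only an approximation under a few
assumptions: scn_append() is a hypothetical stand-in for scnprintf() built on
vsnprintf(), format_dram_addr() is a made-up helper name, and the field types
of struct dram_addr are guessed (only the field names come from the patch).

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: field names come from the patch, the types are a guess. */
struct dram_addr {
	unsigned int chip_select;
	unsigned int bank_group;
	unsigned int bank_addr;
	unsigned int row_addr;
	unsigned int col_addr;
	unsigned int rank_mul;
	unsigned int sub_ch;
};

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf(): returns the
 * number of characters actually stored in buf, excluding the trailing NUL.
 */
static int scn_append(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int ret;

	if (!size)
		return 0;

	va_start(args, fmt);
	ret = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (ret < 0) {
		/* Encoding error: leave an empty string behind. */
		buf[0] = '\0';
		return 0;
	}
	if ((size_t)ret >= size)
		return (int)(size - 1);	/* output was truncated */
	return ret;
}

/* Hypothetical helper: build the string piecewise, one format per call. */
static int format_dram_addr(char *s, size_t len, const struct dram_addr *d)
{
	int n = 0;

	n += scn_append(s + n, len - n, "Cs: 0x%x Bank Grp: 0x%x",
			d->chip_select, d->bank_group);
	n += scn_append(s + n, len - n, " Bank Addr: 0x%x",
			d->bank_addr);
	n += scn_append(s + n, len - n, " Row: 0x%x Column: 0x%x",
			d->row_addr, d->col_addr);
	n += scn_append(s + n, len - n, " RankMul: 0x%x SubChannel: 0x%x",
			d->rank_mul, d->sub_ch);

	return n;
}
```

Because each call returns what was actually stored, n never goes past
len - 1, so a too-small buffer ends up with a truncated string instead of an
overflow.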

>> +					   err->dram_addr->chip_select,
>> +					   err->dram_addr->bank_group,
>> +					   err->dram_addr->bank_addr,
>> +					   err->dram_addr->row_addr,
>> +					   err->dram_addr->col_addr,
>> +					   err->dram_addr->rank_mul,
>> +					   err->dram_addr->sub_ch);
>> +			string = s;
> 
> Looking at the description for 'edac_mc_handle_error()', it seems that
> "other_detail" would be more appropriate for this info.
> 
> I do think we should consider updating the EDAC interface if multiple
> vendors are gathering this data.
> 
Okay, will use "other_detail" parameter of edac_mc_handle_error() for this.

>> +		}
>>  		break;
>>  	case ERR_NODE:
>>  		string = "Failed to map error addr to a node";
>> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>>  static void decode_umc_error(int node_id, struct mce *m)
>>  {
>>  	u8 ecc_type = (m->status >> 45) & 0x3;
>> +	struct dram_addr dram_addr;
>>  	struct mem_ctl_info *mci;
>>  	unsigned long sys_addr;
>>  	struct amd64_pvt *pvt;
>>  	struct atl_err a_err;
>>  	struct err_info err;
>> +	int ret;
>>  
>>  	node_id = fixup_node_id(node_id, m);
>>  
>> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  
>>  	pvt = mci->pvt_info;
>>  
>> +	memset(&dram_addr, 0, sizeof(dram_addr));
>>  	memset(&err, 0, sizeof(err));
>>  
>>  	if (m->status & MCI_STATUS_DEFERRED)
>> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  		goto log_error;
>>  	}
>>  
>> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
>> +	if (!ret)
>> +		err.dram_addr = &dram_addr;
> 
> I feel like it is not necessary to pass a second struct if it is already
> contained in another.
> 
> You could just clear (or not set) that field if an error occurs.
>
Slightly confused here.
Do you mean we should avoid passing dram_addr as the second parameter
to amd_convert_umc_mca_addr_to_dram_addr() and instead just pass the
struct err_info instance err?

And, in case an error occurs, we should just do
	err.dram_addr = NULL;
?
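
If the idea is instead to embed the structure in struct err_info rather than
point to it, one possible reading is sketched below. This is only my guess at
the intent, not what the patch currently does: the dram_addr_valid flag, the
field types, the fill_dram_addr() helper and the fake_convert() stub standing
in for amd_convert_umc_mca_addr_to_dram_addr() are all hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed layout: field names come from the patch, the types are a guess. */
struct dram_addr {
	unsigned int chip_select;
	unsigned int bank_group;
	unsigned int bank_addr;
	unsigned int row_addr;
	unsigned int col_addr;
	unsigned int rank_mul;
	unsigned int sub_ch;
};

/* Trimmed-down err_info with the DRAM address embedded, not pointed to. */
struct err_info {
	unsigned short syndrome;
	unsigned int page;
	unsigned int offset;
	struct dram_addr dram_addr;
	bool dram_addr_valid;		/* hypothetical validity flag */
};

/* Hypothetical stub for the translation call; returns 0 on success. */
static int fake_convert(bool fail, struct dram_addr *out)
{
	if (fail)
		return -1;

	out->chip_select = 1;
	out->row_addr = 0x1a;
	return 0;
}

/* Fill the embedded field directly; clear it if the translation fails. */
static void fill_dram_addr(struct err_info *err, bool fail)
{
	if (fake_convert(fail, &err->dram_addr)) {
		memset(&err->dram_addr, 0, sizeof(err->dram_addr));
		err->dram_addr_valid = false;
		return;
	}

	err->dram_addr_valid = true;
}
```

With this layout, decode_umc_error() would not need a separate local
struct dram_addr, and the logging path would check the flag instead of a
pointer.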
 
>> +
>>  	error_address_to_page_and_offset(sys_addr, &err);
>>  
>>  log_error:
>> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
>> index 17228d07de4c..88b0b8425ab3 100644
>> --- a/drivers/edac/amd64_edac.h
>> +++ b/drivers/edac/amd64_edac.h
>> @@ -399,6 +399,7 @@ struct err_info {
>>  	u16 syndrome;
>>  	u32 page;
>>  	u32 offset;
>> +	struct dram_addr *dram_addr;
>>  };
>>  
>>  static inline u32 get_umc_base(u8 channel)
>> -- 
> 
> Thanks,
> Yazen

-- 
Thanks,
Avadhut Naik