linux-kernel - Re: [PATCH 7/7] EDAC/amd64: RAS: platform/x86/amd: Identify all physical pages in row

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <f9e2d199-1813-4b4a-83fd-bf93919a3411@amd.com>
Date:   Thu, 26 Oct 2023 10:31:04 -0400
From:   Yazen Ghannam <yazen.ghannam@....com>
To:     Muralidhara M K <muralimk@....com>, linux-edac@...r.kernel.org,
        x86@...nel.org
Cc:     yazen.ghannam@....com, linux-kernel@...r.kernel.org, bp@...en8.de,
        mchehab@...nel.org, Muralidhara M K <muralidhara.mk@....com>
Subject: Re: [PATCH 7/7] EDAC/amd64: RAS: platform/x86/amd: Identify all
 physical pages in row

On 10/25/2023 3:33 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@....com>


The $SUBJECT needs to be updated.

> 
> AMD systems have HBM memory embedded with the chips, The entire memory
> is managed by host OS. Error containment needs to be reliable, because
> HBM memory cannot be replaced.
> 
> Persist all UMC DRAM ECC errors, the OS can make the bad or poisoned page
> state persistent so that it will not use the memory upon the next boot.
> 

There is nothing in this patch regarding persistence. It finds all 
system physical addresses covering a DRAM row and request their pages to 
be retired.

> The reported MCA error address in HBM in the format PC/SID/Bank/ROW/COL
> For example, In MI300A C1/C0 (column bits 1-0) is at SPA bit 6-5. Assuming
> PFN only looks at SPA bit 12 or higher, column bits 1-0 could be skipped.
> For PFN, SPA bits higher or equal than 12 matters. So column bits c2, c3
> and c4 gives 8 possible combination of addresses in a row.
> 
> So, Identify all physical pages in a HBM row and retire all the pages
> to get rid of intermittent or recurrent memory errors.
> 

There are a number of grammatical errors in the commit message. Please 
fix them.

> Signed-off-by: Muralidhara M K <muralidhara.mk@....com>
> ---
>   drivers/edac/amd64_edac.c |   5 ++
>   drivers/ras/amd/atl/umc.c | 103 ++++++++++++++++++++++++++++++++++++++
>   include/linux/amd-atl.h   |   2 +
>   3 files changed, 110 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 79c6c552ee14..d0db11e19a46 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2838,6 +2838,11 @@ static void decode_umc_error(int node_id, struct mce *m)
>   
>   	error_address_to_page_and_offset(sys_addr, &err);
>   
> +	if (pvt->fam == 0x19 && (pvt->model >= 0x90 && pvt->model <= 0x9f)) {
> +		if (identify_poison_pages_retire_row(m))
> +			return;

EDAC can still log the original error. So why return early here?

> +	}
> +
>   log_error:
>   	__log_ecc_error(mci, &err, ecc_type);
>   }
> diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
> index 52247a7949fb..d31ad7680ff1 100644
> --- a/drivers/ras/amd/atl/umc.c
> +++ b/drivers/ras/amd/atl/umc.c
> @@ -255,3 +255,106 @@ int umc_mca_addr_to_sys_addr(struct mce *m, u64 *sys_addr)
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(umc_mca_addr_to_sys_addr);
> +
> +/*
> + * High Bandwidth Memory (HBM v3) has fixed number of columns in a
> + * row (8 columns in one HBM row).
> + * Extract column bits to find all the combination of masks to retire
> + * all the poison pages in a row.
> + */
> +#define MAX_COLUMNS_IN_HBM_ROW	8

Is this true in general? Or is it specific to a product?

> +
> +/* The C2 bit in CH NA address */
> +#define UMC_NA_C2_BIT	BIT(8)
> +/* The C3 bit in CH NA address */
> +#define UMC_NA_C3_BIT	BIT(9)
> +/* The C4 bit in CH NA address */
> +#define UMC_NA_C4_BIT	BIT(14)
> +

C2/C3/C4 is unclear in the comments. Please expand these to be more 
explicit, like Column 2 bit, etc.

> +/* masks to get all possible combinations of column addresses */
> +#define C_1_1_1_MASK	(UMC_NA_C4_BIT | UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_1_1_0_MASK	(UMC_NA_C4_BIT | UMC_NA_C3_BIT)
> +#define C_1_0_1_MASK	(UMC_NA_C4_BIT | UMC_NA_C2_BIT)
> +#define C_1_0_0_MASK	(UMC_NA_C4_BIT)
> +#define C_0_1_1_MASK	(UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_0_1_0_MASK	(UMC_NA_C3_BIT)
> +#define C_0_0_1_MASK	(UMC_NA_C2_BIT)
> +#define C_0_0_0_MASK	~C_1_1_1_MASK
> +
> +/* Identify all combination of column address physical pages in a row */

This comment is not clear to me.

> +static int amd_umc_identify_pages_in_row(struct mce *m, u64 *spa_addr)

Also, this function does not identify pages. It identifies system 
physical addresses.

Additionally, this function seems very much specific to this 
implementation of HBM3. So the function name should indicate this.

> +{
> +	u8 cs_inst_id = get_cs_inst_id(m);
> +	u8 socket_id = get_socket_id(m);
> +	u64 norm_addr = get_norm_addr(m);
> +	u8 die_id = get_die_id(m);
> +	u16 df_acc_id = get_df_acc_id(m);
> +
> +	u64 retire_addr, column;
> +	u64 column_masks[] = { 0, C_0_0_1_MASK, C_0_1_0_MASK, C_0_1_1_MASK,
> +			C_1_0_0_MASK, C_1_0_1_MASK, C_1_1_0_MASK, C_1_1_1_MASK };
> +
> +	/* clear and loop for all possibilities of [c4 c3 c2] */
> +	norm_addr &= C_0_0_0_MASK;
> +
> +	for (column = 0; column < ARRAY_SIZE(column_masks); column++) {
> +		retire_addr = norm_addr | column_masks[column];
> +
> +		if (norm_to_sys_addr(df_acc_id, socket_id, die_id, cs_inst_id, &retire_addr))
> +			return -EINVAL;

Why return if a single translation fails? What if the other seven can 
succeed? Wouldn't it be better to find and offline 7/8 possible bad 
addresses than 0/8?

> +		*(spa_addr + column) = retire_addr;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Find any duplicate addresses in all combination of column address */
> +static void amd_umc_find_duplicate_spa(u64 arr[], int *size)
> +{
> +	int i, j, k;
> +
> +	/* use nested for loop to find the duplicate elements in array */
> +	for (i = 0; i < *size; i++) {
> +		for (j = i + 1; j < *size; j++) {
> +			/* check duplicate element */
> +			if (arr[i] == arr[j]) {
> +				/* delete the current position of the duplicate element */
> +				for (k = j; k < (*size - 1); k++)
> +					arr[k] = arr[k + 1];
> +
> +			/* decrease the size of array after removing duplicate element */
> +				(*size)--;
> +
> +			/* if the position of the elements is changes, don't increase index j */
> +				j--;
> +			}
> +		}
> +	}
> +}

Is it really necessary to de-duplicate this array?

This data is discarded after the page retirement step. So checking for 
duplicates can be done there.

> +
> +int identify_poison_pages_retire_row(struct mce *m)
> +{
> +	int i, ret, addr_range;
> +	unsigned long pfn;
> +	u64 col[MAX_COLUMNS_IN_HBM_ROW];
> +	u64 *spa_addr = col;
> +

Just use "col[]"; *spa_addr is not needed.

Also, please order variable declarations from longest to shortest by 
line length.

> +	/* Identify all pages in a row */

Comment is not needed.

> +	pr_info("Identify all physical Pages in a row for MCE addr:0x%llx\n", m->addr);
> +	ret = amd_umc_identify_pages_in_row(m, spa_addr);
> +	if (!ret) {

If this succeeds, then you print info. And if it fails, then you don't 
print, but continue to process information. This doesn't seem correct.

> +		for (i = 0; i < MAX_COLUMNS_IN_HBM_ROW; i++)
> +			pr_info("col[%d]_addr:0x%llx ", i, spa_addr[i]);
> +	}
> +	/* Find duplicate entries from all 8 physical addresses in a row */
> +	addr_range = ARRAY_SIZE(col);

You already know the array size; you defined it above.

> +	amd_umc_find_duplicate_spa(spa_addr, &addr_range);
> +	/* do page retirement on all system physical addresses */
> +	for (i = 0; i < addr_range; i++) {

You can check for duplicates here. If a value is a duplicate, then 
continue the loop.

> +		pfn = PHYS_PFN(spa_addr[i]);
> +		memory_failure(pfn, 0);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(identify_poison_pages_retire_row);

A namespace prefix is needed for exported functions, i.e. amd_atl_* or 
similar.

Thanks,
Yazen