[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <f9e2d199-1813-4b4a-83fd-bf93919a3411@amd.com>
Date: Thu, 26 Oct 2023 10:31:04 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Muralidhara M K <muralimk@....com>, linux-edac@...r.kernel.org,
x86@...nel.org
Cc: yazen.ghannam@....com, linux-kernel@...r.kernel.org, bp@...en8.de,
mchehab@...nel.org, Muralidhara M K <muralidhara.mk@....com>
Subject: Re: [PATCH 7/7] EDAC/amd64: RAS: platform/x86/amd: Identify all
physical pages in row
On 10/25/2023 3:33 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@....com>
The $SUBJECT needs to be updated.
>
> AMD systems have HBM memory embedded with the chips, The entire memory
> is managed by host OS. Error containment needs to be reliable, because
> HBM memory cannot be replaced.
>
> Persist all UMC DRAM ECC errors, the OS can make the bad or poisoned page
> state persistent so that it will not use the memory upon the next boot.
>
There is nothing in this patch regarding persistence. It finds all
system physical addresses covering a DRAM row and request their pages to
be retired.
> The reported MCA error address in HBM in the format PC/SID/Bank/ROW/COL
> For example, In MI300A C1/C0 (column bits 1-0) is at SPA bit 6-5. Assuming
> PFN only looks at SPA bit 12 or higher, column bits 1-0 could be skipped.
> For PFN, SPA bits higher or equal than 12 matters. So column bits c2, c3
> and c4 gives 8 possible combination of addresses in a row.
>
> So, Identify all physical pages in a HBM row and retire all the pages
> to get rid of intermittent or recurrent memory errors.
>
There are a number of grammatical errors in the commit message. Please
fix them.
> Signed-off-by: Muralidhara M K <muralidhara.mk@....com>
> ---
> drivers/edac/amd64_edac.c | 5 ++
> drivers/ras/amd/atl/umc.c | 103 ++++++++++++++++++++++++++++++++++++++
> include/linux/amd-atl.h | 2 +
> 3 files changed, 110 insertions(+)
>
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 79c6c552ee14..d0db11e19a46 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2838,6 +2838,11 @@ static void decode_umc_error(int node_id, struct mce *m)
>
> error_address_to_page_and_offset(sys_addr, &err);
>
> + if (pvt->fam == 0x19 && (pvt->model >= 0x90 && pvt->model <= 0x9f)) {
> + if (identify_poison_pages_retire_row(m))
> + return;
EDAC can still log the original error. So why return early here?
> + }
> +
> log_error:
> __log_ecc_error(mci, &err, ecc_type);
> }
> diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
> index 52247a7949fb..d31ad7680ff1 100644
> --- a/drivers/ras/amd/atl/umc.c
> +++ b/drivers/ras/amd/atl/umc.c
> @@ -255,3 +255,106 @@ int umc_mca_addr_to_sys_addr(struct mce *m, u64 *sys_addr)
> return 0;
> }
> EXPORT_SYMBOL_GPL(umc_mca_addr_to_sys_addr);
> +
> +/*
> + * High Bandwidth Memory (HBM v3) has fixed number of columns in a
> + * row (8 columns in one HBM row).
> + * Extract column bits to find all the combination of masks to retire
> + * all the poison pages in a row.
> + */
> +#define MAX_COLUMNS_IN_HBM_ROW 8
Is this true in general? Or is it specific to a product?
> +
> +/* The C2 bit in CH NA address */
> +#define UMC_NA_C2_BIT BIT(8)
> +/* The C3 bit in CH NA address */
> +#define UMC_NA_C3_BIT BIT(9)
> +/* The C4 bit in CH NA address */
> +#define UMC_NA_C4_BIT BIT(14)
> +
C2/C3/C4 is unclear in the comments. Please expand these to be more
explicit, like Column 2 bit, etc.
> +/* masks to get all possible combinations of column addresses */
> +#define C_1_1_1_MASK (UMC_NA_C4_BIT | UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_1_1_0_MASK (UMC_NA_C4_BIT | UMC_NA_C3_BIT)
> +#define C_1_0_1_MASK (UMC_NA_C4_BIT | UMC_NA_C2_BIT)
> +#define C_1_0_0_MASK (UMC_NA_C4_BIT)
> +#define C_0_1_1_MASK (UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_0_1_0_MASK (UMC_NA_C3_BIT)
> +#define C_0_0_1_MASK (UMC_NA_C2_BIT)
> +#define C_0_0_0_MASK ~C_1_1_1_MASK
> +
> +/* Identify all combination of column address physical pages in a row */
This comment is not clear to me.
> +static int amd_umc_identify_pages_in_row(struct mce *m, u64 *spa_addr)
Also, this function does not identify pages. It identifies system
physical addresses.
Additionally, this function seems very much specific to this
implementation of HBM3. So the function name should indicate this.
> +{
> + u8 cs_inst_id = get_cs_inst_id(m);
> + u8 socket_id = get_socket_id(m);
> + u64 norm_addr = get_norm_addr(m);
> + u8 die_id = get_die_id(m);
> + u16 df_acc_id = get_df_acc_id(m);
> +
> + u64 retire_addr, column;
> + u64 column_masks[] = { 0, C_0_0_1_MASK, C_0_1_0_MASK, C_0_1_1_MASK,
> + C_1_0_0_MASK, C_1_0_1_MASK, C_1_1_0_MASK, C_1_1_1_MASK };
> +
> + /* clear and loop for all possibilities of [c4 c3 c2] */
> + norm_addr &= C_0_0_0_MASK;
> +
> + for (column = 0; column < ARRAY_SIZE(column_masks); column++) {
> + retire_addr = norm_addr | column_masks[column];
> +
> + if (norm_to_sys_addr(df_acc_id, socket_id, die_id, cs_inst_id, &retire_addr))
> + return -EINVAL;
Why return if a single translation fails? What if the other seven can
succeed? Wouldn't it be better to find and offline 7/8 possible bad
addresses than 0/8?
> + *(spa_addr + column) = retire_addr;
> + }
> +
> + return 0;
> +}
> +
> +/* Find any duplicate addresses in all combination of column address */
> +static void amd_umc_find_duplicate_spa(u64 arr[], int *size)
> +{
> + int i, j, k;
> +
> + /* use nested for loop to find the duplicate elements in array */
> + for (i = 0; i < *size; i++) {
> + for (j = i + 1; j < *size; j++) {
> + /* check duplicate element */
> + if (arr[i] == arr[j]) {
> + /* delete the current position of the duplicate element */
> + for (k = j; k < (*size - 1); k++)
> + arr[k] = arr[k + 1];
> +
> + /* decrease the size of array after removing duplicate element */
> + (*size)--;
> +
> + /* if the position of the elements is changes, don't increase index j */
> + j--;
> + }
> + }
> + }
> +}
Is it really necessary to de-duplicate this array?
This data is discarded after the page retirement step. So checking for
duplicates can be done there.
> +
> +int identify_poison_pages_retire_row(struct mce *m)
> +{
> + int i, ret, addr_range;
> + unsigned long pfn;
> + u64 col[MAX_COLUMNS_IN_HBM_ROW];
> + u64 *spa_addr = col;
> +
Just use "col[]"; *spa_addr is not needed.
Also, please order variable declarations from longest to shortest by
line length.
> + /* Identify all pages in a row */
Comment is not needed.
> + pr_info("Identify all physical Pages in a row for MCE addr:0x%llx\n", m->addr);
> + ret = amd_umc_identify_pages_in_row(m, spa_addr);
> + if (!ret) {
If this succeeds, then you print info. And if it fails, then you don't
print, but continue to process information. This doesn't seem correct.
> + for (i = 0; i < MAX_COLUMNS_IN_HBM_ROW; i++)
> + pr_info("col[%d]_addr:0x%llx ", i, spa_addr[i]);
> + }
> + /* Find duplicate entries from all 8 physical addresses in a row */
> + addr_range = ARRAY_SIZE(col);
You already know the array size; you defined it above.
> + amd_umc_find_duplicate_spa(spa_addr, &addr_range);
> + /* do page retirement on all system physical addresses */
> + for (i = 0; i < addr_range; i++) {
You can check for duplicates here. If a value is a duplicate, then
continue the loop.
> + pfn = PHYS_PFN(spa_addr[i]);
> + memory_failure(pfn, 0);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(identify_poison_pages_retire_row);
A namespace prefix is needed for exported functions, i.e. amd_atl_* or
similar.
Thanks,
Yazen
Powered by blists - more mailing lists