Message-ID: <90642509-81af-51f2-8942-c90432680fa2@amd.com>
Date: Mon, 15 Nov 2021 20:54:55 +0530
From: "Chatradhi, Naveen Krishna" <nchatrad@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, x86@...nel.org,
linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU
nodes
Hi Boris
On 11/11/2021 6:42 PM, Borislav Petkov wrote:
>
> On Thu, Oct 28, 2021 at 06:31:06PM +0530, Naveen Krishna Chatradhi wrote:
>> On newer heterogeneous systems with AMD CPUs, the data fabrics of the GPUs
>> are connected directly via custom links.
>>
>> On one such system, where Aldebaran GPU nodes are connected to
>> Family 19h Model 30h CPU nodes, the Aldebaran GPUs can report
>> memory errors via SMCA banks.
>>
>> Aldebaran GPU support was added to DRM framework
>> https://lists.freedesktop.org/archives/amd-gfx/2021-February/059694.html
>>
>> The GPU nodes come with built-in HBM2 memory, ECC support is enabled by
>> default, and the UMCs on the GPU nodes are different from the UMCs on CPU nodes.
>>
>> GPU specific ops routines are defined to extend the amd64_edac
>> module to enumerate HBM memory leveraging the existing edac and the
>> amd64 specific data structures.
>>
>> Note: The UMC Phys on GPU nodes are enumerated as csrows and the UMC
>> channels connected to HBM banks are enumerated as ranks.
> For all your future commit messages, from
> Documentation/process/submitting-patches.rst:
>
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> to do frotz", as if you are giving orders to the codebase to change
> its behaviour."
>
> Also, do not talk about what your patch does - that should hopefully be
> visible in the diff itself. Rather, talk about *why* you're doing what
> you're doing.
Sure.
>
>> Cc: Yazen Ghannam <yazen.ghannam@....com>
>> Co-developed-by: Muralidhara M K <muralimk@....com>
>> Signed-off-by: Muralidhara M K <muralimk@....com>
>> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@....com>
>> Link: https://lkml.kernel.org/r/20210823185437.94417-4-nchatrad@amd.com
>> ---
>> Changes since v5:
>> Removed else condition in per_family_init for 19h family
>>
>> Changes since v4:
>> Split "f17_addr_mask_to_cs_size" instead as did in 3rd patch earlier
>>
>> Changes since v3:
>> 1. Bifurcated the GPU code from v2
>>
>> Changes since v2:
>> 1. Restored line deletions and handled minor comments
>> 2. Modified commit message and some of the function comments
>> 3. variable df_inst_id is introduced instead of umc_num
>>
>> Changes since v1:
>> 1. Modified the commit message
>> 2. Change the edac_cap
>> 3. kept sizes of both cpu and noncpu together
>> 4. return success if the !F3 condition true and remove unnecessary validation
>>
>> drivers/edac/amd64_edac.c | 298 +++++++++++++++++++++++++++++++++-----
>> drivers/edac/amd64_edac.h | 27 ++++
>> 2 files changed, 292 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>> index 7953ffe9d547..b404fa5b03ce 100644
>> --- a/drivers/edac/amd64_edac.c
>> +++ b/drivers/edac/amd64_edac.c
>> @@ -1121,6 +1121,20 @@ static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
>> }
>> }
>>
>> +static void debug_display_dimm_sizes_gpu(struct amd64_pvt *pvt, u8 ctrl)
>> +{
>> + int size, cs = 0, cs_mode;
>> +
>> + edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
>> +
>> + cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
>> +
>> + for_each_chip_select(cs, ctrl, pvt) {
>> + size = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs);
>> + amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
>> + }
>> +}
>> +
>> static void __dump_misc_regs_df(struct amd64_pvt *pvt)
>> {
>> struct amd64_umc *umc;
>> @@ -1165,6 +1179,27 @@ static void __dump_misc_regs_df(struct amd64_pvt *pvt)
>> pvt->dhar, dhar_base(pvt));
>> }
>>
>> +static void __dump_misc_regs_gpu(struct amd64_pvt *pvt)
> The function pointer this gets assigned to is called *get*_misc_regs.
> But this function is doing *dump*.
>
> When I see __dump_misc_regs_gpu() I'd expect this function to be called
> by a higher level dump_misc_regs() as the "__" denotes layering usually.
>
> There is, in fact, dump_misc_regs() but that one calls
> ->get_misc_regs().
>
> So this all needs fixing - right now I see a mess.
>
> Also, there's <function_name>_gpu and gpu_<function_name> with the "gpu"
> being either prefix or suffix. You need to fix the current ones to be
> only prefixes - in a pre-patch - and then add yours as prefixes only too.
I will modify the names to use prefixes.
>
> And in talking about pre-patches - this one is doing a bunch of things
> at once and needs splitting. You do the preparatory work like carving
> out common functionality and other refactoring in pre-patches, and then
> you add the new functionality ontop.
Sure, I will split this.
>
> Also, I don't like ->is_gpu one bit, it being sprinkled like that around
> the code. This says that the per-family attrs and ops assignment is
> lacking.
Yes, adding a few more ops routines should help get rid of these if conditions.
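Roughly, something along these lines (only a sketch; the member and
function names here are placeholders, not necessarily what the next
revision will use):

	/*
	 * Sketch: give the GPU nodes their own ops table so the common
	 * code only calls pvt->ops->... and never tests is_gpu.
	 */
	struct low_ops {
		void (*get_misc_regs)(struct amd64_pvt *pvt);
		int  (*dbam_to_cs)(struct amd64_pvt *pvt, u8 dct,
				   unsigned int cs_mode, int csrow_nr);
		/* ... */
	};

	static struct low_ops gpu_ops = {
		.get_misc_regs	= gpu_get_misc_regs,
		.dbam_to_cs	= gpu_addr_mask_to_cs_size,
	};

	/* per_family_init() then simply does: pvt->ops = &gpu_ops; */

That should keep decode_umc_error() and the dump/debug paths free of
per-node special casing.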
>
>
>> @@ -2982,7 +3132,17 @@ static void decode_umc_error(int node_id, struct mce *m)
>>
>> pvt->ops->get_umc_err_info(m, &err);
>>
>> - if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
>> + /*
>> + * GPU node has #phys[X] which has #channels[Y] each.
>> + * On GPUs, df_inst_id = [X] * num_ch_per_phy + [Y].
>> + * On CPUs, "Channel"="UMC Number"="DF Instance ID".
>> + */
> This comment doesn't look like it is destined for human beings to read
> but maybe to be parsed by programs?
The present system supports the following setup:
A GPU node includes four Unified Memory Controllers (UMC). Each UMC
contains eight UMCCH instances. Each UMCCH controls one 128-bit HBM2e
(2GB) channel. The UMC interfaces to the system via the Data Fabric
(DF) Coherent Slave.
To keep the function generic, I will modify the comment to read:
"A GPU node has 'X' UMC PHYs with 'Y' UMCCH instances each. This
results in 'X*Y' instances in the Data Fabric. Therefore, the Data
Fabric ID of an instance can be found with the following formula:
df_inst_id = 'X' * number of channels per PHY + 'Y'"
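For clarity, the lookup boils down to something like the below
(illustrative only; the constant and the helper name are made up, the
counts are the ones on this system):

	/* Example: 4 UMC PHYs per GPU node, 8 UMCCH instances per PHY. */
	#define NUM_CHANNELS_PER_PHY	8

	static inline unsigned int gpu_df_inst_id(unsigned int phy,
						  unsigned int channel)
	{
		/* df_inst_id = PHY index * channels per PHY + channel index */
		return phy * NUM_CHANNELS_PER_PHY + channel;
	}

e.g. PHY 2, channel 5 gives df_inst_id = 2 * 8 + 5 = 21.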
>
>> + if (pvt->is_gpu)
>> + df_inst_id = (err.csrow * pvt->channel_count / mci->nr_csrows) + err.channel;
> I'm not sure we want to log ECCs from the GPUs: what is going to happen
> to them, how does the further error recovery look like?
The AMD GPUs support memory error reporting via the SMCA register
banks; the functionality is similar to that of the CPU dies.
End users of this product want to count memory errors, and they want
to do so in the OS using the same mechanisms they are used to on the
CPU side.
>
> Also, EDAC sysfs structure currently has only DIMMs, below's from my
> box.
>
> I don't see how that structure fits the GPUs so here's the $10^6
> question: why does EDAC need to know about GPUs?
The errors are not specific to the GPUs themselves; they originate
from the HBM2e memory chips on the GPU nodes.
As a first step, I'm trying to leverage the existing EDAC interfaces
to report the HBM errors.
Page retirement and storing the bad-page information in persistent
storage can be the next steps.
>
> What is the strategy here?
>
> Your 0th message talks about the "what" but what gets added is not
> important - the "why" is.
Sure, I will update it with the why.
>
> $ grep -r . /sys/devices/system/edac/mc/ 2>/dev/null
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> /sys/devices/system/edac/mc/power/runtime_suspended_time:0
> /sys/devices/system/edac/mc/power/control:auto
> /sys/devices/system/edac/mc/mc0/ce_count:0
> /sys/devices/system/edac/mc/mc0/rank7/dimm_ue_count:0
> /sys/devices/system/edac/mc/mc0/rank7/dimm_mem_type:Unbuffered-DDR4
> /sys/devices/system/edac/mc/mc0/rank7/power/runtime_active_time:0
> /sys/devices/system/edac/mc/mc0/rank7/power/runtime_status:unsupported
> /sys/devices/system/edac/mc/mc0/rank7/power/runtime_suspended_time:0
> /sys/devices/system/edac/mc/mc0/rank7/power/control:auto
> /sys/devices/system/edac/mc/mc0/rank7/dimm_dev_type:Unknown
> /sys/devices/system/edac/mc/mc0/rank7/size:8192
> /sys/devices/system/edac/mc/mc0/rank7/dimm_ce_count:0
> /sys/devices/system/edac/mc/mc0/rank7/dimm_label:mc#0csrow#3channel#1
> /sys/devices/system/edac/mc/mc0/rank7/dimm_location:csrow 3 channel 1
> /sys/devices/system/edac/mc/mc0/rank7/dimm_edac_mode:SECDED
> /sys/devices/system/edac/mc/mc0/topmem:0x00000000e0000000
> /sys/devices/system/edac/mc/mc0/mc_name:F17h
> /sys/devices/system/edac/mc/mc0/rank5/dimm_ue_count:0
> /sys/devices/system/edac/mc/mc0/rank5/dimm_mem_type:Unbuffered-DDR4
> /sys/devices/system/edac/mc/mc0/rank5/power/runtime_active_time:0
> /sys/devices/system/edac/mc/mc0/rank5/power/runtime_status:unsupported
> /sys/devices/system/edac/mc/mc0/rank5/power/runtime_suspended_time:0
> /sys/devices/system/edac/mc/mc0/rank5/power/control:auto
> /sys/devices/system/edac/mc/mc0/rank5/dimm_dev_type:Unknown
> /sys/devices/system/edac/mc/mc0/rank5/size:8192
> /sys/devices/system/edac/mc/mc0/rank5/dimm_ce_count:0
> /sys/devices/system/edac/mc/mc0/rank5/dimm_label:mc#0csrow#2channel#1
> /sys/devices/system/edac/mc/mc0/rank5/dimm_location:csrow 2 channel 1
> /sys/devices/system/edac/mc/mc0/rank5/dimm_edac_mode:SECDED
> /sys/devices/system/edac/mc/mc0/dram_hole:0 0 0
> /sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
> ...
>
> --
> Regards/Gruss,
> Boris.
Regards,
Naveen
>
> https://people.kernel.org/tglx/notes-about-netiquette