linux-kernel - Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran

Message-ID: <b7f3639a-e46c-25e8-270b-04860074fd3c@amd.com>
Date:   Thu, 4 Nov 2021 18:48:29 +0530
From:   "Chatradhi, Naveen Krishna" <nchatrad@....com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     linux-edac@...r.kernel.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
        yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on
 Aldebaran

Hi Boris,

On 11/2/2021 11:33 PM, Borislav Petkov wrote:
> On Thu, Oct 28, 2021 at 06:31:02PM +0530, Naveen Krishna Chatradhi wrote:
>
> Staring at this more...
Thanks for taking the time.
>
>> +/*
>> + * Newer AMD CPUs and GPUs whose data fabrics can be connected via custom xGMI
>> + * links, comes with registers to gather local and remote node type map info.
>> + *
>> + * "Local Node Type" refers to nodes with the same type as that from which the
>> + * register is read, and "Remote Node Type" refers to nodes with a different type.
>> + *
>> + * This function, reads the registers from GPU DF function 1.
>> + * Hence, local nodes are GPU and remote nodes are CPUs.
>> + */
>> +static int amd_get_node_map(void)
> ... so this is a generic function name...
>
>> +{
>> +     struct amd_node_map *nodemap;
>> +     struct pci_dev *pdev;
>> +     u32 tmp;
>> +
>> +     pdev = pci_get_device(PCI_VENDOR_ID_AMD,
>> +                           PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F1, NULL);
> ... but this here is trying to get the Aldebaran PCI device function.
I know, this is confusing. We will try to make the definition here more
meaningful.
>
>   So what happens if in the future, the GPU is a different one and it
> gets RAS functionality and PCI device functions too? You'd probably need
> to add that new GPU support too.
Yes, that might happen.
>
> And then looking at that patch again, see how this new code is bolted on
> and sure, it all is made to work, but it is strenuous and you have to
> always pay attention to what type of devices you're dealing with.
>
> And the next patch does:
>
>          ... if (bank_type == SMCA_UMC_V2) {
>
>          /* do UMC v2 special stuff here. */
>
> which begs the question: wouldn't this GPU PCI devices enumeration be a
> lot cleaner if it were separate?
>
> I.e., in amd_nb.c you'd have
>
> init_amd_nbs:
>
>          amd_cache_northbridges();
>          amd_cache_gart();
>          amd_cache_gpu_devices();

Agreed. However, I would suggest a slight modification: instead of
modifying init_amd_nbs(), how about defining a new struct

+struct system_topology {
+       const struct pci_device_id *misc_ids;
+       const struct pci_device_id *link_ids;
+       const struct pci_device_id *root_ids;
+       u16 roots_per_misc;
+       u16 misc_count;
+       u16 root_count;
+};

and modifying amd_cache_northbridges() to

+int amd_cache_northbridges(void)
+{
+        struct system_topology topo;
+        int ret;
+
+        if (amd_northbridges.num)
+                return 0;
+
+        ret = amd_cpu_nbs(&topo);
+        printk("==> misc:%d\n", ret);
+
+        if (look_for_remote_nodes()) {
+                ret = amd_gpu_nbs(&topo);
+                printk("==> gpu_misc:%d\n", ret);
+        }
+
+        get_next_northbridges(&topo);

This way, the appropriate number of MCs can be created under EDAC, and the
existing exported APIs can remain the same.

Let me know your thoughts on this. I can send an updated version with 
your comments addressed.
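
To make the intended flow concrete, here is a minimal userspace sketch of
the caching step, not kernel code. It assumes that the proposed
amd_cpu_nbs() and amd_gpu_nbs() each yield a device count and that
look_for_remote_nodes() reduces to a boolean; the names node_model,
nb_cache_model and cache_nodes_model are hypothetical and exist only for
this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

enum node_kind { NODE_CPU, NODE_GPU };

/* Hypothetical stand-in for one cached northbridge entry. */
struct node_model {
	enum node_kind kind;
	uint16_t linux_id;	/* index in the combined array */
};

/* Hypothetical stand-in for the amd_northbridges cache. */
struct nb_cache_model {
	struct node_model *nodes;
	uint16_t num;
};

/*
 * Build the combined node array once. cpu_count and gpu_count stand in
 * for the results of the proposed amd_cpu_nbs() and amd_gpu_nbs() calls
 * (an assumption). Returns 0 on success, -1 on failure.
 */
static int cache_nodes_model(struct nb_cache_model *cache, uint16_t cpu_count,
			     uint16_t gpu_count, bool has_remote_nodes)
{
	uint16_t total, i;

	if (cache->num)		/* already cached, nothing to do */
		return 0;

	if (!cpu_count)		/* no CPU nodes found */
		return -1;

	if (!has_remote_nodes)	/* skip GPU enumeration entirely */
		gpu_count = 0;

	total = (uint16_t)(cpu_count + gpu_count);
	cache->nodes = calloc(total, sizeof(*cache->nodes));
	if (!cache->nodes)
		return -1;

	/* CPU nodes come first; GPU nodes follow them numerically. */
	for (i = 0; i < total; i++) {
		cache->nodes[i].kind = i < cpu_count ? NODE_CPU : NODE_GPU;
		cache->nodes[i].linux_id = i;
	}

	cache->num = total;
	return 0;
}
```

With two CPU nodes and four GPU nodes, the model ends up with six entries,
and a second call returns early without touching the cache.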

>
> and in this last one you do your enumeration. Completely separate data
> structures and all. Adding a new device support would then be trivial.
>
> And then looking at the next patch again, you have:
>
> +               } else if (bank_type == SMCA_UMC_V2) {
> +                       /*
> +                        * SMCA_UMC_V2 exists on GPU nodes, extract the node id
> +                        * from register MCA_IPID[47:44](InstanceIdHi).
> +                        * The InstanceIdHi field represents the instance ID of the GPU.
> +                        * Which needs to be mapped to a value used by Linux,
> +                        * where GPU nodes are simply numerically after the CPU nodes.
> +                        */
> +                       node_id = ((m->ipid >> 44) & 0xF) -
> +                                  amd_gpu_node_start_id() + amd_cpu_node_count();
>
> where instead of exporting those functions and having the caller do the
> calculations, you'd have a function in amd_nb.c which is called
>
>          amd_get_gpu_node_id(unsigned long ipid)
>
> which will use those separate data structures mentioned above and give
> you the node id.
Sure, we can modify this way.
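
As a rough illustration of such a helper, here is a minimal userspace
sketch, not the final kernel code. The bit extraction and the arithmetic
come from the hunk quoted above; struct gpu_node_info, its fields and the
name gpu_node_id_from_ipid are hypothetical, and the range check is an
added assumption about how an unexpected value might be handled.

```c
#include <stdint.h>

/*
 * Hypothetical cached values; in the kernel they would be filled in
 * once during GPU device enumeration.
 */
struct gpu_node_info {
	uint8_t gpu_node_start_id;	/* first hardware instance ID used by GPUs */
	uint8_t cpu_node_count;		/* number of CPU nodes in the system */
};

/*
 * Map MCA_IPID[47:44] (InstanceIdHi) to a Linux node ID, where GPU nodes
 * are numbered directly after the CPU nodes. Returns -1 if the instance
 * ID lies below the GPU range.
 */
static int gpu_node_id_from_ipid(const struct gpu_node_info *info,
				 uint64_t ipid)
{
	unsigned int instance_hi = (unsigned int)((ipid >> 44) & 0xF);

	if (instance_hi < info->gpu_node_start_id)
		return -1;

	return (int)(instance_hi - info->gpu_node_start_id +
		     info->cpu_node_count);
}
```

The caller then only passes m->ipid and does not need to know the start ID
or the CPU node count.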
>
> And those GPU node IDs are placed numerically after the CPU nodes so
> your code doesn't need to do anything special - just read out registers
> and cache them.
>
> And you don't need those exports either - it is all nicely encapsulated
> and a single function is used to get the callers what they wanna know.
Got it, thank you.
>
> Hmmm?
>
> --
> Regards/Gruss,
>      Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette