[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b7f3639a-e46c-25e8-270b-04860074fd3c@amd.com>
Date: Thu, 4 Nov 2021 18:48:29 +0530
From: "Chatradhi, Naveen Krishna" <nchatrad@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, x86@...nel.org,
linux-kernel@...r.kernel.org, mingo@...hat.com, mchehab@...nel.org,
yazen.ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on
Aldebaran
Hi Boris,
On 11/2/2021 11:33 PM, Borislav Petkov wrote:
> [CAUTION: External Email]
>
> On Thu, Oct 28, 2021 at 06:31:02PM +0530, Naveen Krishna Chatradhi wrote:
>
> Staring at this more...
Thanks for taking the time.
>
>> +/*
>> + * Newer AMD CPUs and GPUs whose data fabrics can be connected via custom xGMI
>> + * links, comes with registers to gather local and remote node type map info.
>> + *
>> + * "Local Node Type" refers to nodes with the same type as that from which the
>> + * register is read, and "Remote Node Type" refers to nodes with a different type.
>> + *
>> + * This function, reads the registers from GPU DF function 1.
>> + * Hence, local nodes are GPU and remote nodes are CPUs.
>> + */
>> +static int amd_get_node_map(void)
> ... so this is a generic function name...
>
>> +{
>> + struct amd_node_map *nodemap;
>> + struct pci_dev *pdev;
>> + u32 tmp;
>> +
>> + pdev = pci_get_device(PCI_VENDOR_ID_AMD,
>> + PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F1, NULL);
> ... but this here is trying to get the Aldebaran PCI device function.
I know, this is confusion. we will try to give a meaning for definition
here.
>
> So what happens if in the future, the GPU is a different one and it
> gets RAS functionality and PCI device functions too? You'd probably need
> to add that new GPU support too.
Yes, might happen
>
> And then looking at that patch again, see how this new code is bolted on
> and sure, it all is made to work, but it is strenuous and you have to
> always pay attention to what type of devices you're dealing with.
>
> And the next patch does:
>
> ... if (bank_type == SMCA_UMC_V2) {
>
> /* do UMC v2 special stuff here. */
>
> which begs the question: wouldn't this GPU PCI devices enumeration be a
> lot cleaner if it were separate?
>
> I.e., in amd_nb.c you'd have
>
> init_amd_nbs:
>
> amd_cache_northbridges();
> amd_cache_gart();
> amd_cache_gpu_devices();
Agreed. however, a slight modification to the suggestion
Instead of modifying the init_amd_nbs()
How about, defining a new struct
+struct system_topology {
+ const struct pci_device_id *misc_ids;
+ const struct pci_device_id *link_ids;
+ const struct pci_device_id *root_ids;
+ u16 roots_per_misc;
+ u16 misc_count;
+ u16 root_count;
+};
and modifying the amd_cache_northbridges() to
+int amd_cache_northbridges(void)
+{
+ struct system_toplogy topo;
+ int ret;
+
+ if (amd_northbridges.num)
+ return 0;
+
+ ret = amd_cpu_nbs(&topo);
+ printk("==> misc:%d\n", ret);
+
+ if (look_for_remote_nodes()) {
+ ret = amd_gpu_nbs(&topo);
+ printk("==> gpu_misc:%d\n", ret);
+ }
+
+ get_next_northbridges(&topo);
This way, creating appropriate number MCs under EDAC and existing
exported APIs can remain the same.
Let me know your thoughts on this. I can send an updated version with
your comments addressed.
>
> and in this last one you do your enumeration. Completely separate data
> structures and all. Adding a new device support would then be trivial.
>
> And then looking at the next patch again, you have:
>
> + } else if (bank_type == SMCA_UMC_V2) {
> + /*
> + * SMCA_UMC_V2 exists on GPU nodes, extract the node id
> + * from register MCA_IPID[47:44](InstanceIdHi).
> + * The InstanceIdHi field represents the instance ID of the GPU.
> + * Which needs to be mapped to a value used by Linux,
> + * where GPU nodes are simply numerically after the CPU nodes.
> + */
> + node_id = ((m->ipid >> 44) & 0xF) -
> + amd_gpu_node_start_id() + amd_cpu_node_count();
>
> where instead of exporting those functions and having the caller do the
> calculations, you'd have a function in amd_nb.c which is called
>
> amd_get_gpu_node_id(unsigned long ipid)
>
> which will use those separate data structures mentioned above and give
> you the node id.
Sure, we can modify this way.
>
> And those GPU node IDs are placed numerically after the CPU nodes so
> your code doesn't need to do anything special - just read out registers
> and cache them.
>
> And you don't need those exports either - it is all nicely encapsulated
> and a single function is used to get the callers what they wanna know.
Got it, thank you.
>
> Hmmm?
>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Cdd5b3586178441f4886808d99e2b1ef3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637714730331703852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=oXDojOFqVVhxn4P1tgwLycaJgc2rvwo8EoUj3i971Mw%3D&reserved=0
Powered by blists - more mailing lists