Open Source and information security mailing list archives
Date: Mon, 11 Oct 2021 19:56:34 +0530
From: "Chatradhi, Naveen Krishna" <nchatrad@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, x86@...nel.org, linux-kernel@...r.kernel.org,
	mchehab@...nel.org, Yazen.Ghannam@....com, Muralidhara M K <muralimk@....com>
Subject: Re: [PATCH v3 1/3] x86/amd_nb: Add support for northbridges on Aldebaran

Hi Boris,

Apologies for the late reply. We were modifying the code based on the
conversation between yourself and Yazen. I would like to answer some of
your questions before submitting the next version of the patch set.

On 8/25/2021 4:12 PM, Borislav Petkov wrote:
> [CAUTION: External Email]
>
> On Tue, Aug 24, 2021 at 12:24:35AM +0530, Naveen Krishna Chatradhi wrote:
>> From: Muralidhara M K <muralimk@....com>
>>
>> On newer systems the CPUs manage MCA errors reported from the GPUs.
>> Enumerate the GPU nodes with the AMD NB framework to support EDAC.
>>
>> This patch adds necessary code to manage the Aldebaran nodes along with
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
>
> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.

Sure, thanks.

> Also, what are the "Aldebaran nodes"?
>
> Something on a star which is 65 light years away?

Aldebaran is an AMD GPU name. The code submitted as
[PATCH 000/159] Aldebaran support
<https://lists.freedesktop.org/archives/amd-gfx/2021-February/059694.html>
is part of the DRM framework.

>> the CPU nodes.
>>
>> The GPU nodes are enumerated in sequential order based on the
>> PCI hierarchy, and the first GPU node is assumed to have an "AMD Node
>> ID" value of 8 (the second GPU node has 9, etc.).
> What does that mean? The GPU nodes are simply numerically after the CPU
> nodes or how am I to understand this nomenclature?
>
>> Each Aldebaran GPU
>> package has 2 Data Fabrics, which are enumerated as 2 nodes.
>> With this implementation detail, the Data Fabric on the GPU nodes can be
>> accessed the same way as the Data Fabric on CPU nodes.
>>
>> Signed-off-by: Muralidhara M K <muralimk@....com>
>> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@....com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@....com>
>> ---
>> Changes since v2: Added Reviewed-by Yazen Ghannam
>>
>>  arch/x86/include/asm/amd_nb.h | 10 ++++++
>>  arch/x86/kernel/amd_nb.c      | 63 ++++++++++++++++++++++++++++++++---
>>  include/linux/pci_ids.h       |  1 +
>>  3 files changed, 69 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h
>> index 455066a06f60..09905f6c7218 100644
>> --- a/arch/x86/include/asm/amd_nb.h
>> +++ b/arch/x86/include/asm/amd_nb.h
>> @@ -80,6 +80,16 @@ struct amd_northbridge_info {
>>
>>  #ifdef CONFIG_AMD_NB
>>
>> +/*
>> + * On newer heterogeneous systems the data fabrics of the CPUs and GPUs
>> + * are connected directly via a custom links, like is done with
> s/ a //
>
>> + * 2 socket CPU systems and also within a socket for Multi-chip Module
>> + * (MCM) CPUs like Naples.
>> + * The first GPU node(non cpu) is assumed to have an "AMD Node ID" value
> In all your text:
>
> s/cpu/CPU/g
>
>> + * of 8 (the second GPU node has 9, etc.).
>> + */
>> +#define NONCPU_NODE_INDEX	8
> Why is this assumed? Can it instead be read from the hardware somewhere?
> Or there simply won't be more than 8 CPU nodes anyway? Not at least in
> the near future?
>
> I'd prefer stuff to be read out directly from the hardware so that when
> the hardware changes, the code just works instead of doing assumptions
> which get invalidated later.
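The fixed-offset sizing rule being questioned here can be sketched in plain
C. This is only an illustration of the scheme described in the patch
comment; the helper names below are hypothetical, not the kernel's:

```c
#include <assert.h>

/* Sketch of the fixed-offset scheme: the first GPU node is assumed to
 * carry AMD Node ID 8, so the flat northbridge array must reserve
 * slots 0..7 for CPU nodes even when fewer exist. */
#define NONCPU_NODE_INDEX 8

/* Number of amd_northbridge slots to allocate (hypothetical helper). */
static int nb_array_size(int cpu_nodes, int gpu_nodes)
{
	/* With any GPU nodes present, slots 0..7 are reserved for CPUs. */
	return gpu_nodes ? NONCPU_NODE_INDEX + gpu_nodes : cpu_nodes;
}

/* In this scheme the AMD Node ID maps 1:1 onto the array index. */
static int node_id_to_index(int amd_node_id)
{
	return amd_node_id;
}
```

With 4 CPU nodes and 2 GPU nodes this allocates 10 slots and leaves slots
4..7 unused — exactly the tradeoff the patch comment later calls keeping
things "relatively simple".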
>
>> +
>>  u16 amd_nb_num(void);
>>  bool amd_nb_has_feature(unsigned int feature);
>>  struct amd_northbridge *node_to_amd_nb(int node);
>> diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
>> index 23dda362dc0f..6ad5664a18aa 100644
>> --- a/arch/x86/kernel/amd_nb.c
>> +++ b/arch/x86/kernel/amd_nb.c
>> @@ -26,6 +26,8 @@
>>  #define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4 0x1444
>>  #define PCI_DEVICE_ID_AMD_19H_DF_F4	0x1654
>>  #define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4 0x166e
>> +#define PCI_DEVICE_ID_AMD_ALDEBARAN_ROOT	0x14bb
>> +#define PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F4	0x14d4
> You see how those defines are aligned vertically, right?
>
> If your new defines don't fit, you realign them all vertically too - you
> don't just slap them there.
>
> And if it wasn't clear above, that Aldebaran GPU chip name means
> something only to AMD folks. If this is correct
>
> https://en.wikipedia.org/wiki/Video_Core_Next
>
> then that Aldebaran thing is VCN 2.6 although this is only the video
> encoding stuff and GPU I guess is more than that.
>
> IOW, what I'm trying to say is, just like we name the CPUs using their
> families, you should name the GPUs nomenclature with GPU families (I
> guess there's stuff like that) and not use the marketing crap.

The Aldebaran GPU might be a later variant of gfx9; the GPUs are connected
to the CPU sockets via custom xGMI links. I could not find any family
number associated with the GPUs. The DRM driver code uses the name as
follows and does not expose the value to other frameworks in Linux:
+#define CHIP_ALDEBARAN 25

in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm

> If you need an example, here's how we did it for the Intel marketing
> pile of bullsh*t:
>
> arch/x86/include/asm/intel-family.h
>
> Please provide a similar way of referring to the GPU chips.
>
>>  /* Protect the PCI config register pairs used for SMN and DF indirect access. */
>>  static DEFINE_MUTEX(smn_mutex);
>> @@ -94,6 +96,21 @@ static const struct pci_device_id hygon_nb_link_ids[] = {
>>  	{}
>>  };
>>
>> +static const struct pci_device_id amd_noncpu_root_ids[] = {
> Why is that "noncpu" thing everywhere? Is this thing going to be
> anything else besides a GPU?
>
> If not, you can simply call it
>
> amd_gpu_root_ids
>
> to mean *exactly* what they are: PCI IDs on the GPU.
>
>> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_ROOT) },
>> +	{}
>> +};
>> +
>> +static const struct pci_device_id amd_noncpu_nb_misc_ids[] = {
>> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F3) },
>> +	{}
>> +};
>> +
>> +static const struct pci_device_id amd_noncpu_nb_link_ids[] = {
>> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F4) },
>> +	{}
>> +};
>> +
>>  const struct amd_nb_bus_dev_range amd_nb_bus_dev_ranges[] __initconst = {
>>  	{ 0x00, 0x18, 0x20 },
>>  	{ 0xff, 0x00, 0x20 },
>> @@ -230,11 +247,16 @@ int amd_cache_northbridges(void)
>>  	const struct pci_device_id *misc_ids = amd_nb_misc_ids;
>>  	const struct pci_device_id *link_ids = amd_nb_link_ids;
>>  	const struct pci_device_id *root_ids = amd_root_ids;
>> +
>> +	const struct pci_device_id *noncpu_misc_ids = amd_noncpu_nb_misc_ids;
>> +	const struct pci_device_id *noncpu_link_ids = amd_noncpu_nb_link_ids;
>> +	const struct pci_device_id *noncpu_root_ids = amd_noncpu_root_ids;
>> +
>>  	struct pci_dev *root, *misc, *link;
>>  	struct amd_northbridge *nb;
>>  	u16 roots_per_misc = 0;
>> -	u16 misc_count = 0;
>> -	u16 root_count = 0;
>> +	u16 misc_count = 0, misc_count_noncpu = 0;
>> +	u16 root_count = 0, root_count_noncpu = 0;
>>  	u16 i, j;
>>
>>  	if (amd_northbridges.num)
>> @@ -253,10 +275,16 @@ int amd_cache_northbridges(void)
>>  	if (!misc_count)
>>  		return -ENODEV;
>>
>> +	while ((misc = next_northbridge(misc, noncpu_misc_ids)) != NULL)
>> +		misc_count_noncpu++;
>> +
>>  	root = NULL;
>>  	while ((root = next_northbridge(root, root_ids)) != NULL)
>>  		root_count++;
>>
>> +	while ((root = next_northbridge(root, noncpu_root_ids)) != NULL)
>> +		root_count_noncpu++;
>> +
>>  	if (root_count) {
>>  		roots_per_misc = root_count / misc_count;
>>
>> @@ -270,15 +298,28 @@ int amd_cache_northbridges(void)
>>  		}
>>  	}
>>
>> -	nb = kcalloc(misc_count, sizeof(struct amd_northbridge), GFP_KERNEL);
>> +	if (misc_count_noncpu) {
>> +		/*
>> +		 * The first non-CPU Node ID starts at 8 even if there are fewer
>> +		 * than 8 CPU nodes. To maintain the AMD Node ID to Linux amd_nb
>> +		 * indexing scheme, allocate the number of GPU nodes plus 8.
>> +		 * Some allocated amd_northbridge structures will go unused when
>> +		 * the number of CPU nodes is less than 8, but this tradeoff is to
>> +		 * keep things relatively simple.
> Why simple?
>
> What's wrong with having
>
> [node IDs][GPU node IDs]
>
> i.e., the usual nodes come first and the GPU ones after it.
>
> You enumerate everything properly here so you can control what goes
> where. Which means, you don't need this NONCPU_NODE_INDEX non-sense at
> all.
>
> Hmmm?
>
>> +		 */
>> +		amd_northbridges.num = NONCPU_NODE_INDEX + misc_count_noncpu;
>> +	} else {
>> +		amd_northbridges.num = misc_count;
>> +	}
>> +
>> +	nb = kcalloc(amd_northbridges.num, sizeof(struct amd_northbridge), GFP_KERNEL);
>>  	if (!nb)
>>  		return -ENOMEM;
>>
>>  	amd_northbridges.nb = nb;
>> -	amd_northbridges.num = misc_count;
>>
>>  	link = misc = root = NULL;
>> -	for (i = 0; i < amd_northbridges.num; i++) {
>> +	for (i = 0; i < misc_count; i++) {
>>  		node_to_amd_nb(i)->root = root =
>>  			next_northbridge(root, root_ids);
>>  		node_to_amd_nb(i)->misc = misc =
>> @@ -299,6 +340,18 @@ int amd_cache_northbridges(void)
>>  		root = next_northbridge(root, root_ids);
>>  	}
>>
>> +	if (misc_count_noncpu) {
>> +		link = misc = root = NULL;
>> +		for (i = NONCPU_NODE_INDEX; i < NONCPU_NODE_INDEX + misc_count_noncpu; i++) {
> So this is not keeping things relatively simple - this is making you
> jump to the GPU nodes to prepare them too which is making them special.
>
>> +			node_to_amd_nb(i)->root = root =
>> +				next_northbridge(root, noncpu_root_ids);
>> +			node_to_amd_nb(i)->misc = misc =
>> +				next_northbridge(misc, noncpu_misc_ids);
>> +			node_to_amd_nb(i)->link = link =
>> +				next_northbridge(link, noncpu_link_ids);
> And seeing how you put those pointers in ->root, ->misc and ->link,
> you can just as well drop those noncpu_*_ids and put the aldebaran
> PCI IDs simply in amd_root_ids, amd_nb_misc_ids and amd_nb_link_ids
> respectively.
>
> Because to this code, the RAS functionality is no different than any
> other CPU because, well, the interface is those PCI devices. So the
> thing doesn't care if it is GPU or not.
>
> So you don't need any of that separation between GPU and CPU nodes when
> it comes to the RAS code.

The roots_per_misc count is different for the CPU nodes and GPU nodes.
We tried to address your comment without introducing pci_device_id arrays
for the GPU roots, miscs and links.
But introducing GPU ID arrays looks cleaner; let me submit the revised
code and we can revisit this point.

> Makes sense?
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

Regards,
Naveen
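For comparison, the [node IDs][GPU node IDs] layout Boris suggests — CPU
nodes first, GPU nodes packed immediately after — needs no fixed offset at
all. A minimal sketch, with hypothetical helper names rather than actual
kernel code:

```c
#include <assert.h>

/* Contiguous layout: GPU nodes follow the CPU nodes directly, so the
 * array holds exactly cpu_nodes + gpu_nodes entries and no slots go
 * unused, regardless of how many CPU nodes exist. */
static int nb_array_size_contig(int cpu_nodes, int gpu_nodes)
{
	return cpu_nodes + gpu_nodes;
}

/* Array index of the n-th GPU node (0-based) in the contiguous layout. */
static int gpu_node_index(int cpu_nodes, int n)
{
	return cpu_nodes + n;
}
```

With 4 CPU nodes and 2 GPU nodes this allocates 6 entries and places the
first GPU at index 4 — smaller than the fixed-offset scheme, at the cost
of the array index no longer matching the hardware's "AMD Node ID".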