lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250519085130.GFaCrxEnZvaoETKrao@fat_crate.local>
Date: Mon, 19 May 2025 10:51:30 +0200
From: Borislav Petkov <bp@...en8.de>
To: Vijay Balakrishna <vijayb@...ux.microsoft.com>
Cc: Tony Luck <tony.luck@...el.com>, Rob Herring <robh@...nel.org>,
	Krzysztof Kozlowski <krzk+dt@...nel.org>,
	Conor Dooley <conor+dt@...nel.org>,
	James Morse <james.morse@....com>,
	Mauro Carvalho Chehab <mchehab@...nel.org>,
	Robert Richter <rric@...nel.org>, linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org, Tyler Hicks <code@...icks.com>,
	Marc Zyngier <maz@...nel.org>,
	Sascha Hauer <s.hauer@...gutronix.de>,
	Lorenzo Pieralisi <lpieralisi@...nel.org>,
	devicetree@...r.kernel.org
Subject: Re: [PATCH 1/3] drivers/edac: Add L1 and L2 error detection for A72

On Thu, May 15, 2025 at 05:06:11PM -0700, Vijay Balakrishna wrote:
> Subject: Re: [PATCH 1/3] drivers/edac: Add L1 and L2 error detection for A72

git log drivers/edac/

to get inspired about proper commit titles and prefix.

> From: Sascha Hauer <s.hauer@...gutronix.de>
> 
> The Cortex A72 cores have error detection capabilities for
> the L1/L2 Caches, this patch adds a driver for them. The selected errors

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> to detect/report are by reading CPU/L2 memory error syndrome registers.
> 
> Unfortunately there is no robust way to inject errors into the caches,
> so this driver doesn't contain any code to actually test it. It has
> been tested though with code taken from an older version [1] of this
> driver.  For reasons stated in thread [1], the error injection code is
> not suitable for mainline, so it is removed from the driver.
> 
> [1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t
> 
> Signed-off-by: Sascha Hauer <s.hauer@...gutronix.de>
> Co-developed-by: Vijay Balakrishna <vijayb@...ux.microsoft.com>
> Signed-off-by: Vijay Balakrishna <vijayb@...ux.microsoft.com>
> ---
>  drivers/edac/Kconfig    |   8 ++
>  drivers/edac/Makefile   |   1 +
>  drivers/edac/edac_a72.c | 233 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 242 insertions(+)
>  create mode 100644 drivers/edac/edac_a72.c
> 
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 19ad3c3b675d..7c99bb04b0c4 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -576,4 +576,12 @@ config EDAC_LOONGSON
>  	  errors (CE) only. Loongson-3A5000/3C5000/3D5000/3A6000/3C6000
>  	  are compatible.
>  
> +config EDAC_CORTEX_A72
> +	tristate "ARM Cortex A72"
> +	depends on ARM64
> +	help
> +	  Support for L1/L2 cache error detection for ARM Cortex A72 processor.
> +	  The detected and reported erros are from reading CPU/L2 memory error

+         The detected and reported erros are from reading memory error
Unknown word [erros] in Kconfig help text.
Suggestions: ['errors', 'Eros', 'errs', 'euros'...

> +	  syndrome registers.
> +
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index a8f2d8f6c894..835539b5d5af 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -88,3 +88,4 @@ obj-$(CONFIG_EDAC_NPCM)			+= npcm_edac.o
>  obj-$(CONFIG_EDAC_ZYNQMP)		+= zynqmp_edac.o
>  obj-$(CONFIG_EDAC_VERSAL)		+= versal_edac.o
>  obj-$(CONFIG_EDAC_LOONGSON)		+= loongson_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_A72)	+= edac_a72.o

I don't know what tree you are preparing your patches against - it should be
this one:

https://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git/log/?h=edac-for-next

but the indentation level here is wrong:

obj-$(CONFIG_EDAC_ZYNQMP)^I^I+= zynqmp_edac.o$
obj-$(CONFIG_EDAC_VERSAL)^I^I+= versal_edac.o$
obj-$(CONFIG_EDAC_LOONGSON)^I^I+= loongson_edac.o$
obj-$(CONFIG_EDAC_CORTEX_A72)^I+= edac_a72.o$
			     ^^^

after I apply your patch.

> diff --git a/drivers/edac/edac_a72.c b/drivers/edac/edac_a72.c
> new file mode 100644
> index 000000000000..13acd7e7cef0
> --- /dev/null
> +++ b/drivers/edac/edac_a72.c
> @@ -0,0 +1,233 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Cortex A72 EDAC L1 and L2 cache error detection
> + *
> + * Copyright (c) 2020 Pengutronix, Sascha Hauer <s.hauer@...gutronix.de>
> + *
> + * Based on Code from:
> + * Copyright (c) 2018, NXP Semiconductor
> + * Author: York Sun <york.sun@....com>
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/bitfield.h>
> +#include <asm/smp_plat.h>
> +
> +#include "edac_module.h"
> +
> +#define DRVNAME				"edac-a72"
> +
> +#define CPUMERRSR_EL1_RAMID		GENMASK(30, 24)
> +
> +#define CPUMERRSR_EL1_VALID		BIT(31)
> +#define CPUMERRSR_EL1_FATAL		BIT(63)
> +
> +#define L1_I_TAG_RAM			0x00
> +#define L1_I_DATA_RAM			0x01
> +#define L1_D_TAG_RAM			0x08
> +#define L1_D_DATA_RAM			0x09
> +#define TLB_RAM				0x18
> +
> +#define L2MERRSR_EL1_CPUID_WAY	GENMASK(21, 18)
> +
> +#define L2MERRSR_EL1_VALID		BIT(31)
> +#define L2MERRSR_EL1_FATAL		BIT(63)
> +
> +struct merrsr {
> +	u64 cpumerr;
> +	u64 l2merr;
> +};

That struct naming needs some making the names more understandable. "merrsr"
doesn't tell me anything.

> +
> +#define MESSAGE_SIZE 64
> +
> +#define SYS_CPUMERRSR_EL1			sys_reg(3, 1, 15, 2, 2)
> +#define SYS_L2MERRSR_EL1			sys_reg(3, 1, 15, 2, 3)

Please group all defines together, align them vertically and then put other
definitions below. Look at other drivers for inspiration.

> +
> +static struct cpumask compat_mask;
> +
> +static void report_errors(struct edac_device_ctl_info *edac_ctl, int cpu,
> +			  struct merrsr *merrsr)
> +{
> +	char msg[MESSAGE_SIZE];
> +	u64 cpumerr = merrsr->cpumerr;
> +	u64 l2merr = merrsr->l2merr;

The edac-tree preferred ordering of variable declarations at the
beginning of a function is reverse fir tree order::

	struct long_struct_name *descriptive_name;
	unsigned long foo, bar;
	unsigned int tmp;
	int ret;

The above is faster to parse than the reverse ordering::

	int ret;
	unsigned int tmp;
	unsigned long foo, bar;
	struct long_struct_name *descriptive_name;

And even more so than random ordering::

	unsigned long foo, bar;
	int ret;
	struct long_struct_name *descriptive_name;
	unsigned int tmp;

Please check all your functions.

> +
> +	if (cpumerr & CPUMERRSR_EL1_VALID) {
> +		const char *str;
> +		bool fatal = cpumerr & CPUMERRSR_EL1_FATAL;
> +
> +		switch (FIELD_GET(CPUMERRSR_EL1_RAMID, cpumerr)) {
> +		case L1_I_TAG_RAM:
> +			str = "L1-I Tag RAM";
> +			break;
> +		case L1_I_DATA_RAM:
> +			str = "L1-I Data RAM";
> +			break;
> +		case L1_D_TAG_RAM:
> +			str = "L1-D Tag RAM";
> +			break;
> +		case L1_D_DATA_RAM:
> +			str = "L1-D Data RAM";
> +			break;
> +		case TLB_RAM:
> +			str = "TLB RAM";
> +			break;
> +		default:
> +			str = "Unspecified";
> +			break;
> +		}
> +
> +		snprintf(msg, MESSAGE_SIZE, "%s %s error(s) on CPU %d",
> +			 str, fatal ? "fatal" : "correctable", cpu);
> +
> +		if (fatal)
> +			edac_device_handle_ue(edac_ctl, cpu, 0, msg);
> +		else
> +			edac_device_handle_ce(edac_ctl, cpu, 0, msg);
> +	}
> +
> +	if (l2merr & L2MERRSR_EL1_VALID) {
> +		bool fatal = l2merr & L2MERRSR_EL1_FATAL;
> +
> +		snprintf(msg, MESSAGE_SIZE, "L2 %s error(s) on CPU %d CPUID/WAY 0x%lx",
> +			 fatal ? "fatal" : "correctable", cpu,
> +			 FIELD_GET(L2MERRSR_EL1_CPUID_WAY, l2merr));
> +		if (fatal)
> +			edac_device_handle_ue(edac_ctl, cpu, 1, msg);
> +		else
> +			edac_device_handle_ce(edac_ctl, cpu, 1, msg);
> +	}
> +}
> +
> +static void read_errors(void *data)
> +{
> +	struct merrsr *merrsr = data;
> +
> +	merrsr->cpumerr = read_sysreg_s(SYS_CPUMERRSR_EL1);
> +	if (merrsr->cpumerr & CPUMERRSR_EL1_VALID) {
> +		write_sysreg_s(0, SYS_CPUMERRSR_EL1);
> +		isb();
> +	}
> +	merrsr->l2merr = read_sysreg_s(SYS_L2MERRSR_EL1);
> +	if (merrsr->l2merr & L2MERRSR_EL1_VALID) {
> +		write_sysreg_s(0, SYS_L2MERRSR_EL1);
> +		isb();
> +	}
> +}
> +
> +static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)

All static functions don't need a prefix "cortex_arm64_".

> +{
> +	struct merrsr merrsr;
> +	int cpu;

I'd venture a guess you need to protect here against CPU hotplug...

> +	for_each_cpu_and(cpu, cpu_online_mask, &compat_mask) {
> +		smp_call_function_single(cpu, read_errors, &merrsr, true);
> +		report_errors(edac_ctl, cpu, &merrsr);
> +	}
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	struct edac_device_ctl_info *edac_ctl;
> +	struct device *dev = &pdev->dev;
> +	int rc;
> +
> +	edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +					      num_possible_cpus(), "L", 2, 1,
> +					      edac_device_alloc_index());
> +	if (!edac_ctl)
> +		return -ENOMEM;
> +
> +	edac_ctl->edac_check = cortex_arm64_edac_check;
> +	edac_ctl->dev = dev;
> +	edac_ctl->mod_name = dev_name(dev);
> +	edac_ctl->dev_name = dev_name(dev);
> +	edac_ctl->ctl_name = DRVNAME;
> +	dev_set_drvdata(dev, edac_ctl);
> +
> +	rc = edac_device_add_device(edac_ctl);
> +	if (rc)
> +		goto out_dev;
> +
> +	return 0;
> +
> +out_dev:
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return rc;
> +}
> +
> +static void cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct edac_device_ctl_info *edac_ctl = dev_get_drvdata(&pdev->dev);
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,cortex-a72" },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = DRVNAME,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_driver_init(void)
> +{
> +	struct device_node *np;
> +	int cpu;
> +	struct platform_device *pdev;
> +	int err;
> +
> +	for_each_possible_cpu(cpu) {
> +		np = of_get_cpu_node(cpu, NULL);
> +

^ Superfluous newline.

> +		if (!np) {
> +			pr_warn("failed to find device node for cpu %d\n", cpu);

In visible strings s/cpu/CPU/g

> +			continue;
> +		}
> +		if (!of_match_node(cortex_arm64_edac_of_match, np))
> +			continue;
> +		if (!of_property_read_bool(np, "edac-enabled"))
> +			continue;
> +		cpumask_set_cpu(cpu, &compat_mask);
> +		of_node_put(np);
> +	}
> +
> +	if (cpumask_empty(&compat_mask))
> +		return 0;
> +
> +	err = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (err)
> +		return err;
> +
> +	pdev = platform_device_register_simple(DRVNAME, -1, NULL, 0);
> +	if (IS_ERR(pdev)) {
> +		pr_err("failed to register cortex arm64 edac device\n");

That driver is called edac_a72 now.

"cortex arm64 edac" is too broad.

> +		platform_driver_unregister(&cortex_arm64_edac_driver);
> +		return PTR_ERR(pdev);
> +	}
> +
> +	return 0;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ