Message-ID: <637fa0190fe64594954ee4d9e012c39c@huawei.com>
Date: Mon, 27 Jan 2025 12:53:16 +0000
From: Shiju Jose <shiju.jose@...wei.com>
To: Dan Williams <dan.j.williams@...el.com>, "linux-edac@...r.kernel.org"
<linux-edac@...r.kernel.org>, "linux-cxl@...r.kernel.org"
<linux-cxl@...r.kernel.org>, "linux-acpi@...r.kernel.org"
<linux-acpi@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "bp@...en8.de" <bp@...en8.de>, "tony.luck@...el.com"
<tony.luck@...el.com>, "rafael@...nel.org" <rafael@...nel.org>,
"lenb@...nel.org" <lenb@...nel.org>, "mchehab@...nel.org"
<mchehab@...nel.org>, "dave@...olabs.net" <dave@...olabs.net>, "Jonathan
Cameron" <jonathan.cameron@...wei.com>, "dave.jiang@...el.com"
<dave.jiang@...el.com>, "alison.schofield@...el.com"
<alison.schofield@...el.com>, "vishal.l.verma@...el.com"
<vishal.l.verma@...el.com>, "ira.weiny@...el.com" <ira.weiny@...el.com>,
"david@...hat.com" <david@...hat.com>, "Vilas.Sridharan@....com"
<Vilas.Sridharan@....com>, "leo.duran@....com" <leo.duran@....com>,
"Yazen.Ghannam@....com" <Yazen.Ghannam@....com>, "rientjes@...gle.com"
<rientjes@...gle.com>, "jiaqiyan@...gle.com" <jiaqiyan@...gle.com>,
"Jon.Grimm@....com" <Jon.Grimm@....com>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "naoya.horiguchi@....com"
<naoya.horiguchi@....com>, "james.morse@....com" <james.morse@....com>,
"jthoughton@...gle.com" <jthoughton@...gle.com>, "somasundaram.a@....com"
<somasundaram.a@....com>, "erdemaktas@...gle.com" <erdemaktas@...gle.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "duenwen@...gle.com"
<duenwen@...gle.com>, "gthelen@...gle.com" <gthelen@...gle.com>,
"wschwartz@...erecomputing.com" <wschwartz@...erecomputing.com>,
"dferguson@...erecomputing.com" <dferguson@...erecomputing.com>,
"wbs@...amperecomputing.com" <wbs@...amperecomputing.com>,
"nifan.cxl@...il.com" <nifan.cxl@...il.com>, tanxiaofei
<tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>, "Roberto
Sassu" <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol
scrub control feature
Hi Dan,
Thanks for the comments.
Please find my replies inline.
Thanks,
Shiju
>-----Original Message-----
>From: Dan Williams <dan.j.williams@...el.com>
>Sent: 24 January 2025 20:39
>To: Shiju Jose <shiju.jose@...wei.com>; linux-edac@...r.kernel.org; linux-
>cxl@...r.kernel.org; linux-acpi@...r.kernel.org; linux-mm@...ck.org; linux-
>kernel@...r.kernel.org
>Cc: bp@...en8.de; tony.luck@...el.com; rafael@...nel.org; lenb@...nel.org;
>mchehab@...nel.org; dan.j.williams@...el.com; dave@...olabs.net; Jonathan
>Cameron <jonathan.cameron@...wei.com>; dave.jiang@...el.com;
>alison.schofield@...el.com; vishal.l.verma@...el.com; ira.weiny@...el.com;
>david@...hat.com; Vilas.Sridharan@....com; leo.duran@....com;
>Yazen.Ghannam@....com; rientjes@...gle.com; jiaqiyan@...gle.com;
>Jon.Grimm@....com; dave.hansen@...ux.intel.com;
>naoya.horiguchi@....com; james.morse@....com; jthoughton@...gle.com;
>somasundaram.a@....com; erdemaktas@...gle.com; pgonda@...gle.com;
>duenwen@...gle.com; gthelen@...gle.com;
>wschwartz@...erecomputing.com; dferguson@...erecomputing.com;
>wbs@...amperecomputing.com; nifan.cxl@...il.com; tanxiaofei
><tanxiaofei@...wei.com>; Zengtao (B) <prime.zeng@...ilicon.com>; Roberto
>Sassu <roberto.sassu@...wei.com>; kangkang.shen@...urewei.com;
>wanghuiqiang <wanghuiqiang@...wei.com>; Linuxarm
><linuxarm@...wei.com>; Shiju Jose <shiju.jose@...wei.com>
>Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol
>scrub control feature
>
>shiju.jose@ wrote:
>> From: Shiju Jose <shiju.jose@...wei.com>
>>
>> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub
>> control feature. The device patrol scrub proactively locates and makes
>> corrections to errors in regular cycle.
>>
>> Allow specifying the number of hours within which the patrol scrub
>> must be completed, subject to minimum and maximum limits reported by the
>device.
>> Also allow disabling scrub allowing trade-off error rates against
>> performance.
>>
>> Add support for patrol scrub control on CXL memory devices.
>> Register with the EDAC device driver, which retrieves the scrub
>> attribute descriptors from EDAC scrub and exposes the sysfs scrub
>> control attributes to userspace. For example, scrub control for the
>> CXL memory device "cxl_mem0" is exposed in
>/sys/bus/edac/devices/cxl_mem0/scrubX/.
>>
>> Additionally, add support for region-based CXL memory patrol scrub control.
>> CXL memory regions may be interleaved across one or more CXL memory
>> devices. For example, region-based scrub control for "cxl_region1" is
>> exposed in /sys/bus/edac/devices/cxl_region1/scrubX/.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@...el.com>
>> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>> Signed-off-by: Shiju Jose <shiju.jose@...wei.com>
>> ---
>> Documentation/edac/scrub.rst | 66 ++++++
>> drivers/cxl/Kconfig | 17 ++
>> drivers/cxl/core/Makefile | 1 +
>> drivers/cxl/core/memfeature.c | 392
>++++++++++++++++++++++++++++++++++
>> drivers/cxl/core/region.c | 6 +
>> drivers/cxl/cxlmem.h | 7 +
>> drivers/cxl/mem.c | 5 +
>> include/cxl/features.h | 16 ++
>> 8 files changed, 510 insertions(+)
>> create mode 100644 drivers/cxl/core/memfeature.c diff --git
>> a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst index
>> f86645c7f0af..80e986c57885 100644
>> --- a/Documentation/edac/scrub.rst
>> +++ b/Documentation/edac/scrub.rst
>> @@ -325,3 +325,69 @@ root@...alhost:~# cat
>> /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_d
>> 10800
>>
>> root@...alhost:~# echo 0 >
>> /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
>> +
>> +2. CXL memory device patrol scrubber
>> +
>> +2.1 Device based scrubbing
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/min_cycle_duration
>> +
>> +3600
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/max_cycle_duration
>> +
>> +918000
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +43200
>> +
>> +root@...alhost:~# echo 54000 >
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/current_cycle_duration
>> +
>> +54000
>> +
>> +root@...alhost:~# echo 1 >
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +1
>> +
>> +root@...alhost:~# echo 0 >
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_mem0/scrub0/enable_background
>> +
>> +0
>> +
>> +2.2. Region based scrubbing
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/min_cycle_duration
>> +
>> +3600
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/max_cycle_duration
>> +
>> +918000
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +43200
>> +
>> +root@...alhost:~# echo 54000 >
>> +/sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/current_cycle_duration
>> +
>> +54000
>> +
>> +root@...alhost:~# echo 1 >
>> +/sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +1
>> +
>> +root@...alhost:~# echo 0 >
>> +/sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>> +
>> +root@...alhost:~# cat
>> +/sys/bus/edac/devices/cxl_region0/scrub0/enable_background
>
>What is this content-free blob of cat and echo statements? Please write actual
>documentation with theory of operation, clarification of assumptions, rationale
>for defaults, guidance on changing defaults...
Jonathan already replied.
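As one input for that documentation: the cycle durations shown above are in seconds, while the device takes the scrub cycle in hours (CXL_DEV_HOUR_IN_SECS in the patch). Below is a minimal userspace sketch of that conversion. It assumes the sysfs value is simply divided by 3600 and that the hours field is 8 bits wide; the helper names are hypothetical and are not the ones used in the patch.

```c
#include <stdint.h>

#define CXL_DEV_HOUR_IN_SECS 3600	/* same value as in the patch */

/*
 * Hypothetical helper: sysfs cycle duration (seconds) -> device cycle (hours).
 * Returns -1 if the result does not fit the assumed 8-bit hours field.
 */
static int scrub_secs_to_hrs(uint32_t secs, uint8_t *hrs)
{
	uint32_t h = secs / CXL_DEV_HOUR_IN_SECS;

	if (h > UINT8_MAX)
		return -1;
	*hrs = (uint8_t)h;
	return 0;
}

/* Hypothetical helper: device cycle (hours) -> sysfs cycle duration (seconds). */
static uint32_t scrub_hrs_to_secs(uint8_t hrs)
{
	return (uint32_t)hrs * CXL_DEV_HOUR_IN_SECS;
}
```

Under these assumptions the 54000 written in the example corresponds to 15 hours, and the 918000 maximum corresponds to 255 hours, the largest value an 8-bit field can hold.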
>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index
>> 0bc6a2cb8474..6078f02e883b 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -154,4 +154,21 @@ config CXL_FEATURES
>>
>> If unsure say 'y'.
>>
>> +config CXL_RAS_FEATURES
>> + tristate "CXL: Memory RAS features"
>> + depends on CXL_PCI
>
>What is the build dependency on CXL_PCI? This enabling does not call back into
>symbols provided by cxl_pci.ko does it?
Will remove; it is not required. Initially cxl_mem_ras_features_init() was called from pci.c.
>
>> + depends on CXL_MEM
>
>Similar comment, and this also goes away if all of this just moves into the new
>cxl_features driver.
I agree with what Jonathan said in his reply. These are RAS-specific features for CXL memory
devices and were therefore added in memfeature.c.
>
>> + depends on EDAC
>> + help
>> + The CXL memory RAS feature control is optional and allows host to
>> + control the RAS features configurations of CXL Type 3 devices.
>> +
>> + It registers with the EDAC device subsystem to expose control
>> + attributes of CXL memory device's RAS features to the user.
>> + It provides interface functions to support configuring the CXL
>> + memory device's RAS features.
>> + Say 'y/m/n' to enable/disable control of the CXL.mem device's RAS
>features.
>> + See section 8.2.9.9.11 of CXL 3.1 specification for the detailed
>> + information of CXL memory device features.
>
>Usually the "say X" statement provides a rationale like.
>
>"Say y/m if you have an expert need to change default memory scrub rates
>established by the platform/device, otherwise say n"
Will change.
>
>> +
>> endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 73b6348afd67..54baca513ecb 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -17,3 +17,4 @@ cxl_core-y += cdat.o cxl_core-y += features.o
>> cxl_core-$(CONFIG_TRACING) += trace.o
>> cxl_core-$(CONFIG_CXL_REGION) += region.o
>> +cxl_core-$(CONFIG_CXL_RAS_FEATURES) += memfeature.o
>> diff --git a/drivers/cxl/core/memfeature.c
>> b/drivers/cxl/core/memfeature.c new file mode 100644 index
>> 000000000000..77d1bf6ce45f
>> --- /dev/null
>> +++ b/drivers/cxl/core/memfeature.c
>> @@ -0,0 +1,392 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * CXL memory RAS feature driver.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + *
>> + * - Supports functions to configure RAS features of the
>> + * CXL memory devices.
>> + * - Registers with the EDAC device subsystem driver to expose
>> + * the features sysfs attributes to the user for configuring
>> + * CXL memory RAS feature.
>> + */
>> +
>> +#include <linux/cleanup.h>
>> +#include <linux/edac.h>
>> +#include <linux/limits.h>
>> +#include <cxl/features.h>
>> +#include <cxl.h>
>> +#include <cxlmem.h>
>> +
>> +#define CXL_DEV_NUM_RAS_FEATURES 1
>> +#define CXL_DEV_HOUR_IN_SECS 3600
>> +
>> +#define CXL_DEV_NAME_LEN 128
>> +
>> +/* CXL memory patrol scrub control functions */ struct
>> +cxl_patrol_scrub_context {
>> + u8 instance;
>> + u16 get_feat_size;
>> + u16 set_feat_size;
>> + u8 get_version;
>> + u8 set_version;
>> + u16 effects;
>> + struct cxl_memdev *cxlmd;
>> + struct cxl_region *cxlr;
>> +};
>> +
>> +/**
>> + * struct cxl_memdev_ps_params - CXL memory patrol scrub parameter data
>structure.
>> + * @enable: [IN & OUT] enable(1)/disable(0) patrol scrub.
>> + * @scrub_cycle_changeable: [OUT] scrub cycle attribute of patrol scrub is
>changeable.
>> + * @scrub_cycle_hrs: [IN] Requested patrol scrub cycle in hours.
>> + * [OUT] Current patrol scrub cycle in hours.
>> + * @min_scrub_cycle_hrs:[OUT] minimum patrol scrub cycle in hours
>supported.
>> + */
>> +struct cxl_memdev_ps_params {
>> + bool enable;
>> + bool scrub_cycle_changeable;
>> + u8 scrub_cycle_hrs;
>> + u8 min_scrub_cycle_hrs;
>> +};
>> +
>> +enum cxl_scrub_param {
>> + CXL_PS_PARAM_ENABLE,
>> + CXL_PS_PARAM_SCRUB_CYCLE,
>> +};
>> +
>> +#define CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK BIT(0)
>> +#define
> CXL_MEMDEV_PS_SCRUB_CYCLE_REALTIME_REPORT_CAP_MASK
> BIT(1)
>> +#define CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK GENMASK(7, 0)
>> +#define CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK GENMASK(15,
>8)
>> +#define CXL_MEMDEV_PS_FLAG_ENABLED_MASK BIT(0)
>> +
>> +struct cxl_memdev_ps_rd_attrs {
>> + u8 scrub_cycle_cap;
>> + __le16 scrub_cycle_hrs;
>> + u8 scrub_flags;
>> +} __packed;
>> +
>> +struct cxl_memdev_ps_wr_attrs {
>> + u8 scrub_cycle_hrs;
>> + u8 scrub_flags;
>> +} __packed;
>
>If these are packed to match specification layout, include a specification
>reference comment.
Will add a specification reference comment. The same was added for the memory repair features,
but was missed here.
>
>> +
>> +static int cxl_mem_ps_get_attrs(struct cxl_mailbox *cxl_mbox,
>> + struct cxl_memdev_ps_params *params) {
>> + size_t rd_data_size = sizeof(struct cxl_memdev_ps_rd_attrs);
>> + u16 scrub_cycle_hrs;
>> + size_t data_size;
>> + u16 return_code;
>> + struct cxl_memdev_ps_rd_attrs *rd_attrs __free(kfree) =
>> + kmalloc(rd_data_size,
>GFP_KERNEL);
>
>I would feel better with kzalloc() if short reads are possible.
Will change to kzalloc().
>
>How big can rd_data_size get? I.e. should this be kvzalloc()?
rd_data_size is 4 bytes for the patrol scrub feature.
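To illustrate the size and the field layout, here is a userspace sketch with a stand-in for struct cxl_memdev_ps_rd_attrs. The stand-in type and helper names are hypothetical; the __le16 field is held as two raw bytes so the sketch does not depend on host endianness. The bit positions follow the GENMASK definitions in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct cxl_memdev_ps_rd_attrs from the patch. */
struct ps_rd_attrs {
	uint8_t scrub_cycle_cap;
	uint8_t scrub_cycle_hrs[2];	/* __le16 in the patch, little-endian */
	uint8_t scrub_flags;
} __attribute__((packed));

/* The read payload is 4 bytes, so a plain kzalloc() is sufficient. */
static_assert(sizeof(struct ps_rd_attrs) == 4, "read payload is 4 bytes");

/* Reassemble the little-endian 16-bit value from its two bytes. */
static uint16_t ps_cycle_raw(const struct ps_rd_attrs *a)
{
	return (uint16_t)(a->scrub_cycle_hrs[0] | (a->scrub_cycle_hrs[1] << 8));
}

/* Bits 7:0: current scrub cycle in hours. */
static uint8_t ps_cur_cycle_hrs(const struct ps_rd_attrs *a)
{
	return ps_cycle_raw(a) & 0xff;
}

/* Bits 15:8: minimum supported scrub cycle in hours. */
static uint8_t ps_min_cycle_hrs(const struct ps_rd_attrs *a)
{
	return (ps_cycle_raw(a) >> 8) & 0xff;
}
```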
>
>> + if (!rd_attrs)
>> + return -ENOMEM;
>> +
>> + data_size = cxl_get_feature(cxl_mbox->features,
>CXL_FEAT_PATROL_SCRUB_UUID,
>> + CXL_GET_FEAT_SEL_CURRENT_VALUE,
>> + rd_attrs, rd_data_size, 0, &return_code);
>> + if (!data_size || return_code != CXL_MBOX_CMD_RC_SUCCESS)
>> + return -EIO;
>> +
>> + params->scrub_cycle_changeable =
>FIELD_GET(CXL_MEMDEV_PS_SCRUB_CYCLE_CHANGE_CAP_MASK,
>> + rd_attrs->scrub_cycle_cap);
>> + params->enable =
>FIELD_GET(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> + rd_attrs->scrub_flags);
>> + scrub_cycle_hrs = le16_to_cpu(rd_attrs->scrub_cycle_hrs);
>> + params->scrub_cycle_hrs =
>FIELD_GET(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> + scrub_cycle_hrs);
>> + params->min_scrub_cycle_hrs =
>FIELD_GET(CXL_MEMDEV_PS_MIN_SCRUB_CYCLE_MASK,
>> + scrub_cycle_hrs);
>> +
>> + return 0;
>> +}
>> +
>> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx,
>> + struct cxl_memdev_ps_params *params) {
>> + struct cxl_memdev *cxlmd;
>> + u16 min_scrub_cycle = 0;
>> + int i, ret;
>> +
>> + if (cxl_ps_ctx->cxlr) {
>> + struct cxl_region *cxlr = cxl_ps_ctx->cxlr;
>> + struct cxl_region_params *p = &cxlr->params;
>> +
>> + for (i = p->interleave_ways - 1; i >= 0; i--) {
>> + struct cxl_endpoint_decoder *cxled = p->targets[i];
>
>It looks like this is called directly as a callback from EDAC. Where is the locking
>that keeps cxl_ps_ctx->cxlr valid, or p->targets content stable?
Jonathan already replied.
>
>> +
>> + cxlmd = cxled_to_memdev(cxled);
>> + ret = cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox,
>params);
>> + if (ret)
>> + return ret;
>> +
>> + if (params->min_scrub_cycle_hrs > min_scrub_cycle)
>> + min_scrub_cycle = params-
>>min_scrub_cycle_hrs;
>> + }
>> + params->min_scrub_cycle_hrs = min_scrub_cycle;
>> + return 0;
>> + }
>> + cxlmd = cxl_ps_ctx->cxlmd;
>> +
>> + return cxl_mem_ps_get_attrs(&cxlmd->cxlds->cxl_mbox, params); }
>> +
>> +static int cxl_mem_ps_set_attrs(struct device *dev,
>> + struct cxl_patrol_scrub_context *cxl_ps_ctx,
>> + struct cxl_mailbox *cxl_mbox,
>> + struct cxl_memdev_ps_params *params,
>> + enum cxl_scrub_param param_type)
>> +{
>> + struct cxl_memdev_ps_wr_attrs wr_attrs;
>> + struct cxl_memdev_ps_params rd_params;
>> + u16 return_code;
>> + int ret;
>> +
>> + ret = cxl_mem_ps_get_attrs(cxl_mbox, &rd_params);
>> + if (ret) {
>> + dev_err(dev, "Get cxlmemdev patrol scrub params failed
>ret=%d\n",
>> + ret);
>> + return ret;
>> + }
>> +
>> + switch (param_type) {
>> + case CXL_PS_PARAM_ENABLE:
>> + wr_attrs.scrub_flags =
>FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> + params->enable);
>> + wr_attrs.scrub_cycle_hrs =
>FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> +
>rd_params.scrub_cycle_hrs);
>> + break;
>> + case CXL_PS_PARAM_SCRUB_CYCLE:
>> + if (params->scrub_cycle_hrs < rd_params.min_scrub_cycle_hrs)
>{
>> + dev_err(dev, "Invalid CXL patrol scrub cycle(%d) to
>set\n",
>> + params->scrub_cycle_hrs);
>> + dev_err(dev, "Minimum supported CXL patrol scrub
>cycle in hour %d\n",
>> + rd_params.min_scrub_cycle_hrs);
>> + return -EINVAL;
>> + }
>> + wr_attrs.scrub_cycle_hrs =
>FIELD_PREP(CXL_MEMDEV_PS_CUR_SCRUB_CYCLE_MASK,
>> + params->scrub_cycle_hrs);
>> + wr_attrs.scrub_flags =
>FIELD_PREP(CXL_MEMDEV_PS_FLAG_ENABLED_MASK,
>> + rd_params.enable);
>> + break;
>> + }
>> +
>> + ret = cxl_set_feature(cxl_mbox->features,
>CXL_FEAT_PATROL_SCRUB_UUID,
>> + cxl_ps_ctx->set_version,
>> + &wr_attrs, sizeof(wr_attrs),
>> + CXL_SET_FEAT_FLAG_DATA_SAVED_ACROSS_RESET,
>> + 0, &return_code);
>> + if (ret || return_code != CXL_MBOX_CMD_RC_SUCCESS) {
>> + dev_err(dev, "CXL patrol scrub set feature failed ret=%d
>return_code=%u\n",
>> + ret, return_code);
>
>What can the admin do with this log spam? I would reconsider making all of
>these dev_dbg() and improving the sysfs documentation on what error codes
>mean.
Sure, will change.
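Once the messages are dev_dbg(), the return code is what userspace sees, so the documentation can describe it. Below is a userspace sketch of the read-modify-write logic in cxl_mem_ps_set_attrs(), with the logging removed and only the error code returned. The type and function names are hypothetical stand-ins; the behaviour is meant to mirror the switch statement in the patch, not to replace it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PS_FLAG_ENABLED 0x01	/* bit 0, as CXL_MEMDEV_PS_FLAG_ENABLED_MASK */

enum ps_param { PS_PARAM_ENABLE, PS_PARAM_SCRUB_CYCLE };

/* Stand-in for struct cxl_memdev_ps_params. */
struct ps_params {
	bool enable;
	uint8_t scrub_cycle_hrs;
	uint8_t min_scrub_cycle_hrs;
};

/* Stand-in for struct cxl_memdev_ps_wr_attrs. */
struct ps_wr_attrs {
	uint8_t scrub_cycle_hrs;
	uint8_t scrub_flags;
};

/*
 * Build the write payload from the current device state (cur) and the
 * request (req): only the requested parameter changes, the other one is
 * carried over from the device. Returns 0 on success, or -EINVAL if the
 * requested cycle is below the minimum the device reports.
 */
static int ps_build_wr_attrs(const struct ps_params *cur,
			     const struct ps_params *req,
			     enum ps_param type, struct ps_wr_attrs *wr)
{
	switch (type) {
	case PS_PARAM_ENABLE:
		wr->scrub_flags = req->enable ? PS_FLAG_ENABLED : 0;
		wr->scrub_cycle_hrs = cur->scrub_cycle_hrs;
		return 0;
	case PS_PARAM_SCRUB_CYCLE:
		if (req->scrub_cycle_hrs < cur->min_scrub_cycle_hrs)
			return -EINVAL;
		wr->scrub_cycle_hrs = req->scrub_cycle_hrs;
		wr->scrub_flags = cur->enable ? PS_FLAG_ENABLED : 0;
		return 0;
	}
	return -EINVAL;
}
```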
>
>[..]
>> +
>> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct
>> +cxl_region *cxlr)
>
>Please separate this into a memdev helper and a region helper. It is silly to have
>two arguments to a function where one is expected to be NULL at all times, and
>then have an if else statement inside that to effectively turn it back into 2 code
>paths.
>
>If there is code to be shared amongst those, make *that* the shared helper.
I added a single function, cxl_mem_ras_features_init(), for both memdev-based and region-based
scrubbing to reduce code size, as there was earlier feedback asking to reduce the code size.
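If the split is preferred, the common part could be factored into a shared helper, roughly as in the userspace sketch below. All types here are minimal stand-ins for the kernel ones, ras_register() is a hypothetical placeholder for the feature setup plus edac_dev_register(), and using the region's own device as the parent is an assumption for illustration only, not a decision made in this series.

```c
#include <stdio.h>

/* Minimal stand-ins for the kernel types; only what the sketch needs. */
struct device { const char *name; };
struct cxl_memdev { struct device dev; };
struct cxl_region { struct device dev; int id; };

/* Records what would be handed to edac_dev_register(). */
struct registration {
	const struct device *parent;
	char name[128];
};

/* Shared part: hypothetical placeholder for setup + edac_dev_register(). */
static int ras_register(const struct device *parent, const char *name,
			struct registration *out)
{
	if (!parent)
		return -1;
	out->parent = parent;
	snprintf(out->name, sizeof(out->name), "%s", name);
	return 0;
}

/* Memdev entry point: no region argument, no NULL to check. */
static int memdev_ras_init(struct cxl_memdev *cxlmd, struct registration *out)
{
	char name[128];

	snprintf(name, sizeof(name), "cxl_%s", cxlmd->dev.name);
	return ras_register(&cxlmd->dev, name, out);
}

/* Region entry point: parent choice is an assumption, see the lead-in. */
static int region_ras_init(struct cxl_region *cxlr, struct registration *out)
{
	char name[128];

	snprintf(name, sizeof(name), "cxl_region%d", cxlr->id);
	return ras_register(&cxlr->dev, name, out);
}
```

With two entry points, neither path has to handle an argument that is always NULL.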
>
>> +{
>> + struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES];
>> + char cxl_dev_name[CXL_DEV_NAME_LEN];
>> + int num_ras_features = 0;
>> + u8 scrub_inst = 0;
>> + int rc;
>> +
>> + rc = cxl_memdev_scrub_init(cxlmd, cxlr,
>&ras_features[num_ras_features],
>> + scrub_inst);
>> + if (rc < 0)
>> + return rc;
>> +
>> + scrub_inst++;
>> + num_ras_features++;
>> +
>> + if (cxlr)
>> + snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> + "cxl_region%d", cxlr->id);
>
>Why not pass dev_name(&cxlr->dev) directly?
Jonathan already replied.
>
>> + else
>> + snprintf(cxl_dev_name, sizeof(cxl_dev_name),
>> + "%s_%s", "cxl", dev_name(&cxlmd->dev));
>
>Can a "cxl" directory be created so that the raw name can be used?
>
>> +
>> + return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL,
>> + num_ras_features, ras_features);
>
>I'm so confused... a few lines down in this patch we have:
>
> rc = cxl_mem_ras_features_init(NULL, cxlr);
>
>...so how can this call to edac_dev_register() unconditionally de-reference
>@cxlmd?
Thanks for spotting this. It is a bug and needs to be fixed. Previously, cxlmd was initialized
for region-based scrubbing inside cxl_mem_ras_features_init(); that initialization has now moved
into cxl_memdev_scrub_init().
Region-based scrubbing needs better testing, because this use case is difficult to run in my
test setup. Will check with Jonathan how to do this.
>
>Are there any tests for this? cxl-test is purpose-built for this kind of basic
>coverage tests.
Will check this.
>
>> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL");
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index b98b1ccffd1c..c2be70cd87f8 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev)
>> p->res->start, p->res->end, cxlr,
>> is_system_ram) > 0)
>> return 0;
>> +
>> + rc = cxl_mem_ras_features_init(NULL, cxlr);
>> + if (rc)
>> + dev_warn(&cxlr->dev, "CXL RAS features init for
>region_id=%d failed\n",
>> + cxlr->id);
>
>There is more to RAS than EDAC memory scrub so this message is misleading. It
>is also unnecessary because the driver continues to load and the admin, if they
>care, will notice that the EDAC attributes are missing.
This message was added for debugging purposes in the CXL driver. I will change it to dev_dbg().
>
>> +
>> return devm_cxl_add_dax_region(cxlr);
>> default:
>> dev_dbg(&cxlr->dev, "unsupported region mode: %d\n", diff --
>git
>> a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h index
>> 55c55685cb39..2b02e47cd7e7 100644
>> --- a/drivers/cxl/cxlmem.h
>> +++ b/drivers/cxl/cxlmem.h
>> @@ -800,6 +800,13 @@ int cxl_trigger_poison_list(struct cxl_memdev
>> *cxlmd); int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa);
>> int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa);
>>
>> +#if IS_ENABLED(CONFIG_CXL_RAS_FEATURES)
>> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct
>> +cxl_region *cxlr); #else static inline int
>> +cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct cxl_region
>> +*cxlr) { return 0; } #endif
>> +
>> #ifdef CONFIG_CXL_SUSPEND
>> void cxl_mem_active_inc(void);
>> void cxl_mem_active_dec(void);
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index
>> 2f03a4d5606e..d236b4b8a93c 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -116,6 +116,10 @@ static int cxl_mem_probe(struct device *dev)
>> if (!cxlds->media_ready)
>> return -EBUSY;
>>
>> + rc = cxl_mem_ras_features_init(cxlmd, NULL);
>> + if (rc)
>> + dev_warn(&cxlmd->dev, "CXL RAS features init failed\n");
>> +
>> /*
>> * Someone is trying to reattach this device after it lost its port
>> * connection (an endpoint port previously registered by this memdev
>> was @@ -259,3 +263,4 @@
>MODULE_ALIAS_CXL(CXL_DEVICE_MEMORY_EXPANDER);
>> * endpoint registration.
>> */
>> MODULE_SOFTDEP("pre: cxl_port");
>> +MODULE_SOFTDEP("pre: cxl_features");
>
>Why?
This dependency is no longer required. It was added during integration testing, when
cxl_features was found not to be initialized by the time the CXL memdev RAS features were being
initialized; that initialization calls the features command function,
cxl_get_supported_feature_entry(), for the RAS features. The actual cause turned out to be
different and has been fixed.
Thanks,
Shiju