Message-ID: <8f8d69fa-f6d4-462e-b476-b812d4138c2a@intel.com>
Date: Tue, 3 Sep 2024 14:37:18 +0800
From: "Li, Ming4" <ming4.li@...el.com>
To: <ira.weiny@...el.com>, Dave Jiang <dave.jiang@...el.com>, Fan Ni
<fan.ni@...sung.com>, Jonathan Cameron <Jonathan.Cameron@...wei.com>,
"Navneet Singh" <navneet.singh@...el.com>, Chris Mason <clm@...com>, Josef
Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>, Petr Mladek
<pmladek@...e.com>, Steven Rostedt <rostedt@...dmis.org>, Andy Shevchenko
<andriy.shevchenko@...ux.intel.com>, Rasmus Villemoes
<linux@...musvillemoes.dk>, Sergey Senozhatsky <senozhatsky@...omium.org>,
Jonathan Corbet <corbet@....net>, Andrew Morton <akpm@...ux-foundation.org>
CC: Dan Williams <dan.j.williams@...el.com>, Davidlohr Bueso
<dave@...olabs.net>, Alison Schofield <alison.schofield@...el.com>, "Vishal
Verma" <vishal.l.verma@...el.com>, <linux-btrfs@...r.kernel.org>,
<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-doc@...r.kernel.org>, <nvdimm@...ts.linux.dev>
Subject: Re: [PATCH v3 18/25] cxl/extent: Process DCD events and realize
region extents
On 8/16/2024 10:44 PM, ira.weiny@...el.com wrote:
> From: Navneet Singh <navneet.singh@...el.com>
>
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and metadata for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled: Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists, the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time. Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
> Force removal is intended as a mechanism between the FM and the device,
> to be used only when the host is unresponsive, out of sync, or
> otherwise broken. Purposely ignore force removal events.
>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving, a device extent forms a 1:1 relationship with the
> region extent. Immediately surface a region extent upon getting a
> device extent.
>
> Per the specification the device is allowed to offer or remove extents
> at any time. However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
>
> Simplify extent tracking with the following restrictions.
>
> 1) Flag for removal any extent which overlaps a requested
> release range.
> 2) Refuse the offer of extents which overlap already accepted
> memory ranges.
> 3) Accept again a range which has already been accepted by the
> host. (It is likely the device has an error because it
> should already know that this range was accepted. But from
> the host point of view it is safe to acknowledge that
> acceptance again.)
>
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer. Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
>
> Process DCD events and create region devices.
>
> Signed-off-by: Navneet Singh <navneet.singh@...el.com>
> Co-developed-by: Ira Weiny <ira.weiny@...el.com>
> Signed-off-by: Ira Weiny <ira.weiny@...el.com>
>
> ---
> Changes:
> [iweiny: combine this with the extent surface patches to better show the
> lifetime extent objects in review]
> [iweiny: clean up commit message.]
> [iweiny: move extent verification of the 'read extents on region
> creation' to this patch]
> [iweiny: Provide for a common path for extent realization between an add
> event and adding existing extents.]
> [iweiny: Persist a check that an extent is within an endpoint decoder]
> [iweiny: reduce exported and non-static calls]
> [iweiny: use %par]
>
> <Combined comments from the old patches which were addressed>
>
> [Jonathan: implement the more bit with a simple algorithm which accepts
> all extents it can.
> Also include the response more bit to prevent payload
> overflow]
> [Fan: Do not error if a contained extent is added.]
> [Jonathan: allocate ida after kzalloc]
> [iweiny: fix ida resource leak]
> [fan/djiang: remove unneeded memset]
> [djiang: fix indentation]
> [Jonathan: Fix indentation]
> [Jonathan/djbw: make tag a uuid]
> [djbw: create helper calc_hpa_range() straight away]
> [djbw: Allow for multiple cxled_extents per region_extent]
> [djbw: s/cxl_ed/cxled]
> [djbw: s/cxl_release_ed_extent/cxled_release_extent/]
> [djbw: s/reg_ext/region_extent/]
> [djbw: s/dc_extent/extent/]
> [Gregory/djbw: reject shared extents]
> [iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
> ---
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/core.h | 13 ++
> drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 268 ++++++++++++++++++++++++++++++++++-
> drivers/cxl/core/region.c | 6 +
> drivers/cxl/cxl.h | 52 ++++++-
> drivers/cxl/cxlmem.h | 26 ++++
> include/linux/cxl-event.h | 32 +++++
> tools/testing/cxl/Kbuild | 3 +-
> 9 files changed, 743 insertions(+), 4 deletions(-)
[...]
> +
> +static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> + struct cxl_endpoint_decoder *cxled,
> + struct range *new_range)
> +{
> + struct device *extent_device;
> + struct match_data md = {
> + .cxled = cxled,
> + .new_range = new_range,
> + };
> +
> + extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);
Would it be better to use __free(put_device) here so the put_device(extent_device) below can be dropped?
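
E.g. something like this (untested sketch, just to show the shape; all names are from the patch):

static bool extents_contain(struct cxl_dax_region *cxlr_dax,
			    struct cxl_endpoint_decoder *cxled,
			    struct range *new_range)
{
	struct match_data md = {
		.cxled = cxled,
		.new_range = new_range,
	};

	/* put_device() runs automatically when extent_device goes out of scope */
	struct device *extent_device __free(put_device) =
		device_find_child(&cxlr_dax->dev, &md, match_contains);

	return extent_device != NULL;
}
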
> + if (!extent_device)
> + return false;
> +
> + put_device(extent_device);
> + return true;
> +}
> +
> +static int match_overlaps(struct device *dev, void *data)
> +{
> + struct region_extent *region_extent = to_region_extent(dev);
> + struct match_data *md = data;
> + struct cxled_extent *entry;
> + unsigned long index;
> +
> + if (!region_extent)
> + return 0;
> +
> + xa_for_each(®ion_extent->decoder_extents, index, entry) {
> + if (md->cxled == entry->cxled &&
> + range_overlaps(&entry->dpa_range, md->new_range))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> + struct cxl_endpoint_decoder *cxled,
> + struct range *new_range)
> +{
> + struct device *extent_device;
> + struct match_data md = {
> + .cxled = cxled,
> + .new_range = new_range,
> + };
> +
> + extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);
Same as above.
> + if (!extent_device)
> + return false;
> +
> + put_device(extent_device);
> + return true;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dax_region *cxlr_dax,
> + struct range *dpa_range,
> + struct range *hpa_range)
> +{
> + resource_size_t dpa_offset, hpa;
> +
> + dpa_offset = dpa_range->start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> + struct region_extent *region_extent = to_region_extent(dev);
> + struct range *region_hpa_range = data;
> +
> + if (!region_extent)
> + return 0;
> +
> + /*
> + * Any extent which 'touches' the released range is removed.
> + */
> + if (range_overlaps(region_hpa_range, ®ion_extent->hpa_range)) {
> + dev_dbg(dev, "Remove region extent HPA %par\n",
> + ®ion_extent->hpa_range);
> + region_rm_extent(region_extent);
> + }
> + return 0;
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> + u64 start_dpa = le64_to_cpu(extent->start_dpa);
> + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> + struct cxl_endpoint_decoder *cxled;
> + struct range hpa_range, dpa_range;
> + struct cxl_region *cxlr;
> +
> + dpa_range = (struct range) {
> + .start = start_dpa,
> + .end = start_dpa + le64_to_cpu(extent->length) - 1,
> + };
> +
> + guard(rwsem_read)(&cxl_region_rwsem);
> + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> + if (!cxlr) {
> + memdev_release_extent(mds, &dpa_range);
> + return -ENXIO;
> + }
> +
> + calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> + /* Remove region extents which overlap */
> + return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> + cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> + struct cxl_endpoint_decoder *cxled,
> + struct cxled_extent *ed_extent)
> +{
> + struct region_extent *region_extent;
> + struct range hpa_range;
> + int rc;
> +
> + calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> + region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> + if (IS_ERR(region_extent))
> + return PTR_ERR(region_extent);
> +
> + rc = xa_insert(®ion_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> + GFP_KERNEL);
> + if (rc) {
> + free_region_extent(region_extent);
> + return rc;
> + }
> +
> + /* device model handles freeing region_extent */
> + return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> + u64 start_dpa = le64_to_cpu(extent->start_dpa);
> + struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> + struct cxl_endpoint_decoder *cxled;
> + struct range ed_range, ext_range;
> + struct cxl_dax_region *cxlr_dax;
> + struct cxled_extent *ed_extent;
> + struct cxl_region *cxlr;
> + struct device *dev;
> +
> + ext_range = (struct range) {
> + .start = start_dpa,
> + .end = start_dpa + le64_to_cpu(extent->length) - 1,
> + };
> +
> + guard(rwsem_read)(&cxl_region_rwsem);
> + cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> + if (!cxlr)
> + return -ENXIO;
> +
> + cxlr_dax = cxled->cxld.region->cxlr_dax;
> + dev = &cxled->cxld.dev;
> + ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> + cxled->dpa_res, &ext_range);
> +
> + if (!range_contains(&ed_range, &ext_range)) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> + &ext_range.start, CXL_EXTENT_TAG_LEN,
> + extent->tag, &ed_range);
> + return -ENXIO;
> + }
> +
> + if (extents_contain(cxlr_dax, cxled, &ext_range))
> + return 0;
> +
> + if (extents_overlap(cxlr_dax, cxled, &ext_range))
> + return -ENXIO;
> +
> + ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> + if (!ed_extent)
> + return -ENOMEM;
> +
> + ed_extent->cxled = cxled;
> + ed_extent->dpa_range = ext_range;
> + memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> + dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> + CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> + return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_extent *extent)
> +{
> + u64 start = le64_to_cpu(extent->start_dpa);
> + u64 length = le64_to_cpu(extent->length);
> + struct device *dev = mds->cxlds.dev;
> +
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> +
> + if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> + dev_err_ratelimited(dev,
> + "DC extent DPA %par (%*phC) can not be shared\n",
> + &ext_range.start, CXL_EXTENT_TAG_LEN,
> + extent->tag);
> + return -ENXIO;
> + }
> +
> + /* Extents must not cross DC region boundaries */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> + struct range region_range = (struct range) {
> + .start = dcr->base,
> + .end = dcr->base + dcr->decode_len - 1,
> + };
> +
> + if (range_contains(®ion_range, &ext_range)) {
> + dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> + &ext_range, i, start - dcr->base,
> + CXL_EXTENT_TAG_LEN, extent->tag);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %par (%*phC) is not in any DC region\n",
> + &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> + return -ENXIO;
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> + struct xarray *extent_array, int cnt)
> +{
> + struct cxl_mbox_dc_response *p;
> + struct cxl_mbox_cmd mbox_cmd;
> + struct cxl_extent *extent;
> + unsigned long index;
> + u32 pl_index;
> + int rc = 0;
> +
> + size_t pl_size = struct_size(p, extent_list, cnt);
> + u32 max_extents = cnt;
> +
> + /* May have to use more bit on response. */
> + if (pl_size > mds->payload_size) {
> + max_extents = (mds->payload_size - sizeof(*p)) /
> + sizeof(struct updated_extent_list);
> + pl_size = struct_size(p, extent_list, max_extents);
> + }
> +
> + struct cxl_mbox_dc_response *response __free(kfree) =
> + kzalloc(pl_size, GFP_KERNEL);
> + if (!response)
> + return -ENOMEM;
> +
> + pl_index = 0;
> + xa_for_each(extent_array, index, extent) {
> +
> + response->extent_list[pl_index].dpa_start = extent->start_dpa;
> + response->extent_list[pl_index].length = extent->length;
> + pl_index++;
> + response->extent_list_size = cpu_to_le32(pl_index);
> +
> + if (pl_index == max_extents) {
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = struct_size(response, extent_list,
> + pl_index),
> + .payload_in = response,
> + };
> +
> + response->flags = 0;
> + if (pl_index < cnt)
> + response->flags &= CXL_DCD_EVENT_MORE;
Should this be "response->flags |= CXL_DCD_EVENT_MORE"?
There also seems to be a bug when 'cnt' is a multiple of 'max_extents' (e.g. double it): the response command is sent from inside this xa_for_each() loop for each full batch, and CXL_DCD_EVENT_MORE is set every time, including on the final batch, because 'pl_index < cnt' is always true here after pl_index is reset.
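
One possible fix (untested sketch; 'processed' is a new local introduced only for illustration) is to track how many extents have been consumed overall and set the MORE flag only when some remain:

	unsigned int processed = 0;

	xa_for_each(extent_array, index, extent) {
		response->extent_list[pl_index].dpa_start = extent->start_dpa;
		response->extent_list[pl_index].length = extent->length;
		pl_index++;
		processed++;
		response->extent_list_size = cpu_to_le32(pl_index);

		if (pl_index == max_extents) {
			mbox_cmd = (struct cxl_mbox_cmd) {
				.opcode = opcode,
				.size_in = struct_size(response, extent_list,
						       pl_index),
				.payload_in = response,
			};

			response->flags = 0;
			/* Only set MORE when extents remain after this batch */
			if (processed < cnt)
				response->flags |= CXL_DCD_EVENT_MORE;

			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
			if (rc)
				return rc;
			pl_index = 0;
		}
	}
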
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc)
> + return rc;
> + pl_index = 0;
> + }
> + }
> +
> + if (pl_index) {
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = struct_size(response, extent_list,
> + pl_index),
> + .payload_in = response,
> + };
> +
> + response->flags = 0;
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + }
> +
> + return rc;
> +}
> +