Message-ID: <6942201839852_1cee10044@dwillia2-mobl4.notmuch>
Date: Tue, 16 Dec 2025 19:14:32 -0800
From: <dan.j.williams@...el.com>
To: Michał Cłapiński <mclapinski@...gle.com>,
<dan.j.williams@...el.com>
CC: Mike Rapoport <rppt@...nel.org>, Ira Weiny <ira.weiny@...el.com>, "Dave
Jiang" <dave.jiang@...el.com>, Vishal Verma <vishal.l.verma@...el.com>,
<jane.chu@...cle.com>, Pasha Tatashin <pasha.tatashin@...een.com>, "Tyler
Hicks" <code@...icks.com>, <linux-kernel@...r.kernel.org>,
<nvdimm@...ts.linux.dev>
Subject: Re: [PATCH 1/1] nvdimm: allow exposing RAM carveouts as NVDIMM DIMM
devices
Michał Cłapiński wrote:
[..]
> > Sure, then make it 1280K of label space. There's no practical limit in
> > the implementation.
>
> Hi Dan,
> I just had the time to try this out. So I modified the code to
> increase the label space to 2M and I was able to create the
> namespaces. It put the metadata in volatile memory.
>
> But the infoblocks are still within the namespaces, right? If I try to
> create a 3G namespace with alignment set to 1G, its actual usable size
> is 2G. So I can't divide the whole pmem into 1G devices with 1G
> alignment.
Ugh, yes, I failed to predict that outcome.
> If I modify the code to remove the infoblocks, the namespace mode
> won't be persistent, right? In my solution I get that information from
> the kernel command line, so I don't need the infoblocks.
So, I dislike the command line option ABI expansion proposal enough to
invest some time to find an alternative. One observation is that the
label is able to indicate the namespace mode independent of an
info-block. The info-block is only really needed when deciding whether
and how much space to reserve to allocate 'struct page' metadata.
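To make the capacity loss concrete, here is a rough sketch of the
arithmetic, not the kernel's actual code. It assumes a simplified model
in which an on-media info-block forces the data area to start at the
first 'align' boundary past a small reserved header, while a synthesized
info-block leaves the data offset at 0. The 8K header figure and the
helper names (align_up, usable_size) are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_8K ((uint64_t)8 << 10)
#define SZ_1G ((uint64_t)1 << 30)

/* Round x up to the next multiple of a (a must be non-zero). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) / a * a;
}

/*
 * Hypothetical model of usable namespace capacity.
 *
 * With an on-media info-block the data area starts at the first
 * 'align' boundary after the reserved header (assumed 8K here).
 * With a virtual (synthesized) info-block nothing is stored in the
 * namespace, so the data area starts at offset 0.
 */
static uint64_t usable_size(uint64_t ns_size, uint64_t align,
			    bool virt_infoblock)
{
	uint64_t data_offset = virt_infoblock ? 0 : align_up(SZ_8K, align);

	return ns_size > data_offset ? ns_size - data_offset : 0;
}
```

Under this model usable_size(3 * SZ_1G, SZ_1G, false) comes out to 2G,
which matches the result reported above, and the same call with
virt_infoblock set to true returns the full 3G.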
-- 8< --
>From 4f44cbb6e3bd4cac9481bdd4caf28975a4f1e471 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@...el.com>
Date: Mon, 15 Dec 2025 17:10:04 -0800
Subject: [PATCH] nvdimm: Allow fsdax and devdax namespace modes without
info-blocks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Michał reports that the new ramdax facility does not meet his need to
carve large reservations of memory into multiple 1GB-aligned
namespaces/volumes. While ramdax solves the problem of in-memory
description of the volume layout, the nvdimm "infoblocks" eat capacity and
destroy alignment properties.
The infoblock serves 2 purposes. It indicates whether the namespace should
operate in fsdax or devdax mode, which Michał needs, and it optionally
reserves namespace capacity for storing 'struct page' metadata, which
Michał does not need. It turns out the mode information is already recorded
in the namespace label, and if no reservation is needed for 'struct page'
metadata then infoblock settings can just be hard coded.
Introduce a new ND_REGION_VIRT_INFOBLOCK flag for ramdax to indicate that
all infoblocks be synthesized and not consume any capacity from the
namespace.
With that, ramdax can create a full-sized namespace:
$ ndctl create-namespace -r region0 -s 1G -a 1G -M mem
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":"1024.00 MiB (1073.74 MB)",
"uuid":"c48c4991-86af-4de1-8c7c-8919358df1f9",
"sector_size":512,
"align":1073741824,
"blockdev":"pmem0"
}
Note that the uuid is not persisted so the "raw_uuid" in the label will be
the method to identify the namespace:
<after disable/enable region>
$ ndctl list -vu
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":"1024.00 MiB (1073.74 MB)",
"uuid":"00000000-0000-0000-0000-000000000000",
"raw_uuid":"1526a1df-d1ec-40e3-91e8-547f1ad9949d",
"sector_size":512,
"align":1073741824,
"blockdev":"pmem0",
"numa_node":0,
"target_node":0
}
Also note that the align is hard coded to (PUD) 1G. That is probably fine
for now unless and until someone comes up with a case for making that
setting configurable.
Lastly, the kernel will complain if "-a 1G -M mem" are not specified to
"ndctl create-namespace" as the kernel still enforces that the live
settings specified at configuration time match the "virtual" infoblock.
Cc: Michał Cłapiński <mclapinski@...gle.com>
Cc: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: Mike Rapoport <rppt@...nel.org>
Cc: Ira Weiny <ira.weiny@...el.com>
Signed-off-by: Dan Williams <dan.j.williams@...el.com>
---
include/linux/libnvdimm.h | 3 ++
drivers/nvdimm/pfn_devs.c | 58 +++++++++++++++++++++++++++++++++++++--
drivers/nvdimm/ramdax.c | 1 +
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 28f086c4a187..c79efc49dd24 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -70,6 +70,9 @@ enum {
/* Region was created by CXL subsystem */
ND_REGION_CXL = 4,
+ /* Virtual info-block mode (no writeback / storage reservation) */
+ ND_REGION_VIRT_INFOBLOCK = 5,
+
/* mark newly adjusted resources as requiring a label update */
DPA_RESOURCE_ADJUSTED = 1 << 0,
};
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 42b172fc5576..68a998fe20a7 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -428,6 +428,50 @@ static bool nd_supported_alignment(unsigned long align)
return false;
}
+static int nd_pfn_virt_init(struct nd_pfn *nd_pfn, const char *sig)
+{
+ struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
+ struct nd_namespace_common *ndns = nd_pfn->ndns;
+
+ switch (ndns->claim_class) {
+ case NVDIMM_CCLASS_PFN:
+ if (memcmp(sig, PFN_SIG, PFN_SIG_LEN) != 0)
+ return -ENODEV;
+ break;
+ case NVDIMM_CCLASS_DAX:
+ if (memcmp(sig, DAX_SIG, PFN_SIG_LEN) != 0)
+ return -ENODEV;
+ break;
+ default:
+ return -ENODEV;
+ }
+
+ *pfn_sb = (struct nd_pfn_sb) {
+ .version_major = cpu_to_le16(1),
+ .version_minor = cpu_to_le16(4),
+ .mode = cpu_to_le32(PFN_MODE_RAM),
+ .align = cpu_to_le32(HPAGE_PUD_SIZE),
+ .page_size = cpu_to_le32(PAGE_SIZE),
+ .page_struct_size = cpu_to_le16(sizeof(struct page)),
+ };
+ memcpy(pfn_sb->signature, sig, PFN_SIG_LEN);
+
+ /*
+ * Virtual infoblock uuids do not persist, but match the live setting in
+ * the validation case. The @align and @mode settings are fixed for the
+ * virtual case, validation will enforce that they match.
+ */
+ if (nd_pfn->uuid)
+ memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
+ memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
+ pfn_sb->checksum = cpu_to_le64(nd_sb_checksum((struct nd_gen_sb *) pfn_sb));
+
+ dev_dbg(&nd_pfn->dev, "virtual %s infoblock for %s\n", sig,
+ dev_name(&ndns->dev));
+
+ return 0;
+}
+
/**
* nd_pfn_validate - read and validate info-block
* @nd_pfn: fsdax namespace runtime state / properties
@@ -448,6 +492,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
struct nd_namespace_common *ndns = nd_pfn->ndns;
const uuid_t *parent_uuid = nd_dev_to_uuid(&ndns->dev);
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
if (!pfn_sb || !ndns)
return -ENODEV;
@@ -455,8 +500,14 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (!is_memory(nd_pfn->dev.parent))
return -ENODEV;
- if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0))
+ if (test_bit(ND_REGION_VIRT_INFOBLOCK, &nd_region->flags)) {
+ int rc = nd_pfn_virt_init(nd_pfn, sig);
+
+ if (rc)
+ return rc;
+ } else if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0)) {
return -ENXIO;
+ }
if (memcmp(pfn_sb->signature, sig, PFN_SIG_LEN) != 0)
return -ENODEV;
@@ -694,7 +745,10 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
};
pgmap->nr_range = 1;
if (nd_pfn->mode == PFN_MODE_RAM) {
- if (offset < reserve)
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+
+ if (!test_bit(ND_REGION_VIRT_INFOBLOCK, &nd_region->flags) &&
+ offset < reserve)
return -EINVAL;
nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
} else if (nd_pfn->mode == PFN_MODE_PMEM) {
diff --git a/drivers/nvdimm/ramdax.c b/drivers/nvdimm/ramdax.c
index 954cb7919807..992346390086 100644
--- a/drivers/nvdimm/ramdax.c
+++ b/drivers/nvdimm/ramdax.c
@@ -60,6 +60,7 @@ static int ramdax_register_region(struct resource *res,
ndr_desc.num_mappings = 1;
ndr_desc.mapping = &mapping;
ndr_desc.nd_set = nd_set;
+ set_bit(ND_REGION_VIRT_INFOBLOCK, &ndr_desc.flags);
if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
goto err_free_nd_set;
--
2.51.1