Date:	Fri, 17 Apr 2015 21:35:25 -0400
From:	Dan Williams <dan.j.williams@...el.com>
To:	linux-nvdimm@...ts.01.org
Cc:	Boaz Harrosh <boaz@...xistor.com>, Neil Brown <neilb@...e.de>,
	Greg KH <gregkh@...uxfoundation.org>,
	linux-kernel@...r.kernel.org,
	Andy Lutomirski <luto@...capital.net>,
	Jens Axboe <axboe@...com>, "H. Peter Anvin" <hpa@...or.com>,
	Christoph Hellwig <hch@....de>, Ingo Molnar <mingo@...nel.org>
Subject: [PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski <luto@...capital.net>
Cc: Boaz Harrosh <boaz@...xistor.com>
Cc: H. Peter Anvin <hpa@...or.com>
Cc: Jens Axboe <axboe@...com>
Cc: Ingo Molnar <mingo@...nel.org>
Cc: Christoph Hellwig <hch@....de>
Cc: Neil Brown <neilb@...e.de>
Cc: Greg KH <gregkh@...uxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@...el.com>
---
 Documentation/blockdev/nd.txt |  867 +++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                   |   34 +-
 2 files changed, 895 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/nd.txt

diff --git a/Documentation/blockdev/nd.txt b/Documentation/blockdev/nd.txt
new file mode 100644
index 000000000000..bcfdf21063ab
--- /dev/null
+++ b/Documentation/blockdev/nd.txt
@@ -0,0 +1,867 @@
+                 The NFIT-Defined/NVDIMM Sub-system (ND)
+
+      nd - kernel abi / device-model & ndctl - userspace helper library
+                         linux-nvdimm@...ts.01.org
+                            v9: April 17th, 2015
+
+
+  Glossary
+
+  Overview
+    Supporting Documents
+    Git Trees
+
+  NFIT Terminology and NVDIMM Types
+
+  Why BLK?
+    PMEM vs BLK (SPA vs BDW)
+      BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+
+  Example NFIT Diagram
+
+  ND Device Model/ABI and NDCTL API
+    NDCTL: Context
+      ndctl: instantiate a new library context example
+
+    ND/NDCTL: Bus
+      nd: control class device in /sys/class
+      nd: bus layout
+      ndctl: bus enumeration example
+
+    ND/NDCTL: DIMM (NMEM)
+      nd: DIMM (NMEM) layout
+      ndctl: DIMM enumeration example
+
+    ND/NDCTL: Region
+      nd: region layout
+      ndctl: region enumeration example
+      Why Not Encode the Region Type into the Region Name?
+      How Do I Determine the Major Type of a Region?
+
+    ND/NDCTL: Namespace
+      nd: namespace layout
+      ndctl: namespace enumeration example
+      ndctl: namespace creation example
+      Why the Term “namespace”?
+
+    ND/NDCTL: Block Translation Table “btt”
+      nd: btt layout
+      ndctl: btt creation example
+
+  Summary NDCTL Diagram
+
+
+Glossary
+--------
+
+NFIT: NVDIMM Firmware Interface Table
+
+SPA: System Physical Address.  Also refers to an NFIT system-physical
+address table entry describing a contiguous persistent memory range.
+
+DPA: DIMM Physical Address, a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 SPA:DPA association.  Once more DIMMs
+are added an interleave-description-table provided by NFIT is needed to
+decode a SPA to a DPA.
+
+DCR: DIMM Control Region Descriptor, an NFIT sub-table entry conveying
+the vendor, format, revision, and geometry of the related
+block-data-windows.
+
+BDW: Block Data Window Region Descriptor, an NFIT sub-table referenced
+by a DCR locating a set of data transfer apertures and control registers
+in system memory.
+
+PMEM: A Linux block device which provides access to an SPA range. A PMEM
+device is capable of DAX (see below).
+
+DAX: File system extensions to bypass the page cache and block layer to
+map persistent memory, from a PMEM block device, directly into a process
+address space.
+
+BLK: A Linux block device which accesses NVDIMM storage through a BDW
+(block-data-window aperture).  A BLK device is not amenable to DAX.
+
+DSM: Device Specific Method, refers to a runtime service provided by
+platform firmware to send formatted control/configuration messages to a
+DIMM device.  In ACPI this is an _DSM attribute of an object.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+
+
+Overview
+--------
+
+The “NVDIMM Firmware Interface Table” (NFIT) defines a set of tables
+that describe the non-volatile memory resources in a platform.  Platform
+firmware provides this table as well as runtime-services for sending
+control and configuration messages to capable NVDIMM devices.  NFIT is a
+new top-level table in ACPI 6.  The Linux ND subsystem is designed as a
+generic mechanism that can register a binary NFIT from any provider,
+ACPI being just one example of a provider.  The unit test infrastructure
+in the kernel exploits this capability to provide multiple sample NFITs
+via custom test-platform-devices.
+
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer’s Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+
+Git Trees
+ND: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/log/?h=nd
+NDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+NFIT Terminology and NVDIMM Types
+---------------------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single SPA range where writes are expected to be
+durable after a system power loss.  Now, the NFIT specification
+standardizes not only the description of SPA ranges, but also DCR/BDW
+(block-aperture access) and DSM entry points for control/configuration.
+
+
+For each NFIT-defined I/O interface (SPA, DCR/BDW), ND provides a block
+device driver:
+
+
+1. PMEM (nd_pmem.ko): Drives an NFIT system-physical address (SPA)
+   range.  A SPA range is contiguous in system memory and may be
+   interleaved (hardware memory controller striped) across multiple DIMMs.
+   When a SPA is interleaved the NFIT optionally provides descriptions of
+   which DIMMs are participating in the interleave.
+
+   Note, while ND describes SPAs with backing DIMM information
+   (ND_NAMESPACE_PMEM) with a different device-type than SPAs without such
+   a description (ND_NAMESPACE_IO), to nd_pmem there is no distinction.
+   The different device-types are an implementation detail that userspace
+   can exploit to implement policies like “only interface with SPA ranges
+   from certain DIMMs”.
+
+
+2. BLK (nd_blk.ko): This driver performs I/O using a set of DCR/BDW
+   defined apertures.  A set of apertures will all access just one DIMM.
+   Multiple windows allow multiple concurrent accesses, much like
+   tagged-command-queuing, and would likely be used by different threads or
+   different CPUs.
+
+   The NFIT specification defines a standard format for a BDW, but the spec
+   also allows for vendor specific layouts.  As of this writing “nd_blk”
+   only supports the example interface detailed in the “DSM Interface
+   Example”.  If another BDW format arrives in the future this can be added as
+   a new sub-device-type to nd_blk or as a new ND device type with its own
+   driver.
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (reliability,
+availability, and serviceability) model.  An access to a corrupted SPA
+address causes a CPU exception, while an access to a corrupted address
+through a BDW aperture causes that block window to raise an error status
+in a register.  The latter is more aligned with the standard error model
+that host-bus-adapter attached disks present.  Also, if an administrator
+ever wants to replace memory, it is easier to service a system at DIMM
+module boundaries.  Compare this to PMEM where data could be interleaved
+in an opaque hardware specific manner across several DIMMs.
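To illustrate how interleaving spreads a contiguous SPA range across DIMMs, here is a minimal sketch that assumes plain round-robin striping with a fixed line size.  The real decode is driven by the NFIT interleave-description tables, which this sketch does not parse, and it ignores any per-DIMM DPA start offset.  The structure and function names are hypothetical and are not part of ND or NDCTL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one interleaved SPA range. */
struct interleave_set {
	uint64_t spa_base;   /* first system physical address of the range */
	uint64_t line_size;  /* bytes placed on one DIMM before moving on */
	unsigned int ways;   /* number of DIMMs in the interleave set */
};

struct dpa_location {
	unsigned int dimm;   /* index of the DIMM within the set */
	uint64_t dpa;        /* DIMM-relative offset */
};

/* Decode a SPA under the assumed round-robin striping. */
static struct dpa_location spa_to_dpa(const struct interleave_set *set,
		uint64_t spa)
{
	struct dpa_location loc;
	uint64_t offset, line;

	assert(set->ways > 0 && set->line_size > 0 && spa >= set->spa_base);

	offset = spa - set->spa_base;
	line = offset / set->line_size;

	/* consecutive lines rotate across the DIMMs in the set */
	loc.dimm = (unsigned int)(line % set->ways);
	/* each DIMM holds every ways-th line, packed back to back */
	loc.dpa = (line / set->ways) * set->line_size + offset % set->line_size;
	return loc;
}
```

With "ways" set to 1 the function degenerates to the 1:1 SPA:DPA association; with more ways, neighbouring lines land on different DIMMs, which is why a single DIMM cannot be serviced in isolation.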
+
+
+PMEM vs BLK (SPA vs BDW)
+------------------------
+
+BDWs solve this RAS problem, but their presence is also the major
+contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM’s DPA-range may contribute to one or more SPA sets of
+interleaved DIMMs, *and* may also be accessed in its entirety through
+its BDW.  Accessing a DPA through a SPA while simultaneously accessing
+the same DPA through a BDW has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive SPA and BDW accessible regions.  For simplicity a DIMM is
+allowed a PMEM “region” for each interleave set in which it is a member.
+The remaining DPA space can be carved into an arbitrary number of BLK
+devices with discontiguous extents.
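The exclusivity requirement on DPA space can be expressed as an interval check.  The following is a sketch only: the extent structure and helper names are hypothetical, it models a single DIMM, and it does not reflect how the kernel region driver actually reconciles aliased mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DPA extent: [start, start + length) on a single DIMM. */
struct dpa_extent {
	uint64_t start;
	uint64_t length;
};

/* True when two extents on the same DIMM share at least one byte. */
static bool dpa_extents_overlap(const struct dpa_extent *a,
		const struct dpa_extent *b)
{
	/* the sketch assumes extents do not wrap around the address space */
	assert(a->start + a->length >= a->start);
	assert(b->start + b->length >= b->start);

	return a->start < b->start + b->length &&
		b->start < a->start + a->length;
}

/*
 * True when no PMEM extent aliases any BLK extent, i.e. the partitioning
 * keeps SPA-accessed and BDW-accessed DPA ranges exclusive.
 */
static bool dpa_partition_is_exclusive(const struct dpa_extent *pmem,
		size_t nr_pmem, const struct dpa_extent *blk, size_t nr_blk)
{
	size_t i, j;

	for (i = 0; i < nr_pmem; i++)
		for (j = 0; j < nr_blk; j++)
			if (dpa_extents_overlap(&pmem[i], &blk[j]))
				return false;
	return true;
}
```

Note that the BLK side may be an arbitrary list of discontiguous extents; only overlap with a PMEM extent makes a layout invalid under this model.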
+
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+One of the few reasons to allow multiple BLK namespaces per REGION is so
+that each BLK-namespace can be configured with a BTT with unique atomic
+sector sizes.  While a PMEM device can host a BTT, the LABEL
+specification does not provide for a sector size to be specified for a
+PMEM namespace.  This is due to the expectation that the primary usage
+model for PMEM is via DAX, and the BTT is incompatible with DAX.
+However, for the cases where an application or filesystem still needs
+atomic sector update guarantees it can register a BTT on a PMEM device
+or partition.  See ND/NDCTL: Block Translation Table “btt”.
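The idea behind atomic sector updates through an indirection table can be shown with a toy in-memory model.  This is a sketch under strong assumptions: it is not the BTT on-media format, it omits the log and concurrency handling that a real implementation needs, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NR_LBAS     4

/* Toy model: one spare physical block beyond the logical capacity. */
struct toy_btt {
	uint32_t map[NR_LBAS];                   /* logical -> physical block */
	uint32_t free_block;                     /* block not referenced by map */
	uint8_t  data[NR_LBAS + 1][SECTOR_SIZE]; /* physical blocks */
};

static void toy_btt_init(struct toy_btt *btt)
{
	uint32_t i;

	memset(btt, 0, sizeof(*btt));
	for (i = 0; i < NR_LBAS; i++)
		btt->map[i] = i;
	btt->free_block = NR_LBAS;
}

/*
 * Write a sector out of place, then publish it with a single map update.
 * Until the map entry changes, readers still see the old sector intact.
 */
static void toy_btt_write(struct toy_btt *btt, uint32_t lba, const void *buf)
{
	uint32_t old, fresh;

	assert(lba < NR_LBAS);
	old = btt->map[lba];
	fresh = btt->free_block;

	memcpy(btt->data[fresh], buf, SECTOR_SIZE);
	btt->map[lba] = fresh;   /* the assumed atomic commit point */
	btt->free_block = old;   /* the old block becomes the new spare */
}

static const void *toy_btt_read(const struct toy_btt *btt, uint32_t lba)
{
	assert(lba < NR_LBAS);
	return btt->data[btt->map[lba]];
}
```

Because the data is never overwritten in place, an interruption before the map update leaves the previous sector contents visible, and an interruption after it leaves the new contents visible; a torn sector is never exposed in this model.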
+
+
+Example NFIT Diagram
+--------------------
+
+
+For the remainder of this document the following diagram and device
+names will be referenced for the example sysfs layouts.
+
+
+                             (a)               (b)           DIMM   BLK-REGION
+          +-------------------+--------+--------+--------+
++------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+| imc0 +--+- - - region0- - - +--------+        +--------+
++--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+   |      +-------------------+--------v        v--------+
++--+---+                               |                 |
+| cpu0 |                                     region1
++--+---+                               |                 |
+   |      +----------------------------^        ^--------+
++--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+| imc1 +--+----------------------------|        +--------+
++------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+          +----------------------------+--------+--------+
+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+
+1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
+   single PMEM namespace is created in the REGION0-SPA-range that spans
+   DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+   interleaved SPA range is reclaimed as BDW accessed space starting at
+   DPA-offset (a) into each DIMM.  In that reclaimed space we create two
+   BDW "namespaces" from REGION2 and REGION3 where "blk2.0" and "blk3.0"
+   are just human readable names that could be set to any user-desired name
+   in the LABEL.
+
+
+2. In the last portion of DIMM0 and DIMM1 we have an interleaved SPA
+   range, REGION1, that spans those two DIMMs as well as DIMM2 and DIMM3.
+   Some of REGION1 is allocated to a PMEM namespace named "pm1.0"; the rest
+   is reclaimed in 4 BDW namespaces (one for each DIMM in the interleave set),
+   "blk2.1", "blk3.1", "blk4.0", and "blk5.0".
+
+
+3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
+   interleaved SPA range (i.e. the DPA addresses below offset (b)) are also
+   included in the "blk4.0" and "blk5.0" namespaces.  Note that this
+   example shows that BDW namespaces don't need to be contiguous in
+   DPA-space.
+
+This bus is provided by the kernel under the device
+/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+the nfit_test.ko module is loaded.
+
+
+ND Device Model/ABI and NDCTL API
+---------------------------------
+
+What follows is a description of the ND sysfs layout and a corresponding
+object hierarchy diagram as viewed through the NDCTL api.  The example
+sysfs paths and diagrams are relative to the Example NFIT Diagram which
+is also the NFIT used in the “nd/ndctl” unit test.
+
+
+NDCTL: Context
+Every api call in the NDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+ndctl: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+	        return ctx;
+	else
+	        return NULL;
+
+
+ND/NDCTL: Bus
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs, and the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the
+unit test.
+
+nd: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	|-- subsystem -> ../../../../../../../class/nd
+
+
+nd: bus layout
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- revision
+	|-- uevent
+	`-- wait_probe
+
+
+ndctl: bus enumeration example
+
+Find the 'bus' handle that describes the bus from Example NFIT Diagram
+
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+	                const char *provider)
+	{
+	        struct ndctl_bus *bus;
+
+
+	        ndctl_bus_foreach(ctx, bus)
+	                if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+	                        return bus;
+
+
+	        return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+ND/NDCTL: DIMM (NMEM)
+
+The DIMM object identifies the NFIT “handle” and a “phys_id” for a given
+memory device.  The “handle” is derived from the DIMM’s physical
+location (socket, memory-controller, channel, slot).  The “phys_id” is
+used for looking up DIMM details in other platform tables.  The handle
+value is also used to send control/configuration messages via ioctl
+through the “ndctl0” device in the given example.  The kernel id (“N” in
+“DIMMN”) for the device is dynamically assigned.  The “vendor”,
+“device”, “revision” and “format” attributes are optionally available if
+the NFIT publishes a DCR (DIMM-control-region) for the given memory
+device.  These latter attributes are only useful in the presence of a
+vendor-specific DIMM.
+
+
+Note that the kernel device name for “DIMMs” is “nmemX”.  The NFIT
+describes these devices via “Memory Device to System Physical Address
+Range Mapping Structure”, and there is no requirement that they actually
+be DIMMs, so we use a more generic name.
+
+
+nd: DIMM (NMEM) layout
+
+	/sys/devices/platform/nfit_test.0/ndbus0/
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- device
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_dimm
+	|   |-- format
+	|   |-- handle
+	|   |-- modalias
+	|   |-- phys_id
+	|   |-- revision
+	|   |-- serial
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   |-- uevent
+	|   `-- vendor
+	|-- nmem1
+	[..]
+
+ndctl: DIMM enumeration example
+
+Note, DIMMs are identified by an “nfit_handle” which is a 32-bit value
+where:
+
+	Bit 3:0 DIMM number within the memory channel
+	Bit 7:4 memory channel number
+	Bit 11:8 memory controller ID
+	Bit 15:12 socket ID
+	Bit 27:16 Node Controller ID
+	Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+		unsigned int handle)
+	{
+	        struct ndctl_dimm *dimm;
+
+
+	        ndctl_dimm_foreach(bus, dimm)
+	                if (ndctl_dimm_get_handle(dimm) == handle)
+	                        return dimm;
+
+
+	        return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+	        ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
+	         | (((c) & 0xf) << 4) | ((d) & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
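Going the other way, a handle read back from a DIMM can be split into its fields.  This is a small sketch of the inverse of DIMM_HANDLE(), based only on the bit layout listed above; the structure and function names are hypothetical and are not part of the NDCTL API.

```c
#include <stdint.h>

/* Decoded view of a 32-bit nfit_handle, per the bit layout listed above. */
struct nfit_handle_fields {
	unsigned int node;     /* bits 27:16, node controller ID */
	unsigned int socket;   /* bits 15:12, socket ID */
	unsigned int imc;      /* bits 11:8, memory controller ID */
	unsigned int channel;  /* bits 7:4, memory channel number */
	unsigned int dimm;     /* bits 3:0, DIMM number within the channel */
};

/* Split a handle into its fields; reserved bits 31:28 are ignored. */
static struct nfit_handle_fields nfit_handle_decode(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.imc = (handle >> 8) & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}
```

A decoder like this is mainly useful for logging or for policies that select DIMMs by physical location, for example all DIMMs behind one memory controller.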
+
+
+ND/NDCTL: Region
+A generic REGION device is registered for each SPA or DCR/BDW.  Per the
+example there are 6 regions: 2 SPAs and 4 BDWs on the “nfit_test.0” bus.
+The primary role of a region is to be a container of “mappings”.  A
+mapping is a tuple of <DIMM, DPA-start-offset, length>.
+
+The ND core provides a driver for these REGION devices.  This driver is
+responsible for reconciling the aliased mappings across all regions,
+parsing the LABEL, if present, and then emitting “namespace” devices
+with the resolved/exclusive DPA-boundaries for a ND PMEM or BLK device
+driver to consume.
+
+In addition to the generic attributes of “mapping”s, “interleave_ways”
+and “size” the REGION device also exports some convenience attributes.
+“nstype” indicates the integer type of namespace-device this region
+emits, “devtype” duplicates the DEVTYPE variable stored by udev at the
+‘add’ event, “modalias” duplicates the MODALIAS variable stored by udev
+at the ‘add’ event, and finally, the optional “spa_index” is provided in
+the case where the region is defined by a SPA.
+
+nd: region layout
+
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- spa_index
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mapping2
+	|   |-- mapping3
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace1.0
+	|   |-- namespace_seed
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- spa_index
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region2
+	[..]
+
+
+ndctl: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+“spa_index” (interleave set id) for PMEM and “nfit_handle” (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+	                unsigned int spa_index)
+	{
+	        struct ndctl_region *region;
+
+
+	        ndctl_region_foreach(bus, region) {
+	                if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+	                        continue;
+	                if (ndctl_region_get_spa_index(region) == spa_index)
+	                        return region;
+	        }
+	        return NULL;
+	}
+
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+	                unsigned int handle)
+	{
+	        struct ndctl_region *region;
+
+
+	        ndctl_region_foreach(bus, region) {
+	                struct ndctl_mapping *map;
+
+
+	                if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+	                        continue;
+	                ndctl_mapping_foreach(region, map) {
+	                        struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+
+	                        if (ndctl_dimm_get_handle(dimm) == handle)
+	                                return region;
+	                }
+	        }
+	        return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+
+At first glance it seems since NFIT defines just PMEM and BLK interface
+types that we should simply name REGION devices with something derived
+from those type names.  However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for 4 reasons:
+
+1. There are already more than two REGION and “namespace” types.  For
+   PMEM there are two subtypes.  As mentioned previously we have PMEM where
+   the constituent DIMM devices are known and anonymous PMEM.  For BLK
+   regions the NFIT specification already anticipates vendor specific
+   implementations.  The exact distinction of what a region contains is in
+   the region-attributes not the region-name or the region-devtype.
+
+2. A region with zero child-namespaces is a possible configuration.  For
+   example, the NFIT allows for a DCR to be published without a
+   corresponding BDW.  This equates to a DIMM that can only accept
+   control/configuration messages, but no i/o through a descendant block
+   device.  Again, this “type” is advertised in the attributes (‘mappings’
+   == 0) and the name does not tell you much.
+
+3. What if a third major interface type arises in the future?  Outside
+   of vendor specific implementations, it’s not difficult to envision a
+   third class of interface type beyond BLK and PMEM.  With a generic name
+   for the REGION level of the device-hierarchy old userspace
+   implementations can still make sense of new kernel advertised
+   region-types.  Userspace can always rely on the generic region
+   attributes like “mappings”, “size”, etc and the expected child devices
+   named “namespace”.  This generic format of the device-model hierarchy
+   allows the ND and NDCTL implementations to be more uniform and
+   future-proof.
+
+4. There are more robust mechanisms for determining the major type of a
+   region than a device name.  See the next section, How Do I Determine the
+   Major Type of a Region?
+
+
+How Do I Determine the Major Type of a Region?
+
+Outside of the blanket recommendation of “use the ndctl library”, or
+simply looking at the kernel header to decode the “nstype” integer
+attribute, here are some other options.
+
+
+1. module alias lookup:
+   The whole point of region/namespace device type differentiation is to
+   decide which block-device driver will attach to a given ND namespace.
+   One can simply use the modalias to look up the resulting module.  It’s
+   important to note that this method is robust in the presence of a
+   vendor-specific driver down the road.  If a vendor-specific
+   implementation wants to supplant the standard nd_blk driver it can with
+   minimal impact to the rest of ND.
+
+   In fact, a vendor may also want to have a vendor-specific region-driver
+   (outside of nd_region).  For example, if a vendor defined its own LABEL
+   format it would need its own region driver to parse that LABEL and emit
+   the resulting namespaces.  The output from module resolution is more
+   accurate than a region-name or region-devtype.
+
+
+2. udev:
+	The kernel “devtype” is registered in the udev database
+	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+	P: /devices/platform/nfit_test.0/ndbus0/region0
+	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+	E: DEVTYPE=nd_pmem
+	E: MODALIAS=nd:t2
+	E: SUBSYSTEM=nd
+
+
+	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+	P: /devices/platform/nfit_test.0/ndbus0/region4
+	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+	E: DEVTYPE=nd_blk
+	E: MODALIAS=nd:t3
+	E: SUBSYSTEM=nd
+
+
+   ...and is available as a region attribute, but keep in mind that the
+   “devtype” does not indicate sub-type variations and scripts should
+   really consult the other attributes as well.
+
+
+3. type specific attributes:
+   As it currently stands a BDW region will never have a “spa_index”
+   attribute.  A DCR region with a “mappings” value of 0 is, as mentioned
+   above, a DIMM that does not allow I/O.  A PMEM region with a “mappings”
+   value of zero is a simple SPA range.
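For scripts or tools that cannot use the ndctl library, the numeric type can be pulled out of the “modalias” string.  The sketch below assumes the modalias always has the form "nd:t<N>" seen in the udev output above, where type 2 accompanied DEVTYPE=nd_pmem and type 3 accompanied DEVTYPE=nd_blk.  The function name is hypothetical.

```c
#include <stdio.h>

/*
 * Extract the numeric device type from an ND modalias string such as
 * "nd:t2".  Returns the type on success, or -1 if the string does not
 * match the assumed "nd:t<N>" form.
 */
static int nd_modalias_to_type(const char *modalias)
{
	int type;
	char trailing;

	/* Accept one trailing newline, as found when reading a sysfs file. */
	if (sscanf(modalias, "nd:t%d%c", &type, &trailing) == 2)
		return trailing == '\n' ? type : -1;
	/* Otherwise the number must be the last thing in the string. */
	if (sscanf(modalias, "nd:t%d", &type) == 1)
		return type;
	return -1;
}
```

The returned integer still has to be interpreted against the kernel header, so this complements rather than replaces checking the other region attributes.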
+
+
+ND/NDCTL: Namespace
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more “namespace” devices.  The arrival of a “namespace”
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+
+nd: namespace layout
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a ‘uuid’
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+‘sector_size’ attribute), and namespace6.0 represents an anonymous
+PMEM namespace (note that it has no ‘uuid’ attribute because it does not
+support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+
+ndctl: namespace enumeration example
+Namespaces are indexed relative to their parent region; see the example
+below.  These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard.  For a static namespace
+identifier use its ‘uuid’ attribute.
+
+	static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+	                unsigned int id)
+	{
+	        struct ndctl_namespace *ndns;
+
+
+	        ndctl_namespace_foreach(region, ndns)
+	                if (ndctl_namespace_get_id(ndns) == id)
+	                        return ndns;
+
+
+	        return NULL;
+	}
+
+
+ndctl: namespace creation example
+
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the setting of namespace attributes
+can occur in any order, the only constraint is that ‘uuid’ must be set
+before ‘size’.  This enables the kernel to track DPA allocations
+internally with a static identifier.
+
+
+	static int configure_namespace(struct ndctl_region *region,
+	                struct ndctl_namespace *ndns,
+	                struct namespace_parameters *parameters)
+	{
+	        char devname[50];
+
+
+	        snprintf(devname, sizeof(devname), "namespace%d.%d",
+	                        ndctl_region_get_id(region), parameters->id);
+
+
+	        ndctl_namespace_set_alt_name(ndns, devname);
+	        /* ‘uuid’ must be set prior to setting size! */
+	        ndctl_namespace_set_uuid(ndns, parameters->uuid);
+	        ndctl_namespace_set_size(ndns, parameters->size);
+	        /* unlike pmem namespaces, blk namespaces have a sector size */
+	        if (parameters->lbasize)
+	                ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+	        /* propagate the result of enabling the namespace to the caller */
+	        return ndctl_namespace_enable(ndns);
+	}
+
+Why the Term “namespace”?
+1. Why not “volume” for instance?  “volume” ran the risk of confusing ND
+   as a volume manager like device-mapper.
+
+
+2. The term originated to describe the sub-devices that can be created
+   within a NVME controller (see the nvme specification:
+   http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+   meant to parallel the capabilities and configurability of
+   NVME-namespaces.
+
+
+ND/NDCTL: Block Translation Table “btt”
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
+
+
+nd: btt layout
+Every bus will start out with at least one BTT device which is the seed
+device.  To activate it set the “backing_dev”, “uuid”, and “sector_size”
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.1/ndbus0/btt0/
+	├── backing_dev
+	├── delete
+	├── devtype
+	├── modalias
+	├── sector_size
+	├── subsystem -> ../../../../../bus/nd
+	├── uevent
+	└── uuid
+
+ndctl: btt creation example
+
+Similar to namespaces an idle BTT device is automatically created per
+bus.  Each time this “seed” btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+	        struct ndctl_btt *btt;
+
+
+	        ndctl_btt_foreach(bus, btt)
+	                if (!ndctl_btt_is_enabled(btt) && !ndctl_btt_is_configured(btt))
+	                        return btt;
+
+
+	        return NULL;
+	}
+
+	static int configure_btt(struct ndctl_bus *bus, struct btt_parameters *parameters)
+	{
+	        struct ndctl_btt *btt = get_idle_btt(bus);
+	        char bdevpath[50];
+
+
+	        /* no idle seed device is available on this bus */
+	        if (!btt)
+	                return -ENXIO;
+
+
+	        snprintf(bdevpath, sizeof(bdevpath), "/dev/%s",
+	                        ndctl_namespace_get_block_device(parameters->ndns));
+	        ndctl_btt_set_uuid(btt, parameters->uuid);
+	        ndctl_btt_set_sector_size(btt, parameters->sector_size);
+	        ndctl_btt_set_backing_dev(btt, bdevpath);
+	        return ndctl_btt_enable(btt);
+	}
+
+
+Once instantiated, an “nd_btt” link will be created under the
+“backing_dev” (pmem0) block device:
+
+	/sys/block/pmem0/
+	├── alignment_offset
+	├── bdi -> ../../../../../../../virtual/bdi/259:0
+	├── capability
+	├── dev
+	├── device -> ../../../namespace0.0
+	├── discard_alignment
+	├── ext_range
+	├── holders
+	├── inflight
+	└── nd_btt -> ../../../../btt0
+
+
+...and a new inactive seed device will appear on the bus.
+
+
+Once a “backing_dev” is disabled its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT the “info block” needs to be destroyed.
+
+
+Summary NDCTL Diagram
+---------------------
+
+For the given example above, here is the view of the objects as seen by
+the NDCTL api:
+            +---+
+            |CTX|    +---------+   +--------------+  +---------------+
+            +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
+              |    | +---------+   +--------------+  +---------------+
++-------+     |    | +---------+   +--------------+  +---------------+
+| DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
++-------+ |   |    | +---------+   +--------------+  +---------------+
+| DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
++-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
+| DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
++-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
+| DIMM3 <-+        |               +--------------+  +----------------------+
++-------+          | +---------+   +--------------+  +---------------+
+                   +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
+                   | +---------+ | +--------------+  +----------------------+
+                   |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
+                   |               +--------------+  +----------------------+
+                   | +---------+   +--------------+  +---------------+
+                   +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
+                   | +---------+   +--------------+  +---------------+
+                   | +---------+   +--------------+  +----------------------+
+                   +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
+                     +---------+   +--------------+  +---------------+------+
diff --git a/MAINTAINERS b/MAINTAINERS
index 4517613dc638..6bc0af450544 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6666,6 +6666,34 @@ S:	Maintained
 F:	Documentation/hwmon/nct6775
 F:	drivers/hwmon/nct6775.c
 
+ND (NFIT-DEFINED/NVDIMM SUBSYSTEM)
+M:	Dan Williams <dan.j.williams@...el.com>
+L:	linux-nvdimm@...ts.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/*
+F:	include/linux/nd.h
+F:	include/uapi/linux/ndctl.h
+
+ND BLOCK APERTURE DRIVER
+M:	Ross Zwisler <ross.zwisler@...ux.intel.com>
+L:	linux-nvdimm@...ts.01.org
+S:	Supported
+F:	drivers/block/nd/blk.c
+F:	drivers/block/nd/region_devs.c
+
+ND BLOCK TRANSLATION TABLE
+M:	Vishal Verma <vishal.verma@...ux.intel.com>
+L:	linux-nvdimm@...ts.01.org
+S:	Supported
+F:	drivers/block/nd/btt*
+
+ND PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@...ux.intel.com>
+L:	linux-nvdimm@...ts.01.org
+S:	Supported
+F:	drivers/block/nd/pmem.c
+
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif <faisal.latif@...el.com>
 L:	linux-rdma@...r.kernel.org
@@ -8071,12 +8099,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler <ross.zwisler@...ux.intel.com>
-L:	linux-nvdimm@...ts.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@....edu>
 S:	Maintained
