[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1378398821.3768.116.camel@Virt-Centos-6.lm.intel.com>
Date: Thu, 05 Sep 2013 10:33:41 -0600
From: Rob Gittins <rob.gittins@...ux.intel.com>
To: Jeff Moyer <jmoyer@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-fsdevel@...er.org,
linux-pmfs@...ts.infradead.org
Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
Hi Jeff,
Thanks for taking the time to look at this.
On Thu, 2013-09-05 at 08:12 -0400, Jeff Moyer wrote:
> Rob Gittins <rob.gittins@...ux.intel.com> writes:
>
> > Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> > and are accessed via load and store instructions. These NVDIMMs
> > are part of the system physical address space (SPA) as memory with
> > the attribute that data survives a power interruption. As such this
> > memory is managed by the kernel which can assign virtual addresses and
> > mapped into application’s address space as well as being accessible
> > by the kernel. The area mapped into the system address space is
> > being referred to as persistent memory (PMEM).
> >
> > PMEM introduces the need for new operations in the
> > block_device_operations to support the specific characteristics of
> > the media.
> >
> > First data may not propagate all the way through the memory pipeline
> > when store instructions are executed. Data may stay in the CPU cache
> > or in other buffers in the processor and memory complex. In order to
> > ensure the durability of data there needs to be a driver entry point
> > to force a byte range out to media. The methods of doing this are
> > specific to the PMEM technology and need to be handled by the driver
> > that is supporting the DMMDs. To provide a way to ensure that data is
> > durable adding a commit function to the block_device_operations vector.
>
> If the memory is available to be mapped into the address space of the
> kernel or a user process, then I don't see why we should have a block
> device at all. I think it would make more sense to have a different
> driver class for these persistent memory devices.
The reason to include block device is to allow existing file systems
to be used with NV-DIMMs. Assuming that NV-DIMMs are approximately
the same speed of DRAM it would mean a block IO would happen in
approximately 1uS. This would make for a really fast existing
filesystem.
>
> > void (*commitpmem)(struct block_device *bdev, void *addr);
>
> Seems like this would benefit from a length argument as well, no?
Yes. Great point. I will add that in.
>
> > Another area requiring extension is the need to be able to clear PMEM
> > errors. When data is fetched from errored PMEM it is marked with the
> > poison attribute. If the CPU attempts to access the data it causes a
> > machine check. How errors are cleared is hardware dependent and needs
> > to be handled by the specific device driver. The following function
> > in the block_device_operations vector would clear the correct range
> > of PMEM and put new data there. If the argument data is null or the
> > size is zero the driver is free to put any data in PMEM it wishes.
> >
> > void (*clearerrorpmem)(struct block_device *bdev, void *addr,
> > size_t len, void *data);
> What is the size of data?
clearerrorpmem as part of the process of clearing an error
can effectively write a buffer of data as part of the
clear process. If the len is zero or the data pointer is null then
only a error clear happens.
> > Different applications, filesystem and drivers may wish to share
> > ranges of PMEM. This is analogous to partitioning a disk that is
> > using multiple and different filesystems. Since PMEM is addressed
> > on a byte basis rather than a block basis the existing partitioning
> > model does not fit well. As a result there needs to be a way to
> > describe PMEM ranges.
> >
> > struct pmem_layout *(*getpmem)(struct block_device *bdev);
>
> If existing partitioning doesn't work well, then it sounds like a block
> device isn't the right fit (again). Ignoring that detail, what about
> requesting and releasing ranges of persistent memory, as in your
> "partitioning" example? Would that not also be a function of the
> driver?
The existing partitioning mechanism was intended for small drives
and works best for a single fs/device. We are approaching NV-DIMMs
as if they were more like LUNs in storage arrays. Each range is
treated as a device. A range of an NV-DIMM could be partitioned if
someone wanted to do such a thing.
Thanks,
Rob
>
> Cheers,
> Jeff
>
> >
> >
> > Proposed patch.
> >
> > ---
> > Documentation/filesystems/Locking | 6 ++++
> > fs/block_dev.c | 42
> > +++++++++++++++++++++++++++++++++
> > include/linux/blkdev.h | 4 +++
> > include/linux/pmem.h | 47
> > +++++++++++++++++++++++++++++++++++++
> > 4 files changed, 99 insertions(+), 0 deletions(-)
> > create mode 100644 include/linux/pmem.h
> >
> > diff --git a/Documentation/filesystems/Locking
> > b/Documentation/filesystems/Locking
> > index 0706d32..78910f4 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -386,6 +386,9 @@ prototypes:
> > int (*revalidate_disk) (struct gendisk *);
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > + struct pmem_layout *(*getpmem)(struct block_device *);
> > + void (*commitpmem)(struct block_device *, void *);
> > + void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
> > locking rules:
> > bd_mutex
> > @@ -399,6 +402,9 @@ unlock_native_capacity: no
> > revalidate_disk: no
> > getgeo: no
> > swap_slot_free_notify: no (see below)
> > +getpmem: no
> > +commitpmem: no
> > +clearerrorpmem: no
> > media_changed, unlock_native_capacity and revalidate_disk are called
> > only
> > from
> > check_disk_change().
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index aae187a..a57863c 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -27,6 +27,7 @@
> > #include <linux/namei.h>
> > #include <linux/log2.h>
> > #include <linux/cleancache.h>
> > +#include <linux/pmem.h>
> > #include <asm/uaccess.h>
> > #include "internal.h"
> > @@ -1716,3 +1717,44 @@ void iterate_bdevs(void (*func)(struct
> > block_device
> > *, void *), void *arg)
> > spin_unlock(&inode_sb_list_lock);
> > iput(old_inode);
> > }
> > +
> > +/**
> > + * get_pmemgeo() - Return persistent memory geometry information
> > + * @bdev: device to interrogate
> > + *
> > + * Provides the memory layout for a persistent memory volume which
> > + * is made up of CPU-addressable persistent memory. If the
> > interrogated
> > + * device does not support CPU-addressable persistent memory then
> > -ENOTTY
> > + * is returned.
> > + *
> > + * Return: a pointer to a pmem_layout structure or ERR_PTR
> > + */
> > +struct pmem_layout *get_pmemgeo(struct block_device *bdev)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (!bd_disk || !bd_disk->fops->getpmem)
> > + return ERR_PTR(-ENOTTY);
> > + return bd_disk->fops->getpmem(bdev);
> > +}
> > +EXPORT_SYMBOL(get_pmemgeo);
> > +
> > +void commit_pmem(struct block_device *bdev, void *addr)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (bd_disk && bd_disk->fops->commitpmem)
> > + bd_disk->fops->commitpmem(bdev, addr);
> > +}
> > +EXPORT_SYMBOL(commit_pmem);
> > +
> > +void clear_pmem_error(struct block_device *bdev, void *addr, size_t
> > len,
> > + void *data)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (bd_disk && bd_disk->fops->clearerrorpmem)
> > + bd_disk->fops->clearerrorpmem(bdev, addr, len, data);
> > +}
> > +EXPORT_SYMBOL(clear_pmem_error);
> > +
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 78feda9..ba2c1f5 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1498,6 +1498,10 @@ struct block_device_operations {
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > /* this callback is with swap_lock and sometimes page table lock held
> > */
> > void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > + /* persistent memory operations */
> > + struct pmem_layout * (*getpmem)(struct block_device *);
> > + void (*commitpmem)(struct block_device *, void *);
> > + void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
> > struct module *owner;
> > };
> > diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> > new file mode 100644
> > index 0000000..f907307
> > --- /dev/null
> > +++ b/include/linux/pmem.h
> > @@ -0,0 +1,47 @@
> > +/*
> > + * Definitions for the Persistent Memory interface
> > + * Copyright (c) 2013, Intel Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but
> > WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> > or
> > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
> > License
> > for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > along with
> > + * this program; if not, write to the Free Software Foundation, Inc.,
> > + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> > + */
> > +
> > +#ifndef _LINUX_PMEM_H
> > +#define _LINUX_PMEM_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct persistent_memory_extent {
> > + phys_addr_t pme_spa;
> > + u64 pme_len;
> > + int pme_numa_node;
> > +};
> > +
> > +struct pmem_layout {
> > + u64 pml_flags;
> > + u64 pml_total_size;
> > + u32 pml_extent_count;
> > + u32 pml_interleave; /* interleave bytes */
> > + struct persistent_memory_extent pml_extents[];
> > +};
> > +
> > +/*
> > + * Flags values
> > + */
> > +#define PMEM_ENABLED 0x0000000000000001 /* can be used for Persistent
> > Mem
> > */
> > +#define PMEM_ERRORED 0x0000000000000002 /* in an error state */
> > +#define PMEM_COMMIT 0x0000000000000004 /* commit function available */
> > +#define PMEM_CLEAR_ERROR 0x0000000000000008 /* clear error function
> > provided */
> > +
> > +#endif
> > +
> > --
> > 1.7.1
> >
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@...r.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists