Message-ID: <20140307075733.GB21790@kernel.org>
Date: Fri, 7 Mar 2014 15:57:33 +0800
From: Shaohua Li <shli@...nel.org>
To: linux-kernel@...r.kernel.org, dm-devel@...hat.com
Cc: agk@...hat.com, snitzer@...hat.com, axboe@...nel.dk
Subject: Re: [patch v3]DM: dm-insitu-comp: a compressed DM target for SSD
ping!
On Tue, Feb 18, 2014 at 06:13:04PM +0800, Shaohua Li wrote:
>
> This is a simple DM target supporting compression, for SSDs only. The
> underlying SSD must support a 512B sector size; the target itself only
> supports a 4k sector size.
>
> Disk layout:
> |super|...meta...|..data...|
>
> The storage unit is 4k (a block). The super block is 1 block and stores the
> meta and data sizes and the compression algorithm. Meta is a bitmap, with 5
> bits of meta per data block.
>
> Data:
> Data of a block is compressed. The compressed data is rounded up to 512B,
> which forms the payload. On disk, the payload is stored at the beginning of
> the block's logical sectors. Let's look at an example. Say we store data to
> block A, which is at sector B (A*8); its original size is 4k and its
> compressed size is 1500 bytes. The compressed data (CD) will use 3 sectors
> (512B each). These 3 sectors are the payload, stored starting at sector B.
>
> ---------------------------------------------------
> ... | CD1 | CD2 | CD3 | | | | | | ...
> ---------------------------------------------------
> ^B ^B+1 ^B+2 ^B+7 ^B+8
>
> For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta
> bits to record the payload size. The compressed size (1500) isn't stored in
> the meta directly. Instead, we store it in the last 32 bits of the payload;
> in this example, at the end of sector B+2. If compressed size + sizeof(u32)
> crosses a sector boundary, the payload grows by one sector. If the payload
> would use 8 sectors, we store the uncompressed data directly.
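The rounding rule above can be sketched in plain C. This is a minimal illustration, not code from the patch; `payload_sectors` is a hypothetical helper that mirrors the round-up-to-512B rule, the trailing u32 length field, and the fall-back to raw storage at 8 sectors:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the scheme described above: a 4k block
 * compresses to comp_len bytes; the payload is comp_len rounded up to a
 * 512B sector, plus one extra sector when the trailing u32 length field
 * would not fit in the slack. A full 8-sector payload means the data is
 * stored uncompressed. */
static unsigned payload_sectors(uint32_t comp_len)
{
	uint32_t payload = (comp_len + 511) & ~511u; /* round up to 512B */

	if (payload - comp_len < sizeof(uint32_t))  /* no room for length */
		payload += 512;
	if (payload >= 4096)                        /* no win: store raw */
		payload = 4096;
	return payload >> 9;                        /* sectors */
}
```

With the example values: 1500 bytes rounds to 1536 (3 sectors) with 36 bytes of slack for the length field, so the payload stays at 3 sectors and B+3 to B+7 are a hole.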
>
> If the IO size is bigger than one block, we can store the data as an
> extent. Data of the whole extent is compressed and stored in a similar way
> to the above. The first block of the extent is the head; all others are
> tails. If the extent is 1 block, that block is the head. We have 1 meta bit
> to indicate whether a block is head or tail. If the 4 meta bits of the head
> block can't hold the extent payload size, we borrow tail block meta bits to
> store it. The max allowed extent size is 128k, so we never
> compress/decompress too large a chunk of data.
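A userspace sketch of this head/tail encoding, under stated assumptions: `set_extent` and `get_extent_payload` are illustrative names modeled on the patch's `insitu_comp_set_extent`/`insitu_comp_get_extent`, and the masks mirror `INSITU_COMP_TAIL_MASK`/`INSITU_COMP_LENGTH_MASK`. Each block's 4 length bits carry up to 8 payload sectors; the remainder spills into the tail blocks:

```c
#include <assert.h>
#include <stdint.h>

#define TAIL_MASK 0x10 /* bit 4: block is a tail of an extent */
#define LEN_MASK  0x0f /* bits 0-3: payload sectors carried by block */

/* Spread an extent's payload sector count across per-block meta:
 * at most 8 sectors per block, head first, remainder in the tails. */
static void set_extent(uint8_t *meta, int blocks, unsigned payload_sectors)
{
	for (int i = 0; i < blocks; i++) {
		unsigned chunk = payload_sectors > 8 ? 8 : payload_sectors;

		payload_sectors -= chunk;
		meta[i] = chunk | (i ? TAIL_MASK : 0);
	}
}

/* Recover the extent's total payload by summing the length bits. */
static unsigned get_extent_payload(const uint8_t *meta, int blocks)
{
	unsigned total = 0;

	for (int i = 0; i < blocks; i++)
		total += meta[i] & LEN_MASK;
	return total;
}
```

For a 4-block extent with an 11-sector payload, the head carries 8 sectors and the first tail carries 3; the remaining tails carry 0 but keep the tail bit set so the extent boundary is recoverable.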
>
> Meta:
> Modifying data modifies meta too. Meta is written (flushed) to disk
> according to the meta write policy. We support writeback and writethrough
> modes. In writeback mode, meta is written to disk at an interval or on a
> FLUSH request. In writethrough mode, data and metadata are written to disk
> together.
>
> Advantages:
> 1. Simple. Since we store compressed data in place, we don't need
> complicated disk data management.
> 2. Efficient. Each 4k block needs only 5 bits of meta, so 1T of data uses
> less than 200M of meta and we can load all meta into memory. The actual
> compressed size lives in the payload, so if an IO needs no
> read-modify-write and we use writeback meta flush, no extra IO for meta is
> needed.
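The overhead claim is easy to verify; a quick back-of-envelope in C (`meta_bytes_for` is a hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* 5 meta bits per 4k data block, as described above. */
static uint64_t meta_bytes_for(uint64_t data_bytes)
{
	uint64_t blocks = data_bytes >> 12; /* 4k blocks */

	return blocks * 5 / 8;              /* 5 bits each, in bytes */
}
```

For 1T of data this gives 2^28 blocks * 5 bits = 160M of meta, comfortably under the 200M figure quoted above.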
>
> Disadvantages:
> 1. Holes. Since we store compressed data in place, there are a lot of
> holes (in the above example, B+3 to B+7). Holes hurt IO because we can't
> merge IO across them.
> 2. 1:1 size. Compression doesn't change the disk size. If the disk is 1T,
> we can only store 1T of data even with compression.
>
> But this target is for SSDs only. SSD firmware generally has an FTL layer
> that maps disk sectors to NAND flash; high-end SSD firmware has a
> filesystem-like FTL.
> 1. Holes. The disk has a lot of holes, but the SSD FTL can still store the
> data contiguously in NAND. Even though we can't merge IO at the OS layer,
> the SSD firmware can.
> 2. 1:1 size. On one hand, we write compressed data to the SSD, so less
> data is written to flash. This greatly helps SSD garbage collection, and
> hence write speed and endurance, so the target is useful even with this
> issue. On the other hand, an advanced SSD FTL can easily do thin
> provisioning. For example, if the NAND is 1T but the SSD reports itself as
> 2T and is used as a compressed target, the 1:1 size issue disappears.
>
> So if SSD FTL can map non-continuous disk sectors to continuous nand and
> support thin provision, the compressed target will work very well.
>
> V2->V3:
> Updated with new bio iter API
>
> V1->V2:
> 1. Change name to insitu_comp, cleanup code, add comments and doc
> 2. Improve performance (extent locking, dedicated workqueue)
>
> Signed-off-by: Shaohua Li <shli@...ionio.com>
> ---
> Documentation/device-mapper/insitu-comp.txt | 50
> drivers/md/Kconfig | 6
> drivers/md/Makefile | 1
> drivers/md/dm-insitu-comp.c | 1480 ++++++++++++++++++++++++++++
> drivers/md/dm-insitu-comp.h | 158 ++
> 5 files changed, 1695 insertions(+)
>
> Index: linux/drivers/md/Kconfig
> ===================================================================
> --- linux.orig/drivers/md/Kconfig 2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Kconfig 2014-02-17 17:34:45.423464815 +0800
> @@ -295,6 +295,12 @@ config DM_CACHE_CLEANER
> A simple cache policy that writes back all data to the
> origin. Used when decommissioning a dm-cache.
>
> +config DM_INSITU_COMPRESSION
> + tristate "Insitu compression target"
> + depends on BLK_DEV_DM
> + ---help---
> + Allow volume managers to compress data in place on SSDs.
> +
> config DM_MIRROR
> tristate "Mirror target"
> depends on BLK_DEV_DM
> Index: linux/drivers/md/Makefile
> ===================================================================
> --- linux.orig/drivers/md/Makefile 2014-02-17 17:34:45.431464714 +0800
> +++ linux/drivers/md/Makefile 2014-02-17 17:34:45.423464815 +0800
> @@ -53,6 +53,7 @@ obj-$(CONFIG_DM_VERITY) += dm-verity.o
> obj-$(CONFIG_DM_CACHE) += dm-cache.o
> obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o
> obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
> +obj-$(CONFIG_DM_INSITU_COMPRESSION) += dm-insitu-comp.o
>
> ifeq ($(CONFIG_DM_UEVENT),y)
> dm-mod-objs += dm-uevent.o
> Index: linux/drivers/md/dm-insitu-comp.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.c 2014-02-17 20:16:38.093360018 +0800
> @@ -0,0 +1,1480 @@
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/slab.h>
> +#include <linux/device-mapper.h>
> +#include <linux/dm-io.h>
> +#include <linux/crypto.h>
> +#include <linux/lzo.h>
> +#include <linux/kthread.h>
> +#include <linux/page-flags.h>
> +#include <linux/completion.h>
> +#include "dm-insitu-comp.h"
> +
> +#define DM_MSG_PREFIX "dm_insitu_comp"
> +
> +static struct insitu_comp_compressor_data compressors[] = {
> + [INSITU_COMP_ALG_LZO] = {
> + .name = "lzo",
> + .comp_len = lzo_comp_len,
> + },
> + [INSITU_COMP_ALG_ZLIB] = {
> + .name = "deflate",
> + },
> +};
> +static int default_compressor;
> +
> +static struct kmem_cache *insitu_comp_io_range_cachep;
> +static struct kmem_cache *insitu_comp_meta_io_cachep;
> +
> +static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
> +static struct workqueue_struct *insitu_comp_wq;
> +
> +/* each block has 5 bits metadata */
> +static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
> +{
> + u64 first_bit = block_index * INSITU_COMP_META_BITS;
> + int bits, offset;
> + u8 data, ret = 0;
> +
> + offset = first_bit & 7;
> + bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> + data = info->meta_bitmap[first_bit >> 3];
> + ret = (data >> offset) & ((1 << bits) - 1);
> +
> + if (bits < INSITU_COMP_META_BITS) {
> + data = info->meta_bitmap[(first_bit >> 3) + 1];
> + bits = INSITU_COMP_META_BITS - bits;
> + ret |= (data & ((1 << bits) - 1)) <<
> + (INSITU_COMP_META_BITS - bits);
> + }
> + return ret;
> +}
> +
> +static void insitu_comp_set_meta(struct insitu_comp_info *info,
> + u64 block_index, u8 meta, bool dirty_meta)
> +{
> + u64 first_bit = block_index * INSITU_COMP_META_BITS;
> + int bits, offset;
> + u8 data;
> + struct page *page;
> +
> + offset = first_bit & 7;
> + bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
> +
> + data = info->meta_bitmap[first_bit >> 3];
> + data &= ~(((1 << bits) - 1) << offset);
> + data |= (meta & ((1 << bits) - 1)) << offset;
> + info->meta_bitmap[first_bit >> 3] = data;
> +
> + /*
> + * For writethrough, we write metadata directly. For writeback, if
> + * request is FUA, we do this too; otherwise we just dirty the page,
> + * which will be flush out in an interval
> + */
> + if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> + page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
> + if (dirty_meta)
> + SetPageDirty(page);
> + else
> + ClearPageDirty(page);
> + }
> +
> + if (bits < INSITU_COMP_META_BITS) {
> + meta >>= bits;
> + data = info->meta_bitmap[(first_bit >> 3) + 1];
> + bits = INSITU_COMP_META_BITS - bits;
> + data = (data >> bits) << bits;
> + data |= meta & ((1 << bits) - 1);
> + info->meta_bitmap[(first_bit >> 3) + 1] = data;
> +
> + if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> + page = vmalloc_to_page(&info->meta_bitmap[
> + (first_bit >> 3) + 1]);
> + if (dirty_meta)
> + SetPageDirty(page);
> + else
> + ClearPageDirty(page);
> + }
> + }
> +}
> +
> +/*
> + * set metadata for an extent since block @block_index, length is
> + * @logical_blocks. The extent uses @data_sectors sectors
> + */
> +static void insitu_comp_set_extent(struct insitu_comp_req *req,
> + u64 block_index, u16 logical_blocks, sector_t data_sectors)
> +{
> + int i;
> + u8 data;
> +
> + for (i = 0; i < logical_blocks; i++) {
> + data = min_t(sector_t, data_sectors, 8);
> + data_sectors -= data;
> + if (i != 0)
> + data |= INSITU_COMP_TAIL_MASK;
> + /* For FUA, we write out meta data directly */
> + insitu_comp_set_meta(req->info, block_index + i, data,
> + !(insitu_req_rw(req) & REQ_FUA));
> + }
> +}
> +
> +/*
> + * get metadata for an extent covering block @block_index. @first_block_index
> + * returns the first block of the extent. @logical_sectors returns the extent
> + * length. @data_sectors returns the sectors the extent uses
> + */
> +static void insitu_comp_get_extent(struct insitu_comp_info *info,
> + u64 block_index, u64 *first_block_index, u16 *logical_sectors,
> + u16 *data_sectors)
> +{
> + u8 data;
> +
> + data = insitu_comp_get_meta(info, block_index);
> + while (data & INSITU_COMP_TAIL_MASK) {
> + block_index--;
> + data = insitu_comp_get_meta(info, block_index);
> + }
> + *first_block_index = block_index;
> + *logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
> + *data_sectors = data & INSITU_COMP_LENGTH_MASK;
> + block_index++;
> + while (block_index < info->data_blocks) {
> + data = insitu_comp_get_meta(info, block_index);
> + if (!(data & INSITU_COMP_TAIL_MASK))
> + break;
> + *logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
> + *data_sectors += data & INSITU_COMP_LENGTH_MASK;
> + block_index++;
> + }
> +}
> +
> +static int insitu_comp_access_super(struct insitu_comp_info *info,
> + void *addr, int rw)
> +{
> + struct dm_io_region region;
> + struct dm_io_request req;
> + unsigned long io_error = 0;
> + int ret;
> +
> + region.bdev = info->dev->bdev;
> + region.sector = 0;
> + region.count = INSITU_COMP_BLOCK_SIZE >> 9;
> +
> + req.bi_rw = rw;
> + req.mem.type = DM_IO_KMEM;
> + req.mem.offset = 0;
> + req.mem.ptr.addr = addr;
> + req.notify.fn = NULL;
> + req.client = info->io_client;
> +
> + ret = dm_io(&req, 1, &region, &io_error);
> + if (ret || io_error)
> + return -EIO;
> + return 0;
> +}
> +
> +static void insitu_comp_meta_io_done(unsigned long error, void *context)
> +{
> + struct insitu_comp_meta_io *meta_io = context;
> +
> + meta_io->fn(meta_io->data, error);
> + kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
> +}
> +
> +static int insitu_comp_write_meta(struct insitu_comp_info *info,
> + u64 start_page, u64 end_page, void *data,
> + void (*fn)(void *data, unsigned long error), int rw)
> +{
> + struct insitu_comp_meta_io *meta_io;
> +
> + BUG_ON(end_page > info->meta_bitmap_pages);
> +
> + meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
> + if (!meta_io) {
> + fn(data, -ENOMEM);
> + return -ENOMEM;
> + }
> + meta_io->data = data;
> + meta_io->fn = fn;
> +
> + meta_io->io_region.bdev = info->dev->bdev;
> + meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
> + (start_page << (PAGE_SHIFT - 9));
> + meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
> +
> + atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
> +
> + meta_io->io_req.bi_rw = rw;
> + meta_io->io_req.mem.type = DM_IO_VMA;
> + meta_io->io_req.mem.offset = 0;
> + meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
> + (start_page << PAGE_SHIFT);
> + meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
> + meta_io->io_req.notify.context = meta_io;
> + meta_io->io_req.client = info->io_client;
> +
> + dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
> + return 0;
> +}
> +
> +struct writeback_flush_data {
> + struct completion complete;
> + atomic_t cnt;
> +};
> +
> +static void writeback_flush_io_done(void *data, unsigned long error)
> +{
> + struct writeback_flush_data *wb = data;
> +
> + if (atomic_dec_return(&wb->cnt))
> + return;
> + complete(&wb->complete);
> +}
> +
> +static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
> + struct writeback_flush_data *data)
> +{
> + struct page *page;
> + u64 start = 0, index;
> + u32 pending = 0, cnt = 0;
> + bool dirty;
> + struct blk_plug plug;
> +
> + blk_start_plug(&plug);
> + for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
> + if (cnt == 256) {
> + cnt = 0;
> + cond_resched();
> + }
> +
> + page = vmalloc_to_page(info->meta_bitmap +
> + (index << PAGE_SHIFT));
> + dirty = TestClearPageDirty(page);
> +
> + if (pending == 0 && dirty) {
> + start = index;
> + pending++;
> + continue;
> + } else if (pending == 0)
> + continue;
> + else if (pending > 0 && dirty) {
> + pending++;
> + continue;
> + }
> +
> + /* pending > 0 && !dirty */
> + atomic_inc(&data->cnt);
> + insitu_comp_write_meta(info, start, start + pending, data,
> + writeback_flush_io_done, WRITE);
> + pending = 0;
> + }
> +
> + if (pending > 0) {
> + atomic_inc(&data->cnt);
> + insitu_comp_write_meta(info, start, start + pending, data,
> + writeback_flush_io_done, WRITE);
> + }
> + blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
> + blk_finish_plug(&plug);
> +}
> +
> +/* writeback thread flushes all dirty metadata to disk at an interval */
> +static int insitu_comp_meta_writeback_thread(void *data)
> +{
> + struct insitu_comp_info *info = data;
> + struct writeback_flush_data wb;
> +
> + atomic_set(&wb.cnt, 1);
> + init_completion(&wb.complete);
> +
> + while (!kthread_should_stop()) {
> + schedule_timeout_interruptible(
> + msecs_to_jiffies(info->writeback_delay * 1000));
> + insitu_comp_flush_dirty_meta(info, &wb);
> + }
> +
> + insitu_comp_flush_dirty_meta(info, &wb);
> +
> + writeback_flush_io_done(&wb, 0);
> + wait_for_completion(&wb.complete);
> + return 0;
> +}
> +
> +static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
> +{
> + struct dm_io_region region;
> + struct dm_io_request req;
> + unsigned long io_error = 0;
> + struct blk_plug plug;
> + int ret;
> + ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> +
> + len *= sizeof(unsigned long);
> +
> + region.bdev = info->dev->bdev;
> + region.sector = INSITU_COMP_META_START_SECTOR;
> + region.count = (len + 511) >> 9;
> +
> + req.mem.type = DM_IO_VMA;
> + req.mem.offset = 0;
> + req.mem.ptr.addr = info->meta_bitmap;
> + req.notify.fn = NULL;
> + req.client = info->io_client;
> +
> + blk_start_plug(&plug);
> + if (new) {
> + memset(info->meta_bitmap, 0, len);
> + req.bi_rw = WRITE_FLUSH;
> + ret = dm_io(&req, 1, &region, &io_error);
> + } else {
> + req.bi_rw = READ;
> + ret = dm_io(&req, 1, &region, &io_error);
> + }
> + blk_finish_plug(&plug);
> +
> + if (ret || io_error) {
> + info->ti->error = "Access metadata error";
> + return -EIO;
> + }
> +
> + if (info->write_mode == INSITU_COMP_WRITE_BACK) {
> + info->writeback_tsk = kthread_run(
> + insitu_comp_meta_writeback_thread,
> + info, "insitu_comp_writeback");
> + if (IS_ERR(info->writeback_tsk)) {
> + info->ti->error = "Create writeback thread error";
> + return -EINVAL;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + info->tfm[i] = crypto_alloc_comp(
> + compressors[info->comp_alg].name, 0, 0);
> + if (IS_ERR(info->tfm[i])) {
> + info->tfm[i] = NULL;
> + goto err;
> + }
> + }
> + return 0;
> +err:
> + for_each_possible_cpu(i) {
> + if (info->tfm[i]) {
> + crypto_free_comp(info->tfm[i]);
> + info->tfm[i] = NULL;
> + }
> + }
> + return -ENOMEM;
> +}
> +
> +static void insitu_comp_free_compressor(struct insitu_comp_info *info)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + if (info->tfm[i]) {
> + crypto_free_comp(info->tfm[i]);
> + info->tfm[i] = NULL;
> + }
> + }
> +}
> +
> +static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
> +{
> + void *addr;
> + struct insitu_comp_super_block *super;
> + u64 total_blocks;
> + u64 data_blocks, meta_blocks;
> + u32 rem, cnt;
> + bool new_super = false;
> + int ret;
> + ssize_t len;
> +
> + total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
> + INSITU_COMP_BLOCK_SHIFT;
> + data_blocks = total_blocks - 1;
> + rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
> + INSITU_COMP_META_BITS);
> + meta_blocks = data_blocks * INSITU_COMP_META_BITS;
> + data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
> +
> + cnt = rem;
> + rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> + data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
> + INSITU_COMP_META_BITS);
> + meta_blocks += rem;
> +
> + cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
> + meta_blocks += 1;
> + data_blocks += cnt - 1;
> +
> + info->data_blocks = data_blocks;
> + info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +
> + addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
> + if (!addr) {
> + info->ti->error = "Cannot allocate super";
> + return -ENOMEM;
> + }
> +
> + super = addr;
> + ret = insitu_comp_access_super(info, addr, READ);
> + if (ret)
> + goto out;
> +
> + if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
> + if (le64_to_cpu(super->version) != INSITU_COMP_VERSION ||
> + le64_to_cpu(super->meta_blocks) != meta_blocks ||
> + le64_to_cpu(super->data_blocks) != data_blocks) {
> + info->ti->error = "Super is invalid";
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
> + info->ti->error =
> + "Compression algorithm is not supported";
> + ret = -EINVAL;
> + goto out;
> + }
> + } else {
> + super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
> + super->version = cpu_to_le64(INSITU_COMP_VERSION);
> + super->meta_blocks = cpu_to_le64(meta_blocks);
> + super->data_blocks = cpu_to_le64(data_blocks);
> + super->comp_alg = default_compressor;
> + ret = insitu_comp_access_super(info, addr, WRITE_FUA);
> + if (ret) {
> + info->ti->error = "Access super fails";
> + goto out;
> + }
> + new_super = true;
> + }
> +
> + info->comp_alg = super->comp_alg;
> + if (insitu_comp_alloc_compressor(info)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
> + len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
> + len *= sizeof(unsigned long);
> + info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
> + if (!info->meta_bitmap) {
> + ret = -ENOMEM;
> + goto bitmap_err;
> + }
> +
> + ret = insitu_comp_init_meta(info, new_super);
> + if (ret)
> + goto meta_err;
> +
> + return 0;
> +meta_err:
> + vfree(info->meta_bitmap);
> +bitmap_err:
> + insitu_comp_free_compressor(info);
> +out:
> + kfree(addr);
> + return ret;
> +}
> +
> +/*
> + * <dev> <writethrough>/<writeback> <meta_commit_delay>
> + */
> +static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> + struct insitu_comp_info *info;
> + char write_mode[15];
> + int ret, i;
> +
> + if (argc < 2) {
> + ti->error = "Invalid argument count";
> + return -EINVAL;
> + }
> +
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
> + if (!info) {
> + ti->error = "Cannot allocate context";
> + return -ENOMEM;
> + }
> + info->ti = ti;
> +
> + if (sscanf(argv[1], "%s", write_mode) != 1) {
> + ti->error = "Invalid argument";
> + ret = -EINVAL;
> + goto err_para;
> + }
> +
> + if (strcmp(write_mode, "writeback") == 0) {
> + if (argc != 3) {
> + ti->error = "Invalid argument";
> + ret = -EINVAL;
> + goto err_para;
> + }
> + info->write_mode = INSITU_COMP_WRITE_BACK;
> + if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
> + ti->error = "Invalid argument";
> + ret = -EINVAL;
> + goto err_para;
> + }
> + } else if (strcmp(write_mode, "writethrough") == 0) {
> + info->write_mode = INSITU_COMP_WRITE_THROUGH;
> + } else {
> + ti->error = "Invalid argument";
> + ret = -EINVAL;
> + goto err_para;
> + }
> +
> + if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> + &info->dev)) {
> + ti->error = "Can't get device";
> + ret = -EINVAL;
> + goto err_para;
> + }
> +
> + info->io_client = dm_io_client_create();
> + if (IS_ERR(info->io_client)) {
> + ti->error = "Can't create io client";
> + ret = -EINVAL;
> + goto err_ioclient;
> + }
> +
> + if (bdev_logical_block_size(info->dev->bdev) != 512) {
> + ti->error = "Can't logical block size too big";
> + ret = -EINVAL;
> + goto err_blocksize;
> + }
> +
> + ret = insitu_comp_read_or_create_super(info);
> + if (ret)
> + goto err_blocksize;
> +
> + for (i = 0; i < BITMAP_HASH_LEN; i++) {
> + info->bitmap_locks[i].io_running = 0;
> + spin_lock_init(&info->bitmap_locks[i].wait_lock);
> + INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
> + }
> +
> + atomic64_set(&info->compressed_write_size, 0);
> + atomic64_set(&info->uncompressed_write_size, 0);
> + atomic64_set(&info->meta_write_size, 0);
> + ti->num_flush_bios = 1;
> + /* doesn't support discard yet */
> + ti->per_bio_data_size = sizeof(struct insitu_comp_req);
> + ti->private = info;
> + return 0;
> +err_blocksize:
> + dm_io_client_destroy(info->io_client);
> +err_ioclient:
> + dm_put_device(ti, info->dev);
> +err_para:
> + kfree(info);
> + return ret;
> +}
> +
> +static void insitu_comp_dtr(struct dm_target *ti)
> +{
> + struct insitu_comp_info *info = ti->private;
> +
> + if (info->write_mode == INSITU_COMP_WRITE_BACK)
> + kthread_stop(info->writeback_tsk);
> + insitu_comp_free_compressor(info);
> + vfree(info->meta_bitmap);
> + dm_io_client_destroy(info->io_client);
> + dm_put_device(ti, info->dev);
> + kfree(info);
> +}
> +
> +static u64 insitu_comp_sector_to_block(sector_t sect)
> +{
> + return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
> +{
> + return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
> + BITMAP_HASH_MASK];
> +}
> +
> +static struct insitu_comp_hash_lock *
> +insitu_comp_trylock_block(struct insitu_comp_info *info,
> + struct insitu_comp_req *req, u64 block_index)
> +{
> + struct insitu_comp_hash_lock *hash_lock;
> +
> + hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
> +
> + spin_lock_irq(&hash_lock->wait_lock);
> + if (!hash_lock->io_running) {
> + hash_lock->io_running = 1;
> + spin_unlock_irq(&hash_lock->wait_lock);
> + return hash_lock;
> + }
> + list_add_tail(&req->sibling, &hash_lock->wait_list);
> + spin_unlock_irq(&hash_lock->wait_lock);
> + return NULL;
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> + struct list_head *list);
> +static void insitu_comp_unlock_block(struct insitu_comp_info *info,
> + struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
> +{
> + LIST_HEAD(pending_list);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&hash_lock->wait_lock, flags);
> + /* wakeup all pending reqs to avoid live lock */
> + list_splice_init(&hash_lock->wait_list, &pending_list);
> + hash_lock->io_running = 0;
> + spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
> +
> + insitu_comp_queue_req_list(info, &pending_list);
> +}
> +
> +static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
> +{
> + insitu_comp_unlock_block(req->info, req, req->lock);
> +}
> +
> +/* Check comments of HASH_LOCK_SHIFT. each request only need take one lock */
> +static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
> +{
> + u64 block_index, tmp;
> +
> + block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> + tmp = insitu_comp_sector_to_block(insitu_req_end_sector(req) - 1);
> + BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
> + insitu_comp_block_hash_lock(req->info, tmp));
> +
> + req->lock = insitu_comp_trylock_block(req->info, req, block_index);
> + if (!req->lock)
> + return 0;
> +
> + return 1;
> +}
> +
> +static void insitu_comp_queue_req(struct insitu_comp_info *info,
> + struct insitu_comp_req *req)
> +{
> + unsigned long flags;
> + struct insitu_comp_io_worker *worker =
> + &insitu_comp_io_workers[req->cpu];
> +
> + spin_lock_irqsave(&worker->lock, flags);
> + list_add_tail(&req->sibling, &worker->pending);
> + spin_unlock_irqrestore(&worker->lock, flags);
> +
> + queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
> +}
> +
> +static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
> + struct list_head *list)
> +{
> + struct insitu_comp_req *req;
> + while (!list_empty(list)) {
> + req = list_first_entry(list, struct insitu_comp_req, sibling);
> + list_del_init(&req->sibling);
> + insitu_comp_queue_req(info, req);
> + }
> +}
> +
> +static void insitu_comp_get_req(struct insitu_comp_req *req)
> +{
> + atomic_inc(&req->io_pending);
> +}
> +
> +static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
> +{
> + kfree(io->decomp_data);
> + kfree(io->comp_data);
> + kmem_cache_free(insitu_comp_io_range_cachep, io);
> +}
> +
> +static void insitu_comp_put_req(struct insitu_comp_req *req)
> +{
> + struct insitu_comp_io_range *io;
> +
> + if (atomic_dec_return(&req->io_pending))
> + return;
> +
> + if (req->stage == STAGE_INIT) /* waiting for locking */
> + return;
> +
> + if (req->stage == STAGE_READ_DECOMP ||
> + req->stage == STAGE_WRITE_COMP ||
> + req->result)
> + req->stage = STAGE_DONE;
> +
> + if (req->stage != STAGE_DONE) {
> + insitu_comp_queue_req(req->info, req);
> + return;
> + }
> +
> + while (!list_empty(&req->all_io)) {
> + io = list_entry(req->all_io.next, struct insitu_comp_io_range,
> + next);
> + list_del(&io->next);
> + insitu_comp_free_io_range(io);
> + }
> +
> + insitu_comp_unlock_req_range(req);
> +
> + insitu_req_endio(req, req->result);
> +}
> +
> +static void insitu_comp_io_range_done(unsigned long error, void *context)
> +{
> + struct insitu_comp_io_range *io = context;
> +
> + if (error)
> + io->req->result = error;
> + insitu_comp_put_req(io->req);
> +}
> +
> +static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
> + int len)
> +{
> + if (compressors[info->comp_alg].comp_len)
> + return compressors[info->comp_alg].comp_len(len);
> + return len;
> +}
> +
> +/*
> + * caller should set region.sector, region.count. bi_rw. IO always to/from
> + * comp_data
> + */
> +static struct insitu_comp_io_range *
> +insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
> + int decomp_len)
> +{
> + struct insitu_comp_io_range *io;
> +
> + io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
> + if (!io)
> + return NULL;
> +
> + io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
> + GFP_NOIO);
> + io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
> + if (!io->decomp_data || !io->comp_data) {
> + kfree(io->decomp_data);
> + kfree(io->comp_data);
> + kmem_cache_free(insitu_comp_io_range_cachep, io);
> + return NULL;
> + }
> +
> + io->io_req.notify.fn = insitu_comp_io_range_done;
> + io->io_req.notify.context = io;
> + io->io_req.client = req->info->io_client;
> + io->io_req.mem.type = DM_IO_KMEM;
> + io->io_req.mem.ptr.addr = io->comp_data;
> + io->io_req.mem.offset = 0;
> +
> + io->io_region.bdev = req->info->dev->bdev;
> +
> + io->decomp_len = decomp_len;
> + io->comp_len = comp_len;
> + io->req = req;
> + return io;
> +}
> +
> +static void insitu_comp_req_copy(struct insitu_comp_req *req, off_t req_off, void *buf,
> + ssize_t len, bool to_buf)
> +{
> + struct bio *bio = req->bio;
> + struct bvec_iter iter;
> + off_t buf_off = 0;
> + ssize_t size;
> + void *addr;
> +
> + iter = bio->bi_iter;
> + bio_advance_iter(bio, &iter, req_off);
> +
> + while (len) {
> + addr = kmap_atomic(bio_iter_page(bio, iter));
> + size = min_t(ssize_t, len, bio_iter_len(bio, iter));
> + if (to_buf)
> + memcpy(buf + buf_off, addr + bio_iter_offset(bio, iter),
> + size);
> + else
> + memcpy(addr + bio_iter_offset(bio, iter), buf + buf_off,
> + size);
> + kunmap_atomic(addr);
> +
> + buf_off += size;
> + len -= size;
> +
> + bio_advance_iter(bio, &iter, size);
> + }
> +}
> +
> +/*
> + * return value:
> + * < 0 : error
> + * == 0 : ok
> + * == 1 : ok, but comp/decomp is skipped
> + * Compressed data size is roundup of 512, which makes the payload.
> + * We store the actual compressed length in the last u32 of the payload.
> + * If there is no free space, we add 512 to the payload size.
> + */
> +static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
> + void *comp_data, unsigned int *comp_len, void *decomp_data,
> + unsigned int decomp_len, bool do_comp)
> +{
> + struct crypto_comp *tfm;
> + u32 *addr;
> + unsigned int actual_comp_len;
> + int ret;
> +
> + if (do_comp) {
> + actual_comp_len = *comp_len;
> +
> + tfm = info->tfm[get_cpu()];
> + ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
> + comp_data, &actual_comp_len);
> + put_cpu();
> +
> + atomic64_add(decomp_len, &info->uncompressed_write_size);
> + if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
> + *comp_len = decomp_len;
> + atomic64_add(*comp_len, &info->compressed_write_size);
> + return 1;
> + }
> +
> + *comp_len = round_up(actual_comp_len, 512);
> + if (*comp_len - actual_comp_len < sizeof(u32))
> + *comp_len += 512;
> + atomic64_add(*comp_len, &info->compressed_write_size);
> + addr = comp_data + *comp_len;
> + addr--;
> + *addr = cpu_to_le32(actual_comp_len);
> + } else {
> + if (*comp_len == decomp_len)
> + return 1;
> + addr = comp_data + *comp_len;
> + addr--;
> + actual_comp_len = le32_to_cpu(*addr);
> +
> + tfm = info->tfm[get_cpu()];
> + ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
> + decomp_data, &decomp_len);
> + put_cpu();
> + if (ret)
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +/*
> + * compressed data is updated. We decompress it and fill req. If there is no
> + * valid compressed data, we just zero req
> + */
> +static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
> +{
> + struct insitu_comp_io_range *io;
> + off_t req_off = 0;
> + int ret;
> +
> + req->stage = STAGE_READ_DECOMP;
> +
> + if (req->result)
> + return;
> +
> + list_for_each_entry(io, &req->all_io, next) {
> + ssize_t dst_off = 0, src_off = 0, len;
> +
> + io->io_region.sector -= req->info->data_start;
> +
> + /* Do decomp here */
> + ret = insitu_comp_io_range_comp(req->info, io->comp_data,
> + &io->comp_len, io->decomp_data, io->decomp_len, false);
> + if (ret < 0) {
> + req->result = -EIO;
> + return;
> + }
> +
> + if (io->io_region.sector >= insitu_req_start_sector(req))
> + dst_off = (io->io_region.sector - insitu_req_start_sector(req))
> + << 9;
> + else
> + src_off = (insitu_req_start_sector(req) - io->io_region.sector)
> + << 9;
> + len = min_t(ssize_t, io->decomp_len - src_off,
> + (insitu_req_sectors(req) << 9) - dst_off);
> +
> + /* io range in all_io list is ordered for read IO */
> + while (req_off != dst_off) {
> + ssize_t size = min_t(ssize_t, PAGE_SIZE,
> + dst_off - req_off);
> + insitu_comp_req_copy(req, req_off,
> + empty_zero_page, size, false);
> + req_off += size;
> + }
> +
> + if (ret == 1) /* uncompressed, valid data is in .comp_data */
> + insitu_comp_req_copy(req, dst_off,
> + io->comp_data + src_off, len, false);
> + else
> + insitu_comp_req_copy(req, dst_off,
> + io->decomp_data + src_off, len, false);
> + req_off = dst_off + len;
> + }
> +
> + while (req_off != (insitu_req_sectors(req) << 9)) {
> + ssize_t size = min_t(ssize_t, PAGE_SIZE,
> + (insitu_req_sectors(req) << 9) - req_off);
> + insitu_comp_req_copy(req, req_off, empty_zero_page,
> + size, false);
> + req_off += size;
> + }
> +}
> +
> +/*
> + * read one extent's data from disk. The extent starts at block @block and
> + * has @data_sectors sectors of data
> + */
> +static void insitu_comp_read_one_extent(struct insitu_comp_req *req, u64 block,
> + u16 logical_sectors, u16 data_sectors)
> +{
> + struct insitu_comp_io_range *io;
> +
> + io = insitu_comp_create_io_range(req, data_sectors << 9,
> + logical_sectors << 9);
> + if (!io) {
> + req->result = -EIO;
> + return;
> + }
> +
> + insitu_comp_get_req(req);
> + list_add_tail(&io->next, &req->all_io);
> +
> + io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
> + req->info->data_start;
> + io->io_region.count = data_sectors;
> +
> + io->io_req.bi_rw = READ;
> + dm_io(&io->io_req, 1, &io->io_region, NULL);
> +}
> +
> +static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
> +{
> + u64 block_index, first_block_index;
> + u16 logical_sectors, data_sectors;
> +
> + req->stage = STAGE_READ_EXISTING;
> +
> + block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> +again:
> + insitu_comp_get_extent(req->info, block_index, &first_block_index,
> + &logical_sectors, &data_sectors);
> + if (data_sectors > 0)
> + insitu_comp_read_one_extent(req, first_block_index,
> + logical_sectors, data_sectors);
> +
> + if (req->result)
> + return;
> +
> + block_index = first_block_index + (logical_sectors >>
> + INSITU_COMP_BLOCK_SECTOR_SHIFT);
> + /* the request might cover several extents */
> + if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
> + insitu_req_end_sector(req))
> + goto again;
> +
> + /* A shortcut if all the data is already in */
> + if (list_empty(&req->all_io))
> + insitu_comp_handle_read_decomp(req);
> +}
> +
> +static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
> +{
> + insitu_comp_get_req(req);
> +
> + if (req->stage == STAGE_INIT) {
> + if (!insitu_comp_lock_req_range(req)) {
> + insitu_comp_put_req(req);
> + return;
> + }
> +
> + insitu_comp_handle_read_read_existing(req);
> + } else if (req->stage == STAGE_READ_EXISTING)
> + insitu_comp_handle_read_decomp(req);
> +
> + insitu_comp_put_req(req);
> +}
> +
> +static void insitu_comp_write_meta_done(void *context, unsigned long error)
> +{
> + struct insitu_comp_req *req = context;
> + insitu_comp_put_req(req);
> +}
> +
> +static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
> +{
> + u64 bits = block * INSITU_COMP_META_BITS - !!end;
> + /* (1 << 3) bits per byte */
> + return bits >> (3 + PAGE_SHIFT);
> +}
> +
> +/*
> + * the request partially covers some extents. Decompress the extents' data,
> + * compress the remaining valid data, and finally write it out
> + */
> +static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
> + u64 *meta_start, u64 *meta_end, bool *handle_req)
> +{
> + struct insitu_comp_req *req = io->req;
> + sector_t start, count;
> + unsigned int comp_len;
> + off_t offset;
> + u64 page_index;
> + int ret;
> +
> + io->io_region.sector -= req->info->data_start;
> +
> + /* decompress original data */
> + ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
> + io->decomp_data, io->decomp_len, false);
> + if (ret < 0) {
> + req->result = -EINVAL;
> + return -EIO;
> + }
> +
> + start = io->io_region.sector;
> + count = io->decomp_len >> 9;
> + if (start < insitu_req_start_sector(req) && start + count >
> + insitu_req_end_sector(req)) {
> + /* we don't split an extent */
> + if (ret == 1) {
> + memcpy(io->decomp_data, io->comp_data, io->decomp_len);
> + insitu_comp_req_copy(req, 0,
> + io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> + 9), insitu_req_sectors(req) << 9, true);
> + } else {
> + insitu_comp_req_copy(req, 0,
> + io->decomp_data + ((insitu_req_start_sector(req) - start) <<
> + 9), insitu_req_sectors(req) << 9, true);
> + kfree(io->comp_data);
> + /* New compressed len might be bigger */
> + io->comp_data = kmalloc(insitu_comp_compressor_len(
> + req->info, io->decomp_len), GFP_NOIO);
> + io->comp_len = io->decomp_len;
> + if (!io->comp_data) {
> + req->result = -ENOMEM;
> + return -EIO;
> + }
> + io->io_req.mem.ptr.addr = io->comp_data;
> + }
> + /* need compress data */
> + ret = 0;
> + offset = 0;
> + *handle_req = false;
> + } else if (start < insitu_req_start_sector(req)) {
> + count = insitu_req_start_sector(req) - start;
> + offset = 0;
> + } else {
> + offset = insitu_req_end_sector(req) - start;
> + start = insitu_req_end_sector(req);
> + count = count - offset;
> + }
> +
> + /* Original data is uncompressed, so no writeback is needed */
> + if (ret == 1) {
> + comp_len = count << 9;
> + goto handle_meta;
> + }
> +
> + /* assume compressing less data uses less space (at least 4k less data) */
> + comp_len = io->comp_len;
> + ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> + io->decomp_data + (offset << 9), count << 9, true);
> + if (ret < 0) {
> + req->result = -EIO;
> + return -EIO;
> + }
> +
> + insitu_comp_get_req(req);
> + if (ret == 1)
> + io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
> + io->io_region.count = comp_len >> 9;
> + io->io_region.sector = start + req->info->data_start;
> +
> + io->io_req.bi_rw = insitu_req_rw(req);
> + dm_io(&io->io_req, 1, &io->io_region, NULL);
> +handle_meta:
> + insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> + count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> + page_index = insitu_comp_block_meta_page_index(start >>
> + INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> + if (*meta_start > page_index)
> + *meta_start = page_index;
> + page_index = insitu_comp_block_meta_page_index(
> + (start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
> + if (*meta_end < page_index)
> + *meta_end = page_index;
> + return 0;
> +}
> +
> +/* Compress data and write it out */
> +static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
> +{
> + struct insitu_comp_io_range *io;
> + sector_t count;
> + unsigned int comp_len;
> + u64 meta_start = -1L, meta_end = 0, page_index;
> + int ret;
> + bool handle_req = true;
> +
> + req->stage = STAGE_WRITE_COMP;
> +
> + if (req->result)
> + return;
> +
> + list_for_each_entry(io, &req->all_io, next) {
> + if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
> + &handle_req))
> + return;
> + }
> +
> + if (!handle_req)
> + goto update_meta;
> +
> + count = insitu_req_sectors(req);
> + io = insitu_comp_create_io_range(req, count << 9, count << 9);
> + if (!io) {
> + req->result = -EIO;
> + return;
> + }
> + insitu_comp_req_copy(req, 0, io->decomp_data, count << 9, true);
> +
> + /* compress data */
> + comp_len = io->comp_len;
> + ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
> + io->decomp_data, count << 9, true);
> + if (ret < 0) {
> + insitu_comp_free_io_range(io);
> + req->result = -EIO;
> + return;
> + }
> +
> + insitu_comp_get_req(req);
> + list_add_tail(&io->next, &req->all_io);
> + io->io_region.sector = insitu_req_start_sector(req) + req->info->data_start;
> + if (ret == 1)
> + io->io_req.mem.ptr.addr = io->decomp_data;
> + io->io_region.count = comp_len >> 9;
> + io->io_req.bi_rw = insitu_req_rw(req);
> + dm_io(&io->io_req, 1, &io->io_region, NULL);
> + insitu_comp_set_extent(req,
> + insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> + count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
> +
> + page_index = insitu_comp_block_meta_page_index(
> + insitu_req_start_sector(req) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
> + if (meta_start > page_index)
> + meta_start = page_index;
> + page_index = insitu_comp_block_meta_page_index(
> + (insitu_req_start_sector(req) + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
> + true);
> + if (meta_end < page_index)
> + meta_end = page_index;
> +update_meta:
> + if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
> + (insitu_req_rw(req) & REQ_FUA)) {
> + insitu_comp_get_req(req);
> + insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
> + insitu_comp_write_meta_done, insitu_req_rw(req));
> + }
> +}
> +
> +/* the request might partially cover some extents; read those extents first */
> +static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
> +{
> + u64 block_index, first_block_index;
> + u16 logical_sectors, data_sectors;
> +
> + req->stage = STAGE_READ_EXISTING;
> +
> + block_index = insitu_comp_sector_to_block(insitu_req_start_sector(req));
> + insitu_comp_get_extent(req->info, block_index, &first_block_index,
> + &logical_sectors, &data_sectors);
> + if (data_sectors > 0 && (first_block_index < block_index ||
> + first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> + insitu_comp_sector_to_block(insitu_req_end_sector(req))))
> + insitu_comp_read_one_extent(req, first_block_index,
> + logical_sectors, data_sectors);
> +
> + if (req->result)
> + return;
> +
> + if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
> + insitu_comp_sector_to_block(insitu_req_end_sector(req)))
> + goto out;
> +
> + block_index = insitu_comp_sector_to_block(insitu_req_end_sector(req)) - 1;
> + insitu_comp_get_extent(req->info, block_index, &first_block_index,
> + &logical_sectors, &data_sectors);
> + if (data_sectors > 0 &&
> + first_block_index + insitu_comp_sector_to_block(logical_sectors) >
> + block_index + 1)
> + insitu_comp_read_one_extent(req, first_block_index,
> + logical_sectors, data_sectors);
> +
> + if (req->result)
> + return;
> +out:
> + if (list_empty(&req->all_io))
> + insitu_comp_handle_write_comp(req);
> +}
> +
> +static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
> +{
> + insitu_comp_get_req(req);
> +
> + if (req->stage == STAGE_INIT) {
> + if (!insitu_comp_lock_req_range(req)) {
> + insitu_comp_put_req(req);
> + return;
> + }
> +
> + insitu_comp_handle_write_read_existing(req);
> + } else if (req->stage == STAGE_READ_EXISTING)
> + insitu_comp_handle_write_comp(req);
> +
> + insitu_comp_put_req(req);
> +}
> +
> +/* For writeback mode */
> +static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
> +{
> + struct writeback_flush_data wb;
> +
> + atomic_set(&wb.cnt, 1);
> + init_completion(&wb.complete);
> +
> + insitu_comp_flush_dirty_meta(req->info, &wb);
> +
> + writeback_flush_io_done(&wb, 0);
> + wait_for_completion(&wb.complete);
> +
> + insitu_req_endio(req, 0);
> +}
> +
> +static void insitu_comp_handle_request(struct insitu_comp_req *req)
> +{
> + if (insitu_req_rw(req) & REQ_FLUSH)
> + insitu_comp_handle_flush_request(req);
> + else if (insitu_req_rw(req) & REQ_WRITE)
> + insitu_comp_handle_write_request(req);
> + else
> + insitu_comp_handle_read_request(req);
> +}
> +
> +static void insitu_comp_do_request_work(struct work_struct *work)
> +{
> + struct insitu_comp_io_worker *worker = container_of(work,
> + struct insitu_comp_io_worker, work);
> + LIST_HEAD(list);
> + struct insitu_comp_req *req;
> + struct blk_plug plug;
> + bool repeat;
> +
> + blk_start_plug(&plug);
> +again:
> + spin_lock_irq(&worker->lock);
> + list_splice_init(&worker->pending, &list);
> + spin_unlock_irq(&worker->lock);
> +
> + repeat = !list_empty(&list);
> + while (!list_empty(&list)) {
> + req = list_first_entry(&list, struct insitu_comp_req, sibling);
> + list_del(&req->sibling);
> +
> + insitu_comp_handle_request(req);
> + }
> + if (repeat)
> + goto again;
> + blk_finish_plug(&plug);
> +}
> +
> +static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
> +{
> + struct insitu_comp_info *info = ti->private;
> + struct insitu_comp_req *req;
> +
> + req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
> +
> + if ((bio->bi_rw & REQ_FLUSH) &&
> + info->write_mode == INSITU_COMP_WRITE_THROUGH) {
> + bio->bi_bdev = info->dev->bdev;
> + return DM_MAPIO_REMAPPED;
> + }
> +
> + req->bio = bio;
> + req->info = info;
> + atomic_set(&req->io_pending, 0);
> + INIT_LIST_HEAD(&req->all_io);
> + req->result = 0;
> + req->stage = STAGE_INIT;
> +
> + req->cpu = raw_smp_processor_id();
> + insitu_comp_queue_req(info, req);
> +
> + return DM_MAPIO_SUBMITTED;
> +}
> +
> +/*
> + * INFO: uncompressed_data_size compressed_data_size metadata_size
> + * TABLE: writethrough/writeback commit_delay
> + */
> +static void insitu_comp_status(struct dm_target *ti, status_type_t type,
> + unsigned status_flags, char *result, unsigned maxlen)
> +{
> + struct insitu_comp_info *info = ti->private;
> + unsigned int sz = 0;
> +
> + switch (type) {
> + case STATUSTYPE_INFO:
> + DMEMIT("%lu %lu %lu",
> + atomic64_read(&info->uncompressed_write_size),
> + atomic64_read(&info->compressed_write_size),
> + atomic64_read(&info->meta_write_size));
> + break;
> + case STATUSTYPE_TABLE:
> + if (info->write_mode == INSITU_COMP_WRITE_BACK)
> + DMEMIT("%s %s %d", info->dev->name, "writeback",
> + info->writeback_delay);
> + else
> + DMEMIT("%s %s", info->dev->name, "writethrough");
> + break;
> + }
> +}
> +
> +static int insitu_comp_iterate_devices(struct dm_target *ti,
> + iterate_devices_callout_fn fn, void *data)
> +{
> + struct insitu_comp_info *info = ti->private;
> +
> + return fn(ti, info->dev, info->data_start,
> + info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
> +}
> +
> +static void insitu_comp_io_hints(struct dm_target *ti,
> + struct queue_limits *limits)
> +{
> + /* No blk_limits_logical_block_size */
> + limits->logical_block_size = limits->physical_block_size =
> + limits->io_min = INSITU_COMP_BLOCK_SIZE;
> + blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
> +}
> +
> +static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> + struct bio_vec *biovec, int max_size)
> +{
> + /* Guarantee a request can only cover one aligned 128k range */
> + return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
> + ((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
> +}
> +
> +static struct target_type insitu_comp_target = {
> + .name = "insitu_comp",
> + .version = {1, 0, 0},
> + .module = THIS_MODULE,
> + .ctr = insitu_comp_ctr,
> + .dtr = insitu_comp_dtr,
> + .map = insitu_comp_map,
> + .status = insitu_comp_status,
> + .iterate_devices = insitu_comp_iterate_devices,
> + .io_hints = insitu_comp_io_hints,
> + .merge = insitu_comp_merge,
> +};
> +
> +static int __init insitu_comp_init(void)
> +{
> + int r;
> +
> + for (r = 0; r < ARRAY_SIZE(compressors); r++)
> + if (crypto_has_comp(compressors[r].name, 0, 0))
> + break;
> + if (r >= ARRAY_SIZE(compressors)) {
> + DMWARN("No crypto compressors are supported");
> + return -EINVAL;
> + }
> +
> + default_compressor = r;
> +
> + r = -ENOMEM;
> + insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
> + sizeof(struct insitu_comp_io_range), 0, 0, NULL);
> + if (!insitu_comp_io_range_cachep) {
> + DMWARN("Can't create io_range cache");
> + goto err;
> + }
> +
> + insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
> + sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
> + if (!insitu_comp_meta_io_cachep) {
> + DMWARN("Can't create meta_io cache");
> + goto err;
> + }
> +
> + insitu_comp_wq = alloc_workqueue("insitu_comp_io",
> + WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
> + if (!insitu_comp_wq) {
> + DMWARN("Can't create io workqueue");
> + goto err;
> + }
> +
> + r = dm_register_target(&insitu_comp_target);
> + if (r < 0) {
> + DMWARN("target registration failed");
> + goto err;
> + }
> +
> + for_each_possible_cpu(r) {
> + INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
> + spin_lock_init(&insitu_comp_io_workers[r].lock);
> + INIT_WORK(&insitu_comp_io_workers[r].work,
> + insitu_comp_do_request_work);
> + }
> + return 0;
> +err:
> + if (insitu_comp_io_range_cachep)
> + kmem_cache_destroy(insitu_comp_io_range_cachep);
> + if (insitu_comp_meta_io_cachep)
> + kmem_cache_destroy(insitu_comp_meta_io_cachep);
> + if (insitu_comp_wq)
> + destroy_workqueue(insitu_comp_wq);
> +
> + return r;
> +}
> +
> +static void __exit insitu_comp_exit(void)
> +{
> + dm_unregister_target(&insitu_comp_target);
> + kmem_cache_destroy(insitu_comp_io_range_cachep);
> + kmem_cache_destroy(insitu_comp_meta_io_cachep);
> + destroy_workqueue(insitu_comp_wq);
> +}
> +
> +module_init(insitu_comp_init);
> +module_exit(insitu_comp_exit);
> +
> +MODULE_AUTHOR("Shaohua Li <shli@...nel.org>");
> +MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
> +MODULE_LICENSE("GPL");
> Index: linux/drivers/md/dm-insitu-comp.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/drivers/md/dm-insitu-comp.h 2014-02-17 18:37:07.108425465 +0800
> @@ -0,0 +1,158 @@
> +#ifndef __DM_INSITU_COMPRESSION_H__
> +#define __DM_INSITU_COMPRESSION_H__
> +#include <linux/types.h>
> +
> +struct insitu_comp_super_block {
> + __le64 magic;
> + __le64 version;
> + __le64 meta_blocks;
> + __le64 data_blocks;
> + u8 comp_alg;
> +} __attribute__((packed));
> +
> +#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
> +#define INSITU_COMP_VERSION 1
> +#define INSITU_COMP_ALG_LZO 0
> +#define INSITU_COMP_ALG_ZLIB 1
> +
> +#ifdef __KERNEL__
> +struct insitu_comp_compressor_data {
> + char *name;
> + int (*comp_len)(int comp_len);
> +};
> +
> +static inline int lzo_comp_len(int comp_len)
> +{
> + return lzo1x_worst_compress(comp_len);
> +}
> +
> +/*
> + * The minimum logical sector size of this target is 4096 bytes, which is a
> + * block. The data of a block is compressed. The compressed data, rounded up
> + * to 512B, is the payload. For each block, we have 5 bits of metadata: bits
> + * 0-3 represent the payload length (0-8 sectors). If the compressed payload
> + * length would be 8 sectors, we just store the uncompressed data. The actual
> + * compressed data length is stored in the last 32 bits of the payload if the
> + * data is compressed. On disk, the payload is stored at the beginning of the
> + * block's logical sectors. If the IO size is bigger than one block, we store
> + * the whole data as an extent. Bit 4 marks a tail block of an extent. The
> + * maximum allowed extent size is 128k.
> + */
> +#define INSITU_COMP_BLOCK_SIZE 4096
> +#define INSITU_COMP_BLOCK_SHIFT 12
> +#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
> +
> +#define INSITU_COMP_MIN_SIZE 4096
> +/* Changing this requires changing HASH_LOCK_SHIFT too */
> +#define INSITU_COMP_MAX_SIZE (128 * 1024)
> +
> +#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
> +#define INSITU_COMP_TAIL_MASK (1 << 4)
> +#define INSITU_COMP_META_BITS 5
> +
> +#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
> +
> +enum INSITU_COMP_WRITE_MODE {
> + INSITU_COMP_WRITE_BACK,
> + INSITU_COMP_WRITE_THROUGH,
> +};
> +
> +/*
> + * A request can only cover one aligned 128k (4k * (1 << 5)) range. Since the
> + * maximum request size is 128k, we only need to take one lock per request
> + */
> +#define HASH_LOCK_SHIFT 5
> +
> +#define BITMAP_HASH_SHIFT 9
> +#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
> +#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
> +
> +struct insitu_comp_hash_lock {
> + int io_running;
> + spinlock_t wait_lock;
> + struct list_head wait_list;
> +};
> +
> +struct insitu_comp_info {
> + struct dm_target *ti;
> + struct dm_dev *dev;
> +
> + int comp_alg;
> + struct crypto_comp *tfm[NR_CPUS];
> +
> + sector_t data_start;
> + u64 data_blocks;
> +
> + char *meta_bitmap;
> + u64 meta_bitmap_bits;
> + u64 meta_bitmap_pages;
> + struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
> +
> + enum INSITU_COMP_WRITE_MODE write_mode;
> + unsigned int writeback_delay; /* in seconds */
> + struct task_struct *writeback_tsk;
> + struct dm_io_client *io_client;
> +
> + atomic64_t compressed_write_size;
> + atomic64_t uncompressed_write_size;
> + atomic64_t meta_write_size;
> +};
> +
> +struct insitu_comp_meta_io {
> + struct dm_io_request io_req;
> + struct dm_io_region io_region;
> + void *data;
> + void (*fn)(void *data, unsigned long error);
> +};
> +
> +struct insitu_comp_io_range {
> + struct dm_io_request io_req;
> + struct dm_io_region io_region;
> + void *decomp_data;
> + unsigned int decomp_len;
> + void *comp_data;
> + unsigned int comp_len; /* For write, this is estimated */
> + struct list_head next;
> + struct insitu_comp_req *req;
> +};
> +
> +enum INSITU_COMP_REQ_STAGE {
> + STAGE_INIT,
> + STAGE_READ_EXISTING,
> + STAGE_READ_DECOMP,
> + STAGE_WRITE_COMP,
> + STAGE_DONE,
> +};
> +
> +struct insitu_comp_req {
> + struct bio *bio;
> + struct insitu_comp_info *info;
> + struct list_head sibling;
> +
> + struct list_head all_io;
> + atomic_t io_pending;
> + enum INSITU_COMP_REQ_STAGE stage;
> +
> + struct insitu_comp_hash_lock *lock;
> + int result;
> +
> + int cpu;
> +};
> +
> +#define insitu_req_start_sector(req) (req->bio->bi_iter.bi_sector)
> +#define insitu_req_end_sector(req) (bio_end_sector(req->bio))
> +#define insitu_req_rw(req) (req->bio->bi_rw)
> +#define insitu_req_sectors(req) (bio_sectors(req->bio))
> +
> +static inline void insitu_req_endio(struct insitu_comp_req *req, int error)
> +{
> + bio_endio(req->bio, error);
> +}
> +
> +struct insitu_comp_io_worker {
> + struct list_head pending;
> + spinlock_t lock;
> + struct work_struct work;
> +};
> +#endif
> +
> +#endif
> Index: linux/Documentation/device-mapper/insitu-comp.txt
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/Documentation/device-mapper/insitu-comp.txt 2014-02-17 17:34:45.427464765 +0800
> @@ -0,0 +1,50 @@
> +This is a simple DM target supporting compression, intended for SSDs only.
> +The underlying SSD must support a 512B sector size; the target itself only
> +supports a 4k sector size.
> +
> +Disk layout:
> +|super|...meta...|..data...|
> +
> +The storage unit is 4k (a block). The super block is 1 block, which stores
> +the meta and data sizes and the compression algorithm. The meta area is a
> +bitmap, with 5 bits of metadata for each data block.
> +
> +Data:
> +The data of a block is compressed. The compressed data, rounded up to 512B,
> +is the payload. On disk, the payload is stored at the beginning of the
> +block's logical sectors. Let's look at an example. Say we store data to
> +block A, which starts at sector B (A*8); its original size is 4k and its
> +compressed size is 1500 bytes. The compressed data (CD) will use 3 sectors
> +(512B each). These 3 sectors are the payload, which will be stored starting
> +at sector B.
> +
> +---------------------------------------------------
> +... | CD1 | CD2 | CD3 | | | | | | ...
> +---------------------------------------------------
> + ^B ^B+1 ^B+2 ^B+7 ^B+8
> +
> +For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta
> +bits to represent the payload size. The compressed size (1500) isn't stored
> +in the meta bits directly. Instead, we store it in the last 32 bits of the
> +payload; in this example, at the end of sector B+2. If the compressed size
> +plus the 32-bit length field crosses a sector boundary, the payload size
> +increases by one sector. If the payload would use all 8 sectors, we store
> +the uncompressed data directly.
> +
> +If the IO size is bigger than one block, we can store the data as an extent.
> +The data of the whole extent is compressed and stored in a similar way to
> +the above. The first block of the extent is the head; all others are tails.
> +If the extent is 1 block, that block is the head. We have 1 meta bit to
> +indicate whether a block is a head or a tail. If the 4 meta bits of the head
> +block can't hold the extent payload size, we borrow the tail blocks' meta
> +bits to store it. The maximum allowed extent size is 128k, so we never
> +compress/decompress too large a chunk of data.
> +
> +Meta:
> +Modifying data modifies the metadata too. Metadata is written (flushed) to
> +disk according to the metadata write policy; writeback and writethrough
> +modes are supported. In writeback mode, metadata is written to disk at a
> +configurable interval or upon a FLUSH request. In writethrough mode, data
> +and metadata are written to disk together.
> +
> +=========================
> +Parameters: <dev> [<writethrough>|<writeback> <meta_commit_delay>]
> + <dev>: underlying device
> + <writethrough>: flush metadata to disk in writethrough mode
> + <writeback>: flush metadata to disk in writeback mode
> + <meta_commit_delay>: metadata flush interval (in seconds) for writeback mode
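A hypothetical invocation matching the parameter list above (the device path and size are illustrative, and the device must already carry a valid superblock):

```shell
# Map 8388608 sectors (4GB) of /dev/sdb as a compressed target in
# writeback mode with a 10-second metadata commit delay.
dmsetup create comp0 --table "0 8388608 insitu_comp /dev/sdb writeback 10"

# Writethrough mode takes no commit-delay argument.
dmsetup create comp1 --table "0 8388608 insitu_comp /dev/sdb writethrough"
```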