Message-Id: <1474526565-6676-2-git-send-email-minchan@kernel.org>
Date:   Thu, 22 Sep 2016 15:42:44 +0900
From:   Minchan Kim <minchan@...nel.org>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        linux-kernel@...r.kernel.org, Minchan Kim <minchan@...nel.org>
Subject: [PATCH 2/3] zram: support page-based parallel write

zram supports stream-based parallel compression. IOW, it can compress
in parallel on an SMP system only if every CPU has its own stream to
work on. For example, on a 4-CPU system there must be four compression
sources in the system, one running on each CPU, to get fully parallel
compression.

So if there is only *one* stream in the system, it cannot be compressed
in parallel even though the system has multiple CPUs. This patch aims
to overcome that weakness.

The idea is to use multiple background threads to compress pages on
idle CPUs while the foreground merely queues BIOs without being
interrupted, and the other CPUs consume the pages in those BIOs for
compression. In other words, zram begins to support asynchronous
writeback to increase write bandwidth.
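
In condensed form, the flow looks roughly like this (a simplified
sketch using identifiers from the patch below; plugging, batching
details and error handling are omitted):

	/*
	 * Producer: instead of compressing inline, the make_request
	 * path queues one page_request per segment and returns.
	 */
	spin_lock(&workers.req_lock);
	list_add(&page_req->list, &workers.req_list);
	workers.nr_req++;
	worker_wake_up();		/* wake enough zramd-N threads */
	spin_unlock(&workers.req_lock);

	/*
	 * Consumer: each zramd-N kthread isolates up to NR_BATCH_PAGES
	 * requests and performs the actual compression/store.
	 */
	while (!kthread_should_stop()) {
		get_page_requests(&page_list);
		page_requests_rw(&page_list);	/* zram_bvec_rw() + bio end */
		cond_resched();
	}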

1) cp A to B, as an example of single-stream compression:
improved by 36%.

x86_64, 4 CPU
Copy kernel source to zram
old: 3.4s, new: 2.2s

2) per-process reclaim to swap: 524M
x86_64, 4 CPU:
old: 1.2s, new: 0.3s

3) FIO benchmark
Random read was worse, so only writes are supported at the moment.
We might revisit asynchronous reads later.

x86_64, 4 CPU (table columns: old, new, new/old):

FIO 1 job, /dev/zram0

      seq-write       5538      10116     182.67%
     rand-write      59616      98195     164.71%
       seq-read       7369       7257      98.48%
      rand-read      82373      82925     100.67%
   mixed-seq(R)       2815       4059     144.19%
   mixed-seq(W)       2805       4045     144.21%
  mixed-rand(R)      32943      42378     128.64%
  mixed-rand(W)      32869      42282     128.64%

FIO 4 job, /dev/zram0

      seq-write      16786      31770     189.26%
     rand-write     215260     408069     189.57%
       seq-read      57146      56888      99.55%
      rand-read     664328     671475     101.08%
   mixed-seq(R)      11238      20624     183.52%
   mixed-seq(W)      11199      20551     183.51%
  mixed-rand(R)     104841     188426     179.73%
  mixed-rand(W)     104605     188001     179.72%

FIO 1 job, ext4

      seq-write      16305      27601     169.28%
     rand-write     177941     275477     154.81%
       seq-read      38066      37855      99.45%
      rand-read     206088     200845      97.46%
   mixed-seq(R)      11983      17357     144.85%
   mixed-seq(W)      11941      17296     144.85%
  mixed-rand(R)      87949     107100     121.78%
  mixed-rand(W)      87750     106859     121.78%

FIO 4 job, ext4

      seq-write      55879      53056      94.95%
     rand-write     536301     519199      96.81%
       seq-read     127204     126713      99.61%
      rand-read     629397     636117     101.07%
   mixed-seq(R)      40089      38045      94.90%
   mixed-seq(W)      39949      37911      94.90%
  mixed-rand(R)     276194     263282      95.33%
  mixed-rand(W)     275571     262688      95.32%

For a single stream, the new code is much better than or ties with the
old code in most cases, but the same does not seem to hold for the
4-stream cases. This is expected because asynchronous IO causes more
allocations, instructions, context switches and lock contention, so it
loses 4~5% in the write benchmarks. Therefore, the asynchronous model
is not a win for full-stream workloads compared to the old code. It
also wouldn't be good for power-sensitive systems, so I want to make it
an optional feature at this moment.
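
For reference, this is how the knob would be used on a hypothetical
zram0 device (async write is the default when CONFIG_ZRAM_ASYNC_IO is
enabled, and it can only be changed before the disksize is set):

  # keep the old synchronous behaviour for this device
  echo 0 > /sys/block/zram0/use_aio
  echo 1G > /sys/block/zram0/disksize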

Signed-off-by: Minchan Kim <minchan@...nel.org>
---
 Documentation/ABI/testing/sysfs-block-zram |   8 +
 Documentation/blockdev/zram.txt            |   1 +
 drivers/block/zram/Kconfig                 |  15 +
 drivers/block/zram/zram_drv.c              | 450 ++++++++++++++++++++++++++++-
 drivers/block/zram/zram_drv.h              |   3 +
 5 files changed, 474 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 4518d30b8c2e..5b1cd62aec66 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -175,3 +175,11 @@ Contact:	Sergey Senozhatsky <sergey.senozhatsky@...il.com>
 		device's debugging info useful for kernel developers. Its
 		format is not documented intentionally and may change
 		anytime without any notice.
+
+What:		/sys/block/zram<id>/use_aio
+Date:		July 2017
+Contact:	Minchan Kim <minchan@...nel.org>
+Description:
+		The use_aio file is read/write and enables asynchronous
+		write support using multiple CPUs.
+		It increases write throughput on multi-CPU systems.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 0535ae1f73e5..3692be05d1c7 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -183,6 +183,7 @@ pages_compacted   RO    the number of pages freed during compaction
                         (available only via zram<id>/mm_stat node)
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
+use_aio		  RW	enable asynchronous compression using multiple CPUs
 
 WARNING
 =======
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index b8ecba6dcd3b..20ca28563c05 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,3 +13,18 @@ config ZRAM
 	  disks and maybe many more.
 
 	  See zram.txt for more information.
+
+config ZRAM_ASYNC_IO
+	bool "support asynchronous IO"
+	depends on ZRAM && SMP
+	default n
+	help
+	  By default zram uses stream-based compression, so it can use
+	  every CPU only if every CPU has its own stream. However, that
+	  gives poor performance on single-stream workloads.
+
+	  This implementation uses a background thread per core and
+	  compresses a single stream using multiple cores.
+
+	  You can disable the feature with "echo 0 > /sys/block/zramX/use_aio"
+	  before setting the disksize.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6fa3ee6fabb7..abab76e2cf28 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/vmalloc.h>
+#include <linux/kthread.h>
 #include <linux/err.h>
 #include <linux/idr.h>
 #include <linux/sysfs.h>
@@ -366,6 +367,48 @@ static ssize_t comp_algorithm_store(struct device *dev,
 	return len;
 }
 
+#ifdef CONFIG_ZRAM_ASYNC_IO
+static ssize_t use_aio_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	bool val;
+	struct zram *zram = dev_to_zram(dev);
+
+	down_read(&zram->init_lock);
+	val = zram->use_aio;
+	up_read(&zram->init_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t use_aio_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	int ret;
+	u16 do_async;
+	struct zram *zram  = dev_to_zram(dev);
+
+	ret = kstrtou16(buf, 10, &do_async);
+	if (ret)
+		return ret;
+
+	down_write(&zram->init_lock);
+	if (init_done(zram)) {
+		up_write(&zram->init_lock);
+		pr_info("Can't change for initialized device\n");
+		return -EBUSY;
+	}
+
+	if (do_async)
+		zram->use_aio = true;
+	else
+		zram->use_aio = false;
+	up_write(&zram->init_lock);
+
+	return len;
+}
+#endif
+
 static ssize_t compact_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
@@ -923,6 +966,389 @@ static void __zram_make_sync_request(struct zram *zram, struct bio *bio)
 	zram_meta_put(zram);
 }
 
+
+#ifdef CONFIG_ZRAM_ASYNC_IO
+/*
+ * The number of pages a worker can isolate from the request queue at once.
+ * It's a magic number derived from experiments on x86 and ARM.
+ */
+const int NR_BATCH_PAGES = 32;
+
+struct zram_worker {
+	struct task_struct *task;
+	struct list_head list;
+};
+
+struct zram_workers {
+	struct list_head req_list;
+	unsigned int nr_req;
+
+	int nr_running;
+	struct list_head worker_list;
+	wait_queue_head_t req_wait;
+	spinlock_t req_lock;
+} workers;
+
+struct bio_request {
+	struct bio *bio;   /* the bio to complete when IO is done */
+	atomic_t nr_pages; /* the number of pages the bio includes */
+};
+
+struct page_request {
+	struct zram *zram;
+	struct bio_request *bio_req;
+	struct bio_vec bvec;
+	u32 index;
+	int offset;
+	bool write;
+	struct list_head list;
+};
+
+static void worker_wake_up(void)
+{
+	/*
+	 * Unless there are enough workers to handle the accumulated
+	 * requests, wake up more workers.
+	 */
+	if (workers.nr_running * NR_BATCH_PAGES < workers.nr_req) {
+		int nr_wakeup = (workers.nr_req + NR_BATCH_PAGES) /
+				NR_BATCH_PAGES - workers.nr_running;
+
+		wake_up_nr(&workers.req_wait, nr_wakeup);
+	}
+}
+
+static void zram_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+	spin_lock(&workers.req_lock);
+	if (workers.nr_req)
+		worker_wake_up();
+	spin_unlock(&workers.req_lock);
+	kfree(cb);
+}
+
+static int zram_check_plugged(void)
+{
+	return !!blk_check_plugged(zram_unplug, NULL,
+			sizeof(struct blk_plug_cb));
+}
+
+int queue_page_request(struct zram *zram, struct bio_vec *bvec, u32 index,
+			int offset, bool write)
+{
+	struct page_request *page_req = kmalloc(sizeof(*page_req), GFP_NOIO);
+
+	if (!page_req)
+		return -ENOMEM;
+
+	page_req->bio_req = NULL;
+	page_req->zram = zram;
+	page_req->bvec = *bvec;
+	page_req->index = index;
+	page_req->offset = offset;
+	page_req->write = write;
+
+	spin_lock(&workers.req_lock);
+	list_add(&page_req->list, &workers.req_list);
+	workers.nr_req++;
+	if (!zram_check_plugged())
+		worker_wake_up();
+	spin_unlock(&workers.req_lock);
+
+	return 0;
+}
+
+int queue_page_request_list(struct zram *zram, struct bio_request *bio_req,
+			struct bio_vec *bvec, u32 index, int offset,
+			bool write, struct list_head *page_list)
+{
+	struct page_request *page_req = kmalloc(sizeof(*page_req), GFP_NOIO);
+
+	if (!page_req) {
+		while (!list_empty(page_list)) {
+			page_req = list_first_entry(page_list,
+					struct page_request, list);
+			list_del(&page_req->list);
+			kfree(page_req);
+		}
+
+		return -ENOMEM;
+	}
+
+	page_req->bio_req = bio_req;
+	page_req->zram = zram;
+	page_req->bvec = *bvec;
+	page_req->index = index;
+	page_req->offset = offset;
+	page_req->write = write;
+	atomic_inc(&bio_req->nr_pages);
+
+	list_add_tail(&page_req->list, page_list);
+
+	return 0;
+}
+
+/*
+ * @page_list: pages isolated from request queue
+ */
+static void get_page_requests(struct list_head *page_list)
+{
+	struct page_request *page_req;
+	int nr_pages;
+
+	for (nr_pages = 0; nr_pages < NR_BATCH_PAGES; nr_pages++) {
+		if (list_empty(&workers.req_list))
+			break;
+
+		page_req = list_first_entry(&workers.req_list,
+					struct page_request, list);
+		list_move(&page_req->list, page_list);
+	}
+
+	workers.nr_req -= nr_pages;
+}
+
+/*
+ * Move @page_list to the request queue and wake up workers if necessary.
+ */
+static void run_worker(struct bio *bio, struct list_head *page_list,
+			unsigned int nr_pages)
+{
+	WARN_ON(list_empty(page_list));
+
+	spin_lock(&workers.req_lock);
+	list_splice_tail(page_list, &workers.req_list);
+	workers.nr_req += nr_pages;
+	if (bio->bi_opf & REQ_SYNC || !zram_check_plugged())
+		worker_wake_up();
+	spin_unlock(&workers.req_lock);
+}
+
+static bool __zram_make_async_request(struct zram *zram, struct bio *bio)
+{
+	int offset;
+	u32 index;
+	struct bio_vec bvec;
+	struct bvec_iter iter;
+	LIST_HEAD(page_list);
+	struct bio_request *bio_req;
+	unsigned int nr_pages = 0;
+	bool write = op_is_write(bio_op(bio));
+
+	if (!zram->use_aio || !op_is_write(bio_op(bio)))
+		return false;
+
+	bio_req = kmalloc(sizeof(*bio_req), GFP_NOIO);
+	if (!bio_req)
+		return false;
+
+	index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+	offset = (bio->bi_iter.bi_sector &
+		  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
+
+	bio_req->bio = bio;
+	atomic_set(&bio_req->nr_pages, 0);
+
+	bio_for_each_segment(bvec, bio, iter) {
+		int max_transfer_size = PAGE_SIZE - offset;
+
+		if (bvec.bv_len > max_transfer_size) {
+			struct bio_vec bv;
+
+			bv.bv_page = bvec.bv_page;
+			bv.bv_len = max_transfer_size;
+			bv.bv_offset = bvec.bv_offset;
+
+			if (queue_page_request_list(zram, bio_req, &bv,
+					index, offset, write, &page_list))
+				goto fail;
+
+			bv.bv_len = bvec.bv_len - max_transfer_size;
+			bv.bv_offset += max_transfer_size;
+			if (queue_page_request_list(zram, bio_req, &bv,
+					index + 1, 0, write, &page_list))
+				goto fail;
+		} else {
+			if (queue_page_request_list(zram, bio_req, &bvec,
+					index, offset, write, &page_list))
+				goto fail;
+		}
+
+		nr_pages++;
+		update_position(&index, &offset, &bvec);
+	}
+
+	run_worker(bio, &page_list, nr_pages);
+
+	return true;
+fail:
+	kfree(bio_req);
+
+	WARN_ON(!list_empty(&page_list));
+
+	return false;
+}
+
+void page_requests_rw(struct list_head *page_list)
+{
+	struct page_request *page_req;
+	bool write;
+	struct page *page;
+	struct zram *zram;
+	int err;
+	bool free_bio;
+	struct bio_request *bio_req;
+
+	while (!list_empty(page_list)) {
+		free_bio = false;
+		page_req = list_last_entry(page_list, struct page_request,
+					list);
+		write = page_req->write;
+		page = page_req->bvec.bv_page;
+		zram = page_req->zram;
+		bio_req = page_req->bio_req;
+		if (bio_req && atomic_dec_and_test(&bio_req->nr_pages))
+			free_bio = true;
+		list_del(&page_req->list);
+
+		err = zram_bvec_rw(zram, &page_req->bvec, page_req->index,
+			page_req->offset, page_req->write);
+
+		kfree(page_req);
+
+		/* page-based request */
+		if (!bio_req) {
+			page_endio(page, write, err);
+			zram_meta_put(zram);
+		/* bio-based request */
+		} else if (free_bio) {
+			if (likely(!err))
+				bio_endio(bio_req->bio);
+			else
+				bio_io_error(bio_req->bio);
+			kfree(bio_req);
+			zram_meta_put(zram);
+		}
+	}
+}
+
+static int zram_thread(void *data)
+{
+	DEFINE_WAIT(wait);
+	LIST_HEAD(page_list);
+
+	spin_lock(&workers.req_lock);
+	workers.nr_running++;
+
+	while (!kthread_should_stop()) {
+		if (list_empty(&workers.req_list)) {
+			prepare_to_wait_exclusive(&workers.req_wait, &wait,
+					TASK_INTERRUPTIBLE);
+			workers.nr_running--;
+			spin_unlock(&workers.req_lock);
+			schedule();
+			spin_lock(&workers.req_lock);
+			finish_wait(&workers.req_wait, &wait);
+			workers.nr_running++;
+			continue;
+		}
+
+		get_page_requests(&page_list);
+		if (list_empty(&page_list))
+			continue;
+
+		spin_unlock(&workers.req_lock);
+		page_requests_rw(&page_list);
+		cond_resched();
+		spin_lock(&workers.req_lock);
+	}
+
+	workers.nr_running--;
+	spin_unlock(&workers.req_lock);
+
+	return 0;
+}
+
+static void destroy_workers(void)
+{
+	struct zram_worker *worker;
+
+	while (!list_empty(&workers.worker_list)) {
+		worker = list_first_entry(&workers.worker_list,
+				struct zram_worker,
+				list);
+		kthread_stop(worker->task);
+		list_del(&worker->list);
+		kfree(worker);
+	}
+
+	WARN_ON(workers.nr_running);
+}
+
+static int create_workers(void)
+{
+	int i;
+	int nr_cpu = num_online_cpus();
+	struct zram_worker *worker;
+
+	INIT_LIST_HEAD(&workers.worker_list);
+	INIT_LIST_HEAD(&workers.req_list);
+	spin_lock_init(&workers.req_lock);
+	init_waitqueue_head(&workers.req_wait);
+
+	for (i = 0; i < nr_cpu; i++) {
+		worker = kmalloc(sizeof(*worker), GFP_KERNEL);
+		if (!worker)
+			goto error;
+
+		worker->task = kthread_run(zram_thread, NULL, "zramd-%d", i);
+		if (IS_ERR(worker->task)) {
+			kfree(worker);
+			goto error;
+		}
+
+		list_add(&worker->list, &workers.worker_list);
+	}
+
+	return 0;
+
+error:
+	destroy_workers();
+	return 1;
+}
+
+static int zram_rw_async_page(struct zram *zram,
+			struct bio_vec *bv, u32 index, int offset,
+			bool is_write)
+{
+	if (!zram->use_aio || !is_write)
+		return 1;
+
+	return queue_page_request(zram, bv, index, offset, is_write);
+}
+#else
+static bool __zram_make_async_request(struct zram *zram, struct bio *bio)
+{
+	return false;
+}
+
+static int zram_rw_async_page(struct zram *zram,
+			struct bio_vec *bv, u32 index, int offset,
+			bool is_write)
+{
+	return 1;
+}
+
+static int create_workers(void)
+{
+	return 0;
+}
+
+static void destroy_workers(void)
+{
+}
+#endif
+
 /*
  * Handler function for all zram I/O requests.
  */
@@ -954,6 +1380,9 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 		goto out;
 	}
 
+	if (__zram_make_async_request(zram, bio))
+		goto out;
+
 	__zram_make_sync_request(zram, bio);
 out:
 	return BLK_QC_T_NONE;
@@ -1028,8 +1457,12 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_len = PAGE_SIZE;
 	bv.bv_offset = 0;
 
-	err = zram_rw_sync_page(bdev, zram, &bv, index, offset, is_write);
+	err = zram_rw_async_page(zram, &bv, index, offset, is_write);
+	if (!err)
+		goto out;
 
+	err = zram_rw_sync_page(bdev, zram, &bv, index, offset, is_write);
+out:
 	return err;
 }
 
@@ -1066,7 +1499,6 @@ static void zram_reset_device(struct zram *zram)
 	/* Reset stats */
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zram->disksize = 0;
-
 	set_capacity(zram->disk, 0);
 	part_stat_set_all(&zram->disk->part0, 0);
 
@@ -1211,6 +1643,9 @@ static DEVICE_ATTR_RW(mem_limit);
 static DEVICE_ATTR_RW(mem_used_max);
 static DEVICE_ATTR_RW(max_comp_streams);
 static DEVICE_ATTR_RW(comp_algorithm);
+#ifdef CONFIG_ZRAM_ASYNC_IO
+static DEVICE_ATTR_RW(use_aio);
+#endif
 
 static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_disksize.attr,
@@ -1231,6 +1666,9 @@ static struct attribute *zram_disk_attrs[] = {
 	&dev_attr_mem_used_max.attr,
 	&dev_attr_max_comp_streams.attr,
 	&dev_attr_comp_algorithm.attr,
+#ifdef CONFIG_ZRAM_ASYNC_IO
+	&dev_attr_use_aio.attr,
+#endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
 	&dev_attr_debug_stat.attr,
@@ -1261,7 +1699,9 @@ static int zram_add(void)
 	device_id = ret;
 
 	init_rwsem(&zram->init_lock);
-
+#ifdef CONFIG_ZRAM_ASYNC_IO
+	zram->use_aio = true;
+#endif
 	queue = blk_alloc_queue(GFP_KERNEL);
 	if (!queue) {
 		pr_err("Error allocating disk queue for device %d\n",
@@ -1485,6 +1925,9 @@ static int __init zram_init(void)
 		num_devices--;
 	}
 
+	if (create_workers())
+		goto out_error;
+
 	return 0;
 
 out_error:
@@ -1495,6 +1938,7 @@ static int __init zram_init(void)
 static void __exit zram_exit(void)
 {
 	destroy_devices();
+	destroy_workers();
 }
 
 module_init(zram_init);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 74fcf10da374..3f62501f619f 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -119,5 +119,8 @@ struct zram {
 	 * zram is claimed so open request will be failed
 	 */
 	bool claim; /* Protected by bdev->bd_mutex */
+#ifdef CONFIG_ZRAM_ASYNC_IO
+	bool use_aio;
+#endif
 };
 #endif
-- 
2.7.4
