Message-Id: <1374331018-11045-3-git-send-email-bob.liu@oracle.com>
Date:	Sat, 20 Jul 2013 22:36:58 +0800
From:	Bob Liu <lliubbo@...il.com>
To:	unlisted-recipients:; (no To-header on input)
Cc:	linux-kernel@...r.kernel.org, sjenning@...ux.vnet.ibm.com,
	gregkh@...uxfoundation.org, ngupta@...are.org, minchan@...nel.org,
	konrad.wilk@...cle.com, rcj@...ux.vnet.ibm.com, mgorman@...e.de,
	riel@...hat.com, penberg@...nel.org, akpm@...ux-foundation.org,
	Bob Liu <bob.liu@...cle.com>
Subject: [PATCH 2/2] mm: zcache: core functions added

This patch adds a cleancache backend that compresses clean file pages.

Nitin Gupta already did a lot of work on this topic back in 2010; see:
http://lwn.net/Articles/396467/
http://thread.gmane.org/gmane.linux.kernel.mm/50523
At that time, however, neither the allocation layer nor cleancache had
been merged upstream.

(Most of the description below is copied from his patch.)
Frequently accessed filesystem data is kept in memory to reduce accesses
to (much) slower backing disks. Under memory pressure, these pages are
freed, and when they are needed again they have to be read back from
disk. When the combined working set of all running applications exceeds
the amount of physical RAM, we get an extreme slowdown, as reading a
page from disk can take on the order of milliseconds.

Memory compression increases the effective memory size and allows more
pages to stay in RAM. Since de/compressing memory pages is several
orders of magnitude faster than disk I/O, this can provide significant
performance gains for many workloads. Also, with multi-core CPUs being
common, the benefit of reduced disk I/O should easily outweigh the cost
of increased CPU usage.

Zcache is implemented as a "backend" for cleancache, which provides
callbacks for events such as a page being removed from the page cache
or needed again. We use these callbacks to implement a 'second chance'
cache for evicted page cache pages, compressing and storing them in
memory.

We use the zbud memory allocator, which is already merged and used by
zswap for the same purpose.

A separate "pool" is created for each mount instance of a
cleancache-aware filesystem. Each incoming page is identified by
<pool_id, inode_no, index>, where inode_no identifies the file within
the filesystem corresponding to pool_id, and index is the offset of the
page within that inode. Within a pool, inodes are maintained in an
rbtree, and each of its nodes points to a separate radix tree which
tracks the pages within that inode.

Tests were done on a physical machine (not in a VM), using only some
simple iozone runs.
Kernel: 3.10.0+ (with Seth's zswap patches), booted with command line
"single mem=1G zcache.enabled=1"
CPU: Core i5, 4 cores
RAM: 1G
Booted into single-user mode; ran "swapoff -a" to avoid swap I/O
interference.
Test command:
iozone -a -M -B -s 1G -y 4k -+u -R -b iozone.xls

Results:
Without zcache
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read
         1048576       4   73070   36713    75187    74916    1292     674   26724    45514     8491
         1048576       8   68777   34407    71162    73407    2703    1233   27800    42417     5548
         1048576      16   70030   35156    71496    71186    4462    2181   31480    42040     8737
         1048576      32   68796   35442    72557    72734    6955    3615   40783    44627    10959
         1048576      64   76576   35247    68685    72279   11383    6615   49913    44620    14777
         1048576     128   77516   33878    72538    71231   17897   10896   64393    48613    35265
         1048576     256   73465   37450    22857    22854   22845   16655   61152    49170    40571
         1048576     512   68096   35303    22265    22398   36154   23952   58286    48594    39972
         1048576    1024   69442   36192    22206    22541   38927   29643   60695    50483    44120
         1048576    2048   69404   34891    22951    23286   40036   35984   62095    50704    43764
         1048576    4096   68871   35716    23147    23272   41587   37619   66809    47613    45229
         1048576    8192   66775   36409    22997    23070   39023   39674   69262    55548    49324
         1048576   16384   68728   34792    23829    24022   39922   42496   65714    44732    39104
Average:		   70734   35507    45529    45938   23760   19326   52700    47283    29682

With zcache
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read
         1048576       4   80330   35970   120942   115559    1491     686   29923    47864    11177
         1048576       8   69553   36658   134402   105351    2978    1332   33291    45885     6331
         1048576      16   79538   36719   149905   155649    5970    2305   38136    42365    14070
         1048576      32   68326   37064   150126   151728   10462    4128   44693    44975    15607
         1048576      64   74450   35379   147918   130771   19199    7468   54065    43630    22104
         1048576     128   71655   36125   137384   139080   27495   11507   65742    47471    49833
         1048576     256   78867   36030    51814    42656   41578   17829   77058    44247    54368
         1048576     512   68671   35378    52359    44516   52296   25944   80121    47270    57271
         1048576    1024   69495   37064    53579    56057   60121   31833   83420    46752    65267
         1048576    2048   69014   35589    52216    49919   69441   37169   92552    45746    68838
         1048576    4096   67338   36194    56068    56855   71569   39895   98462    49440    72496
         1048576    8192   65153   35345    54983    58602   78402   43841   98809    47334    74803
         1048576   16384   63579   34842    50888    54429   68070   42518   87534    45665    64967
Average:                   71228   36027    93345    89321   39159   20497   67985    46050    44395
Change :                      1%      1%     105%      94%     65%      6%     29%      -3%      50%

As the results show, read-side performance improves substantially with
zcache: pages evicted from the page cache are kept compressed in RAM,
so they only need to be decompressed instead of re-read from disk.

Signed-off-by: Bob Liu <bob.liu@...cle.com>
---
 mm/Kconfig  |   18 ++
 mm/Makefile |    1 +
 mm/zcache.c |  840 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 859 insertions(+)
 create mode 100644 mm/zcache.c

diff --git a/mm/Kconfig b/mm/Kconfig
index eec97f2..2b68103 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -507,3 +507,21 @@ config ZSWAP
 	  interactions don't cause any known issues on simple memory setups,
 	  they have not be fully explored on the large set of potential
 	  configurations and workloads that exist.
+
+config ZCACHE
+	bool "Compressed cache for swap and clean file pages (EXPERIMENTAL)"
+	depends on FRONTSWAP && CRYPTO && CLEANCACHE
+	select CRYPTO_LZO
+	select ZBUD
+	default n
+	help
+	  A compressed cache for swap and clean file pages.
+
+	  It can take pages that are in the process of being swapped out and
+	  attempt to compress them into a dynamically allocated RAM-based memory pool.
+	  This can result in a significant I/O reduction on the swap device and,
+	  in the case where decompressing from RAM is faster than swap device
+	  reads, can also improve workload performance.
+
+	  Besides swap pages, zcache can also compress clean file pages into a memory pool.
+	  This can reduce the refaults caused by reading those file pages back from disk.
diff --git a/mm/Makefile b/mm/Makefile
index f008033..a29232b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
 obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
+obj-$(CONFIG_ZCACHE)	+= zcache.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
diff --git a/mm/zcache.c b/mm/zcache.c
new file mode 100644
index 0000000..a2408e8
--- /dev/null
+++ b/mm/zcache.c
@@ -0,0 +1,840 @@
+/*
+ * zcache.c - zcache driver file
+ *
+ * The goal of zcache is to implement a generic memory compression layer.
+ * It is a backend for both frontswap and cleancache.
+ *
+ * Currently this file implements only the cleancache part.
+ * Concepts based on original zcache by Dan Magenheimer.
+ *
+ * Copyright (C) 2013  Bob Liu <bob.liu@...cle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+*/
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/cleancache.h>
+#include <linux/rbtree.h>
+#include <linux/radix-tree.h>
+#include <linux/swap.h>
+#include <linux/crypto.h>
+#include <linux/mempool.h>
+#include <linux/zbud.h>
+
+#include <linux/mm_types.h>
+#include <linux/page-flags.h>
+#include <linux/swapops.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+
+/* Enable/disable zcache (disabled by default) */
+static bool zcache_enabled __read_mostly;
+module_param_named(enabled, zcache_enabled, bool, 0);
+
+/* Disable the cleancache part of zcache */
+static bool zcache_nocleancache __read_mostly;
+module_param_named(nocleancache, zcache_nocleancache, bool, 0);
+
+/* Compressor to be used by zcache */
+#define ZCACHE_COMPRESSOR_DEFAULT "lzo"
+static char *zcache_compressor = ZCACHE_COMPRESSOR_DEFAULT;
+module_param_named(compressor, zcache_compressor, charp, 0);
+
+/* The maximum percentage of memory that the compressed pool can occupy */
+static unsigned int zcache_max_pool_percent = 10;
+module_param_named(max_pool_percent,
+			zcache_max_pool_percent, uint, 0644);
+
+/* zcache cleancache part statistics */
+static u64 zcache_cleancache_pool_pages;
+static u64 zcache_cleancache_pool_limit_hit;
+static u64 zcache_cleancache_written_back_pages;
+static u64 zcache_cleancache_dup_entry;
+static u64 zcache_cleancache_reclaim_fail;
+static u64 zcache_cleancache_zbud_alloc_fail;
+static atomic_t zcache_cleancache_stored_pages = ATOMIC_INIT(0);
+
+struct zcache_cleancache_meta {
+	int ra_index;
+	int length;	/* compressed page size */
+};
+
+#define MAX_ZCACHE_POOLS 32 /* arbitrary */
+
+/* Red-Black tree node. Maps inode to its page-tree */
+struct zcache_rb_entry {
+	int rb_index;
+	struct kref refcount;
+
+	struct radix_tree_root ra_root; /* maps inode index to page */
+	spinlock_t ra_lock;		/* protects radix tree */
+	struct rb_node rb_node;
+};
+
+/* One zcache pool per (cleancache aware) filesystem mount instance */
+struct zcache_pool {
+	struct rb_root rb_root;		/* maps inode number to page tree */
+	rwlock_t rb_lock;		/* protects inode_tree */
+	struct zbud_pool *pool;         /* zbud pool used */
+};
+
+/* Manage all zcache pools */
+struct _zcache {
+	struct zcache_pool *pools[MAX_ZCACHE_POOLS];
+	u32 num_pools;		/* current no. of zcache pools */
+	spinlock_t pool_lock;	/* protects pools[] and num_pools */
+};
+struct _zcache zcache;
+
+static struct kmem_cache *zcache_cleancache_entry_cache;
+
+/*********************************
+* compression functions
+**********************************/
+/* per-cpu compression transforms */
+static struct crypto_comp * __percpu *zcache_comp_pcpu_tfms;
+
+enum comp_op {
+	ZCACHE_COMPOP_COMPRESS,
+	ZCACHE_COMPOP_DECOMPRESS
+};
+
+static int zcache_comp_op(enum comp_op op, const u8 *src, unsigned int slen,
+				u8 *dst, unsigned int *dlen)
+{
+	struct crypto_comp *tfm;
+	int ret;
+
+	tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, get_cpu());
+	switch (op) {
+	case ZCACHE_COMPOP_COMPRESS:
+		ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
+		break;
+	case ZCACHE_COMPOP_DECOMPRESS:
+		ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	put_cpu();
+	return ret;
+}
+
+static int __init zcache_comp_init(void)
+{
+	if (!crypto_has_comp(zcache_compressor, 0, 0)) {
+		pr_info("%s compressor not available\n", zcache_compressor);
+		/* fall back to default compressor */
+		zcache_compressor = ZCACHE_COMPRESSOR_DEFAULT;
+		if (!crypto_has_comp(zcache_compressor, 0, 0))
+			/* can't even load the default compressor */
+			return -ENODEV;
+	}
+	pr_info("using %s compressor\n", zcache_compressor);
+
+	/* alloc percpu transforms */
+	zcache_comp_pcpu_tfms = alloc_percpu(struct crypto_comp *);
+	if (!zcache_comp_pcpu_tfms)
+		return -ENOMEM;
+	return 0;
+}
+
+static void zcache_comp_exit(void)
+{
+	/* free percpu transforms */
+	if (zcache_comp_pcpu_tfms)
+		free_percpu(zcache_comp_pcpu_tfms);
+}
+
+/*********************************
+* per-cpu code
+**********************************/
+static DEFINE_PER_CPU(u8 *, zcache_dstmem);
+
+static int __zcache_cpu_notifier(unsigned long action, unsigned long cpu)
+{
+	struct crypto_comp *tfm;
+	u8 *dst;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		tfm = crypto_alloc_comp(zcache_compressor, 0, 0);
+		if (IS_ERR(tfm)) {
+			pr_err("can't allocate compressor transform\n");
+			return NOTIFY_BAD;
+		}
+		*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = tfm;
+		dst = kmalloc(PAGE_SIZE * 2, GFP_KERNEL);
+		if (!dst) {
+			pr_err("can't allocate compressor buffer\n");
+			crypto_free_comp(tfm);
+			*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = NULL;
+			return NOTIFY_BAD;
+		}
+		per_cpu(zcache_dstmem, cpu) = dst;
+		break;
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, cpu);
+		if (tfm) {
+			crypto_free_comp(tfm);
+			*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = NULL;
+		}
+		dst = per_cpu(zcache_dstmem, cpu);
+		kfree(dst);
+		per_cpu(zcache_dstmem, cpu) = NULL;
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static int zcache_cpu_notifier(struct notifier_block *nb,
+				unsigned long action, void *pcpu)
+{
+	unsigned long cpu = (unsigned long)pcpu;
+	return __zcache_cpu_notifier(action, cpu);
+}
+
+static struct notifier_block zcache_cpu_notifier_block = {
+	.notifier_call = zcache_cpu_notifier
+};
+
+static int zcache_cpu_init(void)
+{
+	unsigned long cpu;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		if (__zcache_cpu_notifier(CPU_UP_PREPARE, cpu) != NOTIFY_OK)
+			goto cleanup;
+	register_cpu_notifier(&zcache_cpu_notifier_block);
+	put_online_cpus();
+	return 0;
+
+cleanup:
+	for_each_online_cpu(cpu)
+		__zcache_cpu_notifier(CPU_UP_CANCELED, cpu);
+	put_online_cpus();
+	return -ENOMEM;
+}
+
+/*********************************
+* helpers
+**********************************/
+static bool zcache_is_full(void)
+{
+	return (totalram_pages * zcache_max_pool_percent / 100 <
+		zcache_cleancache_pool_pages);
+}
+
+static int zcache_cleancache_entry_cache_create(void)
+{
+	zcache_cleancache_entry_cache = KMEM_CACHE(zcache_rb_entry, 0);
+	return (zcache_cleancache_entry_cache == NULL);
+}
+static void zcache_cleancache_entry_cache_destory(void)
+{
+	kmem_cache_destroy(zcache_cleancache_entry_cache);
+}
+
+static struct zcache_rb_entry *zcache_find_rb_entry(struct rb_root *root,
+		int index, struct rb_node **rb_parent, struct rb_node ***rb_link)
+{
+	struct zcache_rb_entry *entry;
+	struct rb_node **__rb_link, *__rb_parent, *rb_prev;
+
+	__rb_link = &root->rb_node;
+	rb_prev = __rb_parent = NULL;
+
+	while (*__rb_link) {
+		__rb_parent = *__rb_link;
+		entry = rb_entry(__rb_parent, struct zcache_rb_entry, rb_node);
+		if (entry->rb_index > index)
+			__rb_link = &__rb_parent->rb_left;
+		else if (entry->rb_index < index) {
+			rb_prev = __rb_parent;
+			__rb_link = &__rb_parent->rb_right;
+		} else
+			return entry;
+	}
+
+	if (rb_parent)
+		*rb_parent = __rb_parent;
+	if (rb_link)
+		*rb_link = __rb_link;
+	return NULL;
+}
+
+static struct zcache_rb_entry *zcache_find_get_rb_entry(struct zcache_pool *zpool,
+					int rb_index)
+{
+	unsigned long flags;
+	struct zcache_rb_entry *rb_entry;
+
+	read_lock_irqsave(&zpool->rb_lock, flags);
+	rb_entry = zcache_find_rb_entry(&zpool->rb_root, rb_index, NULL, NULL);
+	if (rb_entry)
+		kref_get(&rb_entry->refcount);
+	read_unlock_irqrestore(&zpool->rb_lock, flags);
+	return rb_entry;
+}
+
+/*
+ * kref_put callback for zcache rb_entry.
+ *
+ * The entry must have been isolated from rbtree already.
+ */
+static void zcache_rb_entry_release(struct kref *kref)
+{
+	struct zcache_rb_entry *rb_entry;
+
+	rb_entry = container_of(kref, struct zcache_rb_entry, refcount);
+	BUG_ON(rb_entry->ra_root.rnode);
+	kmem_cache_free(zcache_cleancache_entry_cache, rb_entry);
+}
+
+/*
+ * Called under zcache_rb_entry->ra_lock
+ */
+static int zcache_rb_entry_is_empty(struct zcache_rb_entry *rb_entry)
+{
+	return rb_entry->ra_root.rnode == NULL;
+}
+
+/* Remove rb_entry from rbtree */
+static void zcache_rb_entry_isolate(struct zcache_pool *zpool,
+		struct zcache_rb_entry *rb_entry, bool hold_rblock)
+{
+	unsigned long flags;
+
+	if (!hold_rblock)
+		write_lock_irqsave(&zpool->rb_lock, flags);
+	/*
+	 * Someone may have taken a reference on this node before we
+	 * could acquire the write lock above. We only remove the
+	 * node from its inode_tree when the caller and the
+	 * corresponding inode_tree hold the only references to it.
+	 * This ensures that a racing zcache put will not end up
+	 * adding a page to an isolated node and thereby losing
+	 * that memory.
+	 */
+	if (atomic_read(&rb_entry->refcount.refcount) == 2) {
+		rb_erase(&rb_entry->rb_node, &zpool->rb_root);
+		RB_CLEAR_NODE(&rb_entry->rb_node);
+		kref_put(&rb_entry->refcount, zcache_rb_entry_release);
+	}
+	if (!hold_rblock)
+		write_unlock_irqrestore(&zpool->rb_lock, flags);
+}
+
+
+static int zcache_store_handle(struct zcache_pool *zpool,
+		unsigned long handle, int rb_index, int ra_index)
+{
+	unsigned long flags;
+	struct zcache_rb_entry *rb_entry, *tmp;
+	struct rb_node **link = NULL, *parent = NULL;
+	int ret;
+	void *dup_handlep;
+
+	rb_entry = zcache_find_get_rb_entry(zpool, rb_index);
+	if (!rb_entry) {
+		/* alloc new rb_entry */
+		rb_entry = kmem_cache_alloc(zcache_cleancache_entry_cache, GFP_KERNEL);
+		if (!rb_entry)
+			return -ENOMEM;
+
+		INIT_RADIX_TREE(&rb_entry->ra_root, GFP_ATOMIC|__GFP_NOWARN);
+		spin_lock_init(&rb_entry->ra_lock);
+		rb_entry->rb_index = rb_index;
+		kref_init(&rb_entry->refcount);
+		RB_CLEAR_NODE(&rb_entry->rb_node);
+
+		/* add new entry to rb tree */
+		write_lock_irqsave(&zpool->rb_lock, flags);
+
+		tmp = zcache_find_rb_entry(&zpool->rb_root, rb_index, &parent, &link);
+		if (tmp) {
+			/* somebody else allocated new entry */
+			kmem_cache_free(zcache_cleancache_entry_cache, rb_entry);
+			rb_entry = tmp;
+		} else {
+			rb_link_node(&rb_entry->rb_node, parent, link);
+			rb_insert_color(&rb_entry->rb_node, &zpool->rb_root);
+		}
+
+		kref_get(&rb_entry->refcount);
+		write_unlock_irqrestore(&zpool->rb_lock, flags);
+	}
+
+	/* At this point we hold rb_entry with an extra reference taken */
+	spin_lock_irqsave(&rb_entry->ra_lock, flags);
+	dup_handlep = radix_tree_delete(&rb_entry->ra_root, ra_index);
+	if (unlikely(dup_handlep)) {
+		WARN(1, "duplicated entry, will be replaced!\n");
+		zbud_free(zpool->pool, (unsigned long)dup_handlep);
+		atomic_dec(&zcache_cleancache_stored_pages);
+		zcache_cleancache_pool_pages = zbud_get_pool_size(zpool->pool);
+		zcache_cleancache_dup_entry++;
+	}
+	ret = radix_tree_insert(&rb_entry->ra_root, ra_index, (void *)handle);
+
+	if (unlikely(ret))
+		if (zcache_rb_entry_is_empty(rb_entry))
+			zcache_rb_entry_isolate(zpool, rb_entry, 0);
+	spin_unlock_irqrestore(&rb_entry->ra_lock, flags);
+
+	kref_put(&rb_entry->refcount, zcache_rb_entry_release);
+	return ret;
+}
+
+/* Load the handle, and delete it */
+static unsigned long *zcache_load_delete_handle(struct zcache_pool *zpool, int rb_index,
+				int ra_index)
+{
+	struct zcache_rb_entry *rb_entry;
+	void *handlep = NULL;
+	unsigned long flags;
+
+	rb_entry = zcache_find_get_rb_entry(zpool, rb_index);
+	if (!rb_entry)
+		goto out;
+
+	BUG_ON(rb_entry->rb_index != rb_index);
+
+	spin_lock_irqsave(&rb_entry->ra_lock, flags);
+	handlep = radix_tree_delete(&rb_entry->ra_root, ra_index);
+	if (zcache_rb_entry_is_empty(rb_entry))
+		/* If there are no more nodes in rb_entry's radix tree,
+		 * remove rb_entry from the rbtree and drop its refcount.
+		 */
+		zcache_rb_entry_isolate(zpool, rb_entry, 0);
+	spin_unlock_irqrestore(&rb_entry->ra_lock, flags);
+
+	/* rb_entry may have been removed from the rbtree; drop our reference */
+	kref_put(&rb_entry->refcount, zcache_rb_entry_release);
+out:
+	return handlep;
+}
+
+static void zcache_cleancache_store_page(int pool_id, struct cleancache_filekey key,
+			pgoff_t index, struct page *page)
+{
+	unsigned int dlen = PAGE_SIZE, len;
+	unsigned long handle;
+	char *buf;
+	u8 *src, *dst;
+	struct zcache_cleancache_meta *zmeta;
+	int ret;
+
+	struct zcache_pool *zpool = zcache.pools[pool_id];
+
+	/* reclaim space if needed */
+	if (zcache_is_full()) {
+		/* Reclaim will be implemented in a following version */
+		zcache_cleancache_pool_limit_hit++;
+		return;
+	}
+
+	/* compress */
+	dst = get_cpu_var(zcache_dstmem);
+	src = kmap_atomic(page);
+	ret = zcache_comp_op(ZCACHE_COMPOP_COMPRESS, src, PAGE_SIZE, dst, &dlen);
+	kunmap_atomic(src);
+	if (ret) {
+		pr_err("zcache_cleancache compress error ret %d\n", ret);
+		put_cpu_var(zcache_dstmem);
+		return;
+	}
+
+	/* store handle with meta data */
+	len = dlen + sizeof(struct zcache_cleancache_meta);
+	ret = zbud_alloc(zpool->pool, len, __GFP_NORETRY | __GFP_NOWARN, &handle);
+	if (ret) {
+		zcache_cleancache_zbud_alloc_fail++;
+		put_cpu_var(zcache_dstmem);
+		return;
+	}
+
+	zmeta = zbud_map(zpool->pool, handle);
+	zmeta->ra_index = index;
+	zmeta->length = dlen;
+	buf = (u8 *)(zmeta + 1);
+	memcpy(buf, dst, dlen);
+	zbud_unmap(zpool->pool, handle);
+	put_cpu_var(zcache_dstmem);
+
+	/* populate entry */
+	ret = zcache_store_handle(zpool, handle, key.u.ino, index);
+	if (ret) {
+		pr_err("%s: store handle error %d\n", __func__, ret);
+		zbud_free(zpool->pool, handle);
+	}
+
+	/* update stats */
+	atomic_inc(&zcache_cleancache_stored_pages);
+	zcache_cleancache_pool_pages = zbud_get_pool_size(zpool->pool);
+	return;
+}
+
+static int zcache_cleancache_load_page(int pool_id, struct cleancache_filekey key,
+			pgoff_t index, struct page *page)
+{
+	struct zcache_pool *zpool = zcache.pools[pool_id];
+	u8 *src, *dst;
+	unsigned int dlen;
+	int ret;
+	unsigned long *handlep;
+	struct zcache_cleancache_meta *zmeta;
+
+	handlep = zcache_load_delete_handle(zpool, key.u.ino, index);
+	if (!handlep)
+		return -1;
+
+	zmeta = (struct zcache_cleancache_meta *)zbud_map(zpool->pool, (unsigned long)handlep);
+	src = (u8 *)(zmeta + 1);
+
+	/* decompress */
+	dlen = PAGE_SIZE;
+	dst = kmap_atomic(page);
+	ret = zcache_comp_op(ZCACHE_COMPOP_DECOMPRESS, src, zmeta->length, dst, &dlen);
+	kunmap_atomic(dst);
+	zbud_unmap(zpool->pool, (unsigned long)handlep);
+	zbud_free(zpool->pool, (unsigned long)handlep);
+
+	WARN_ON(ret);	/* decompress err, will fetch from real disk */
+	/* update stats */
+	atomic_dec(&zcache_cleancache_stored_pages);
+	zcache_cleancache_pool_pages = zbud_get_pool_size(zpool->pool);
+	return ret;
+}
+
+static void zcache_cleancache_flush_page(int pool_id, struct cleancache_filekey key,
+			pgoff_t index)
+{
+	struct zcache_pool *zpool = zcache.pools[pool_id];
+	unsigned long *handlep = NULL;
+
+	handlep = zcache_load_delete_handle(zpool, key.u.ino, index);
+	if (handlep) {
+		zbud_free(zpool->pool, (unsigned long)handlep);
+		atomic_dec(&zcache_cleancache_stored_pages);
+		zcache_cleancache_pool_pages = zbud_get_pool_size(zpool->pool);
+	}
+}
+
+#define FREE_BATCH 16
+static void zcache_cleancache_flush_ratree(struct zcache_pool *zpool,
+				struct zcache_rb_entry *entry)
+{
+	int count, i;
+	unsigned long index = 0;
+
+	do {
+		struct zcache_cleancache_meta *handles[FREE_BATCH];
+
+		count = radix_tree_gang_lookup(&entry->ra_root,
+				(void **)handles, index, FREE_BATCH);
+
+		for (i = 0; i < count; i++) {
+			index = handles[i]->ra_index;
+			radix_tree_delete(&entry->ra_root, index);
+			zbud_free(zpool->pool, (unsigned long)handles[i]);
+			atomic_dec(&zcache_cleancache_stored_pages);
+			zcache_cleancache_pool_pages = zbud_get_pool_size(zpool->pool);
+		}
+
+		index++;
+	} while (count == FREE_BATCH);
+}
+
+static void zcache_cleancache_flush_inode(int pool_id,
+					struct cleancache_filekey key)
+{
+	struct zcache_rb_entry *rb_entry;
+	unsigned long flags1, flags2;
+	struct zcache_pool *zpool = zcache.pools[pool_id];
+
+	/* prevent new pages from being added to this inode */
+	write_lock_irqsave(&zpool->rb_lock, flags1);
+	rb_entry = zcache_find_rb_entry(&zpool->rb_root, key.u.ino, NULL, NULL);
+	if (!rb_entry) {
+		write_unlock_irqrestore(&zpool->rb_lock, flags1);
+		return;
+	}
+
+	kref_get(&rb_entry->refcount);
+
+	spin_lock_irqsave(&rb_entry->ra_lock, flags2);
+	zcache_cleancache_flush_ratree(zpool, rb_entry);
+	if (zcache_rb_entry_is_empty(rb_entry))
+		zcache_rb_entry_isolate(zpool, rb_entry, 1);
+	spin_unlock_irqrestore(&rb_entry->ra_lock, flags2);
+
+	write_unlock_irqrestore(&zpool->rb_lock, flags1);
+	kref_put(&rb_entry->refcount, zcache_rb_entry_release);
+}
+
+static void zcache_destroy_pool(struct zcache_pool *zpool);
+static void zcache_cleancache_flush_fs(int pool_id)
+{
+	struct zcache_rb_entry *entry = NULL;
+	struct rb_node *node;
+	unsigned long flags1, flags2;
+	struct zcache_pool *zpool = zcache.pools[pool_id];
+
+	if (!zpool)
+		return;
+
+	/* prevent new pages from being added while tearing down the pool */
+	write_lock_irqsave(&zpool->rb_lock, flags1);
+
+	node = rb_first(&zpool->rb_root);
+	while (node) {
+		entry = rb_entry(node, struct zcache_rb_entry, rb_node);
+		node = rb_next(node);
+		if (entry) {
+			kref_get(&entry->refcount);
+			spin_lock_irqsave(&entry->ra_lock, flags2);
+			zcache_cleancache_flush_ratree(zpool, entry);
+			if (zcache_rb_entry_is_empty(entry))
+				zcache_rb_entry_isolate(zpool, entry, 1);
+			spin_unlock_irqrestore(&entry->ra_lock, flags2);
+			kref_put(&entry->refcount, zcache_rb_entry_release);
+		}
+	}
+
+	write_unlock_irqrestore(&zpool->rb_lock, flags1);
+
+	zcache_destroy_pool(zpool);
+}
+
+static int zcache_cleancache_evict_entry(struct zbud_pool *pool,
+		unsigned long handle)
+{
+	return -1;
+}
+
+static struct zbud_ops zcache_cleancache_zbud_ops = {
+	.evict = zcache_cleancache_evict_entry
+};
+
+static void zcache_destroy_pool(struct zcache_pool *zpool)
+{
+	int i;
+
+	if (!zpool)
+		return;
+
+	spin_lock(&zcache.pool_lock);
+	zcache.num_pools--;
+	for (i = 0; i < MAX_ZCACHE_POOLS; i++)
+		if (zcache.pools[i] == zpool)
+			break;
+	zcache.pools[i] = NULL;
+	spin_unlock(&zcache.pool_lock);
+
+	if (!RB_EMPTY_ROOT(&zpool->rb_root)) {
+		WARN(1, "Memory leak detected. Freeing non-empty pool!\n");
+	}
+
+	zbud_destroy_pool(zpool->pool);
+	kfree(zpool);
+}
+
+/* return pool id */
+static int zcache_create_pool(void)
+{
+	int ret;
+	struct zcache_pool *zpool;
+
+	zpool = kzalloc(sizeof(*zpool), GFP_KERNEL);
+	if (!zpool) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zpool->pool = zbud_create_pool(GFP_KERNEL, &zcache_cleancache_zbud_ops);
+	if (!zpool->pool) {
+		kfree(zpool);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock(&zcache.pool_lock);
+	if (zcache.num_pools == MAX_ZCACHE_POOLS) {
+		pr_info("Cannot create new pool (limit: %u)\n",
+					MAX_ZCACHE_POOLS);
+		zbud_destroy_pool(zpool->pool);
+		kfree(zpool);
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	rwlock_init(&zpool->rb_lock);
+	zpool->rb_root = RB_ROOT;
+
+	/* Add to pool list */
+	for (ret = 0; ret < MAX_ZCACHE_POOLS; ret++)
+		if (!zcache.pools[ret])
+			break;
+	zcache.pools[ret] = zpool;
+	zcache.num_pools++;
+	pr_info("New pool created id:%d\n", ret);
+
+out_unlock:
+	spin_unlock(&zcache.pool_lock);
+out:
+	return ret;
+}
+
+static int zcache_cleancache_init_fs(size_t pagesize)
+{
+	int ret;
+
+	if (pagesize != PAGE_SIZE) {
+		pr_info("Unsupported page size: %zu\n", pagesize);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = zcache_create_pool();
+	if (ret < 0) {
+		pr_info("Failed to create new pool\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+static int zcache_cleancache_init_shared_fs(char *uuid, size_t pagesize)
+{
+	/* shared pools are unsupported and map to private */
+	return zcache_cleancache_init_fs(pagesize);
+}
+
+static struct cleancache_ops zcache_cleancache_ops = {
+	.put_page = zcache_cleancache_store_page,
+	.get_page = zcache_cleancache_load_page,
+	.invalidate_page = zcache_cleancache_flush_page,
+	.invalidate_inode = zcache_cleancache_flush_inode,
+	.invalidate_fs = zcache_cleancache_flush_fs,
+	.init_shared_fs = zcache_cleancache_init_shared_fs,
+	.init_fs = zcache_cleancache_init_fs
+};
+
+/*********************************
+* debugfs functions
+**********************************/
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+
+static struct dentry *zcache_cleancache_debugfs_root;
+
+static int __init zcache_debugfs_init(void)
+{
+	if (!debugfs_initialized())
+		return -ENODEV;
+
+	if (!zcache_nocleancache) {
+		zcache_cleancache_debugfs_root = debugfs_create_dir("zcache_cleancache", NULL);
+		if (!zcache_cleancache_debugfs_root)
+			return -ENOMEM;
+
+		debugfs_create_u64("pool_limit_hit", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_pool_limit_hit);
+		debugfs_create_u64("reclaim_fail", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_reclaim_fail);
+		debugfs_create_u64("reject_alloc_fail", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_zbud_alloc_fail);
+		debugfs_create_u64("written_back_pages", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_written_back_pages);
+		debugfs_create_u64("duplicate_entry", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_dup_entry);
+		debugfs_create_u64("pool_pages", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_pool_pages);
+		debugfs_create_atomic_t("stored_pages", S_IRUGO,
+				zcache_cleancache_debugfs_root, &zcache_cleancache_stored_pages);
+	}
+	return 0;
+}
+
+static void __exit zcache_debugfs_exit(void)
+{
+	debugfs_remove_recursive(zcache_cleancache_debugfs_root);
+}
+#else
+static int __init zcache_debugfs_init(void)
+{
+	return 0;
+}
+static void __exit zcache_debugfs_exit(void)
+{
+}
+#endif
+
+/*********************************
+* module init and exit
+**********************************/
+static int __init init_zcache(void)
+{
+	if (!zcache_enabled)
+		return 0;
+
+	pr_info("loading zcache..\n");
+	if (!zcache_nocleancache)
+		if (zcache_cleancache_entry_cache_create()) {
+			pr_err("entry cache creation failed\n");
+			goto error;
+		}
+
+	if (zcache_comp_init()) {
+		pr_err("compressor initialization failed\n");
+		goto compfail;
+	}
+	if (zcache_cpu_init()) {
+		pr_err("per-cpu initialization failed\n");
+		goto pcpufail;
+	}
+
+	spin_lock_init(&zcache.pool_lock);
+	if (!zcache_nocleancache)
+		cleancache_register_ops(&zcache_cleancache_ops);
+
+	if (zcache_debugfs_init())
+		pr_warn("debugfs initialization failed\n");
+	return 0;
+pcpufail:
+	zcache_comp_exit();
+compfail:
+	zcache_cleancache_entry_cache_destory();
+error:
+	return -ENOMEM;
+}
+/* must be late so crypto has time to come up */
+late_initcall(init_zcache);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Bob Liu <bob.liu@...cle.com>");
+MODULE_DESCRIPTION("Compressed cache for clean file pages");
-- 
1.7.10.4

