linux-kernel - Re: [PATCH v2 2/4] zram: implement deduplication in zram

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20170417022250.GC1351@js1304-desktop>
Date:   Mon, 17 Apr 2017 11:22:52 +0900
From:   Joonsoo Kim <js1304@...il.com>
To:     Minchan Kim <minchan@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        linux-kernel@...r.kernel.org, kernel-team@....com
Subject: Re: [PATCH v2 2/4] zram: implement deduplication in zram

On Mon, Apr 17, 2017 at 10:38:16AM +0900, Minchan Kim wrote:
> Hi Joonsoo,
> 
> I reviewed this patch and overall, looks great! Thanks.

Thanks!

> However, as you know, recently, zram had lots of clean up so
> this patchset should be rebased on it massively.
> Sorry for the inconvenience.

No problem.

> 
> And there are some minor in below. I hope you handle them in
> next submit, please.

Okay.

> On Thu, Mar 30, 2017 at 02:38:07PM +0900, js1304@...il.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@....com>
> > 
> > This patch implements deduplication feature in zram. The purpose
> > of this work is naturally to save amount of memory usage by zram.
> > 
> > Android is one of the biggest users to use zram as swap and it's
> > really important to save amount of memory usage. There is a paper
> > that reports that duplication ratio of Android's memory content is
> > rather high [1]. And, there is a similar work on zswap that also
> > reports that experiments has shown that around 10-15% of pages
> > stored in zswp are duplicates and deduplicate them provides some
> > benefits [2].
> > 
> > Also, there is a different kind of workload that uses zram as blockdev
> > and store build outputs into it to reduce wear-out problem of real
> > blockdev. In this workload, deduplication hit is very high due to
> > temporary files and intermediate object files. Detailed analysis is
> > on the bottom of this description.
> > 
> > Anyway, if we can detect duplicated content and avoid to store duplicated
> > content at different memory space, we can save memory. This patch
> > tries to do that.
> > 
> > Implementation is almost simple and intuitive but I should note
> > one thing about implementation detail.
> > 
> > To check duplication, this patch uses checksum of the page and
> > collision of this checksum could be possible. There would be
> > many choices to handle this situation but this patch chooses
> > to allow entry with duplicated checksum to be added to the hash,
> > but, not to compare all entries with duplicated checksum
> > when checking duplication. I guess that checksum collision is quite
> > rare event and we don't need to pay any attention to such a case.
> > Therefore, I decided the most simplest way to implement the feature.
> > If there is a different opinion, I can accept and go that way.
> > 
> > Following is the result of this patch.
> > 
> > Test result #1 (Swap):
> > Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
> > 
> > orig_data_size: 145297408
> > compr_data_size: 32408125
> > mem_used_total: 32276480
> > dup_data_size: 3188134
> > meta_data_size: 1444272
> > 
> > Last two metrics added to mm_stat are related to this work.
> > First one, dup_data_size, is amount of saved memory by avoiding
> > to store duplicated page. Later one, meta_data_size, is the amount of
> > data structure to support deduplication. If dup > meta, we can judge
> > that the patch improves memory usage.
> > 
> > In Adnroid, we can save 5% of memory usage by this work.
> > 
> > Test result #2 (Blockdev):
> > build the kernel and store output to ext4 FS on zram
> > 
> > <no-dedup>
> > Elapsed time: 249 s
> > mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
> > 
> > <dedup>
> > Elapsed time: 250 s
> > mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038  3945792
> > 
> > There is no performance degradation and save 23% memory.
> > 
> > Test result #3 (Blockdev):
> > copy android build output dir(out/host) to ext4 FS on zram
> > 
> > <no-dedup>
> > Elapsed time: out/host: 88 s
> > mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
> > 
> > <dedup>
> > Elapsed time: out/host: 100 s
> > mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
> > 
> > It shows performance degradation roughly 13% and save 24% memory. Maybe,
> > it is due to overhead of calculating checksum and comparison.
> > 
> > Test result #4 (Blockdev):
> > copy android build output dir(out/target/common) to ext4 FS on zram
> > 
> > <no-dedup>
> > Elapsed time: out/host: 203 s
> > mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
> > 
> > <dedup>
> > Elapsed time: out/host: 201 s
> > mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
> > 
> > Memory is saved by 42% and performance is the same. Even if there is overhead
> > of calculating checksum and comparison, large hit ratio compensate it since
> > hit leads to less compression attempt.
> > 
> > I checked the detailed reason of savings on kernel build workload and
> > there are some cases that deduplication happens.
> > 
> > 1) *.cmd
> > Build command is usually similar in one directory so content of these file
> > are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd
> > and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file,
> > respectively.
> > 
> > 2) intermediate object files
> > built-in.o and temporary object file have the similar contents. More than
> > 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o.
> > 
> > 3) vmlinux
> > .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin
> > have the similar contents.
> > 
> > Android test has similar case that some of object files(.class
> > and .so) are similar with another ones.
> > (./host/linux-x86/lib/libartd.so and
> > ./host/linux-x86-lib/libartd-comiler.so)
> > 
> > Anyway, benefit seems to be largely dependent on the workload so
> > following patch will make this feature optional. However, this feature
> > can help some usecases so is deserved to be merged.
> > 
> > [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> > dl.acm.org/citation.cfm?id=2797023
> > [2]: zswap: Optimize compressed pool memory utilization,
> > lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@...as3p2
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
> > ---
> >  Documentation/blockdev/zram.txt |   2 +
> >  drivers/block/zram/Makefile     |   2 +-
> >  drivers/block/zram/zram_dedup.c | 222 ++++++++++++++++++++++++++++++++++++++++
> >  drivers/block/zram/zram_dedup.h |  25 +++++
> >  drivers/block/zram/zram_drv.c   |  62 ++++++++---
> >  drivers/block/zram/zram_drv.h   |  18 ++++
> >  6 files changed, 314 insertions(+), 17 deletions(-)
> >  create mode 100644 drivers/block/zram/zram_dedup.c
> >  create mode 100644 drivers/block/zram/zram_dedup.h
> > 
> > diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> > index 4fced8a..2cdc303 100644
> > --- a/Documentation/blockdev/zram.txt
> > +++ b/Documentation/blockdev/zram.txt
> > @@ -217,6 +217,8 @@ line of text and contains the following stats separated by whitespace:
> >   same_pages       the number of same element filled pages written to this disk.
> >                    No memory is allocated for such pages.
> >   pages_compacted  the number of pages freed during compaction
> > + dup_data_size	  deduplicated data size
> > + meta_data_size	  the amount of metadata allocated for deduplication feature
> >  
> >  9) Deactivate:
> >  	swapoff /dev/zram0
> > diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile
> > index 9e2b79e..29cb008 100644
> > --- a/drivers/block/zram/Makefile
> > +++ b/drivers/block/zram/Makefile
> > @@ -1,3 +1,3 @@
> > -zram-y	:=	zcomp.o zram_drv.o
> > +zram-y	:=	zcomp.o zram_drv.o zram_dedup.o
> >  
> >  obj-$(CONFIG_ZRAM)	+=	zram.o
> > diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
> > new file mode 100644
> > index 0000000..d313fc8
> > --- /dev/null
> > +++ b/drivers/block/zram/zram_dedup.c
> > @@ -0,0 +1,222 @@
> > +/*
> > + * Copyright (C) 2017 Joonsoo Kim.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version
> > + * 2 of the License, or (at your option) any later version.
> > + */
> > +
> > +#include <linux/vmalloc.h>
> > +#include <linux/jhash.h>
> > +
> > +#include "zram_drv.h"
> > +
> > +/* One slot will contain 128 pages theoretically */
> > +#define ZRAM_HASH_SHIFT		7
> > +#define ZRAM_HASH_SIZE_MIN	(1 << 10)
> > +#define ZRAM_HASH_SIZE_MAX	(1 << 31)
> > +
> > +u64 zram_dedup_dup_size(struct zram *zram)
> > +{
> > +	return (u64)atomic64_read(&zram->stats.dup_data_size);
> > +}
> > +
> > +u64 zram_dedup_meta_size(struct zram *zram)
> > +{
> > +	return (u64)atomic64_read(&zram->stats.meta_data_size);
> > +}
> > +
> > +static u32 zram_dedup_checksum(unsigned char *mem)
> > +{
> > +	return jhash(mem, PAGE_SIZE, 0);
> > +}
> > +
> > +void zram_dedup_insert(struct zram *zram, struct zram_entry *new,
> > +				u32 checksum)
> > +{
> > +	struct zram_meta *meta = zram->meta;
> > +	struct zram_hash *hash;
> > +	struct rb_root *rb_root;
> > +	struct rb_node **rb_node, *parent = NULL;
> > +	struct zram_entry *entry;
> > +
> > +	new->checksum = checksum;
> > +	hash = &meta->hash[checksum % meta->hash_size];
> > +	rb_root = &hash->rb_root;
> > +
> > +	spin_lock(&hash->lock);
> > +	rb_node = &rb_root->rb_node;
> > +	while (*rb_node) {
> > +		parent = *rb_node;
> > +		entry = rb_entry(parent, struct zram_entry, rb_node);
> > +		if (checksum < entry->checksum)
> > +			rb_node = &parent->rb_left;
> > +		else if (checksum > entry->checksum)
> > +			rb_node = &parent->rb_right;
> > +		else
> > +			rb_node = &parent->rb_left;
> > +	}
> > +
> > +	rb_link_node(&new->rb_node, parent, rb_node);
> > +	rb_insert_color(&new->rb_node, rb_root);
> > +	spin_unlock(&hash->lock);
> > +}
> > +
> > +static bool zram_dedup_match(struct zram *zram, struct zram_entry *entry,
> > +				unsigned char *mem)
> > +{
> > +	bool match = false;
> > +	unsigned char *cmem;
> > +	struct zram_meta *meta = zram->meta;
> > +	struct zcomp_strm *zstrm;
> > +
> > +	cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> > +	if (entry->len == PAGE_SIZE) {
> > +		match = !memcmp(mem, cmem, PAGE_SIZE);
> > +	} else {
> > +		zstrm = zcomp_stream_get(zram->comp);
> > +		if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
> > +			match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> > +		zcomp_stream_put(zram->comp);
> > +	}
> > +	zs_unmap_object(meta->mem_pool, entry->handle);
> > +
> > +	return match;
> > +}
> > +
> > +static unsigned long zram_dedup_put(struct zram *zram, struct zram_meta *meta,
> > +				struct zram_entry *entry)
> > +{
> > +	struct zram_hash *hash;
> > +	u32 checksum;
> > +	unsigned long refcount;
> > +
> > +	checksum = entry->checksum;
> > +	hash = &meta->hash[checksum % meta->hash_size];
> > +
> > +	spin_lock(&hash->lock);
> > +	entry->refcount--;
> > +	refcount = entry->refcount;
> > +	if (!entry->refcount) {
> > +		rb_erase(&entry->rb_node, &hash->rb_root);
> > +		RB_CLEAR_NODE(&entry->rb_node);
> 
> What's the purpose of this RB_CLEAR_NODE?
> I think we don't need it.

I thought that it is needed to initialize the rb entry but with
re-check it seems to be unnecessary. I will remove it.

> > +	} else if (zram) {
> > +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> > +	}
> > +	spin_unlock(&hash->lock);
> > +
> > +	return refcount;
> > +}
> > +
> > +static struct zram_entry *zram_dedup_get(struct zram *zram,
> > +				unsigned char *mem, u32 checksum)
> > +{
> > +	struct zram_meta *meta = zram->meta;
> > +	struct zram_hash *hash;
> > +	struct zram_entry *entry;
> > +	struct rb_node *rb_node;
> > +
> > +	hash = &meta->hash[checksum % meta->hash_size];
> > +
> > +	spin_lock(&hash->lock);
> > +	rb_node = hash->rb_root.rb_node;
> > +	while (rb_node) {
> > +		entry = rb_entry(rb_node, struct zram_entry, rb_node);
> > +		if (checksum == entry->checksum) {
> > +			entry->refcount++;
> > +			atomic64_add(entry->len, &zram->stats.dup_data_size);
> > +			spin_unlock(&hash->lock);
> > +
> > +			if (zram_dedup_match(zram, entry, mem))
> > +				return entry;
> > +
> > +			zram_entry_free(zram, meta, entry);
> > +
> > +			return NULL;
> > +		}
> > +
> > +		if (checksum < entry->checksum)
> > +			rb_node = rb_node->rb_left;
> > +		else
> > +			rb_node = rb_node->rb_right;
> > +	}
> > +	spin_unlock(&hash->lock);
> > +
> > +	return NULL;
> > +}
> > +
> > +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem,
> > +				u32 *checksum)
> > +{
> > +	*checksum = zram_dedup_checksum(mem);
> > +
> > +	return zram_dedup_get(zram, mem, *checksum);
> > +}
> > +
> > +struct zram_entry *zram_dedup_alloc(struct zram *zram,
> > +				unsigned long handle, unsigned int len,
> > +				gfp_t flags)
> > +{
> > +	struct zram_entry *entry;
> > +
> > +	entry = kzalloc(sizeof(*entry),
> > +			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> 
> zram_entry_alloc can mask off GFP_HIGHMEM|__GFP_MOVABLE and
> it can pass filtered gfp flags to zs_malloc and zram_dedup_alloc.

Okay.

> > +	if (!entry)
> > +		return NULL;
> > +
> > +	entry->handle = handle;
> > +	RB_CLEAR_NODE(&entry->rb_node);
> 
> Ditto. I think we don't need to clear rb_node.

Okay.

> > +	entry->refcount = 1;
> > +	entry->len = len;
> > +
> > +	atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
> > +
> > +	return entry;
> > +}
> > +
> > +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta,
> > +			struct zram_entry *entry)
> > +{
> > +	unsigned long handle;
> > +
> > +	if (zram_dedup_put(zram, meta, entry))
> > +		return 0;
> > +
> > +	handle = entry->handle;
> > +	kfree(entry);
> > +
> > +	/* !zram happens when reset/fail and updating stat is useless */
> 
> With recent version, it's not true so your code would be more clean. :)

Good!

> 
> > +	if (zram)
> > +		atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
> > +
> > +	return handle;
> > +}
> > +
> > +int zram_dedup_init(struct zram_meta *meta, size_t num_pages)
> > +{
> > +	int i;
> > +	struct zram_hash *hash;
> > +
> > +	meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
> > +	meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
> > +	meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
> > +	meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
> > +	if (!meta->hash) {
> > +		pr_err("Error allocating zram entry hash\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	for (i = 0; i < meta->hash_size; i++) {
> > +		hash = &meta->hash[i];
> > +		spin_lock_init(&hash->lock);
> > +		hash->rb_root = RB_ROOT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +void zram_dedup_fini(struct zram_meta *meta)
> > +{
> > +	vfree(meta->hash);
> > +}
> > diff --git a/drivers/block/zram/zram_dedup.h b/drivers/block/zram/zram_dedup.h
> > new file mode 100644
> > index 0000000..7071f32
> > --- /dev/null
> > +++ b/drivers/block/zram/zram_dedup.h
> > @@ -0,0 +1,25 @@
> > +#ifndef _ZRAM_DEDUP_H_
> > +#define _ZRAM_DEDUP_H_
> > +
> > +struct zram;
> > +struct zram_meta;
> > +struct zram_entry;
> > +
> > +u64 zram_dedup_dup_size(struct zram *zram);
> > +u64 zram_dedup_meta_size(struct zram *zram);
> > +
> > +void zram_dedup_insert(struct zram *zram, struct zram_entry *new,
> > +				u32 checksum);
> > +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem,
> > +				u32 *checksum);
> > +
> > +struct zram_entry *zram_dedup_alloc(struct zram *zram,
> > +			unsigned long handle, unsigned int len,
> > +			gfp_t flags);
> > +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta,
> > +				struct zram_entry *entry);
> > +
> > +int zram_dedup_init(struct zram_meta *meta, size_t num_pages);
> > +void zram_dedup_fini(struct zram_meta *meta);
> > +
> > +#endif /* _ZRAM_DEDUP_H_ */
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index f3949da..15cecd6 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -385,14 +385,16 @@ static ssize_t mm_stat_show(struct device *dev,
> >  	max_used = atomic_long_read(&zram->stats.max_used_pages);
> >  
> >  	ret = scnprintf(buf, PAGE_SIZE,
> > -			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> > +			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> >  			orig_size << PAGE_SHIFT,
> >  			(u64)atomic64_read(&zram->stats.compr_data_size),
> >  			mem_used << PAGE_SHIFT,
> >  			zram->limit_pages << PAGE_SHIFT,
> >  			max_used << PAGE_SHIFT,
> >  			(u64)atomic64_read(&zram->stats.same_pages),
> > -			pool_stats.pages_compacted);
> > +			pool_stats.pages_compacted,
> > +			zram_dedup_dup_size(zram),
> > +			zram_dedup_meta_size(zram));
> >  	up_read(&zram->init_lock);
> >  
> >  	return ret;
> > @@ -424,25 +426,29 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram,
> >  {
> >  	struct zram_meta *meta = zram->meta;
> >  	struct zram_entry *entry;
> > +	unsigned long handle;
> >  
> > -	entry = kzalloc(sizeof(*entry), flags);
> > -	if (!entry)
> > +	handle = zs_malloc(meta->mem_pool, len, flags);
> > +	if (!handle)
> >  		return NULL;
> >  
> > -	entry->handle = zs_malloc(meta->mem_pool, len, flags);
> > -	if (!entry->handle) {
> > -		kfree(entry);
> > +	entry = zram_dedup_alloc(zram, handle, len, flags);
> 
> I don't like to hide zram_entry allocation behind zram_dedup_alloc.
> It's never wrong because zram_table allocation happens on only
> with dedup feature but zram_table is owned by zram_drv so I want
> for zram_drv to control zram_table alloc/free.

Okay.

> How about this?
> 
> zram_entry_alloc
>         unsigned long handle;
>         struct zram_entry *entry;
> 
>         handle = zs_malloc;
>         if (zram_dedup_enabled())
>                 return handle;
> 
>         entry = kzalloc();
>         zram_dedup_init(entry, handle, len);
> 
>         return entry;
> 
> 

I will change it as you suggested.

> > +	if (!entry) {
> > +		zs_free(meta->mem_pool, handle);
> >  		return NULL;
> >  	}
> >  
> >  	return entry;
> >  }
> >  
> > -static inline void zram_entry_free(struct zram_meta *meta,
> > -			struct zram_entry *entry)
> > +void zram_entry_free(struct zram *zram, struct zram_meta *meta,
> > +				struct zram_entry *entry)
> >  {
> > -	zs_free(meta->mem_pool, entry->handle);
> > -	kfree(entry);
> > +	unsigned long handle;
> > +
> > +	handle = zram_dedup_free(zram, meta, entry);
> 
> Ditto.
>         unsigned long handle = entry->handle;
> 
>         if (zram_dedup_put(zram, meta, entry))
>                 kfree(entry);
> 
>         zs_free(meta->mem_pool, handle);
> 
> With this, In [4/4], I want that zram_entry_handle
> doesn't rely on zram_dedup_handle.

Okay.

Thanks.