linux-kernel - Re: [PATCH v2 2/4] zram: implement deduplication in zram

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170417013816.GA18215@bbox>
Date:   Mon, 17 Apr 2017 10:38:16 +0900
From:   Minchan Kim <minchan@...nel.org>
To:     <js1304@...il.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        <linux-kernel@...r.kernel.org>, <kernel-team@....com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v2 2/4] zram: implement deduplication in zram

Hi Joonsoo,

I reviewed this patch and overall, looks great! Thanks.

However, as you know, recently, zram had lots of clean up so
this patchset should be rebased on it massively.
Sorry for the inconvenience.

And there are some minor in below. I hope you handle them in
next submit, please.

On Thu, Mar 30, 2017 at 02:38:07PM +0900, js1304@...il.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@....com>
> 
> This patch implements deduplication feature in zram. The purpose
> of this work is naturally to save amount of memory usage by zram.
> 
> Android is one of the biggest users to use zram as swap and it's
> really important to save amount of memory usage. There is a paper
> that reports that duplication ratio of Android's memory content is
> rather high [1]. And, there is a similar work on zswap that also
> reports that experiments has shown that around 10-15% of pages
> stored in zswp are duplicates and deduplicate them provides some
> benefits [2].
> 
> Also, there is a different kind of workload that uses zram as blockdev
> and store build outputs into it to reduce wear-out problem of real
> blockdev. In this workload, deduplication hit is very high due to
> temporary files and intermediate object files. Detailed analysis is
> on the bottom of this description.
> 
> Anyway, if we can detect duplicated content and avoid to store duplicated
> content at different memory space, we can save memory. This patch
> tries to do that.
> 
> Implementation is almost simple and intuitive but I should note
> one thing about implementation detail.
> 
> To check duplication, this patch uses checksum of the page and
> collision of this checksum could be possible. There would be
> many choices to handle this situation but this patch chooses
> to allow entry with duplicated checksum to be added to the hash,
> but, not to compare all entries with duplicated checksum
> when checking duplication. I guess that checksum collision is quite
> rare event and we don't need to pay any attention to such a case.
> Therefore, I decided the most simplest way to implement the feature.
> If there is a different opinion, I can accept and go that way.
> 
> Following is the result of this patch.
> 
> Test result #1 (Swap):
> Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
> 
> orig_data_size: 145297408
> compr_data_size: 32408125
> mem_used_total: 32276480
> dup_data_size: 3188134
> meta_data_size: 1444272
> 
> Last two metrics added to mm_stat are related to this work.
> First one, dup_data_size, is amount of saved memory by avoiding
> to store duplicated page. Later one, meta_data_size, is the amount of
> data structure to support deduplication. If dup > meta, we can judge
> that the patch improves memory usage.
> 
> In Adnroid, we can save 5% of memory usage by this work.
> 
> Test result #2 (Blockdev):
> build the kernel and store output to ext4 FS on zram
> 
> <no-dedup>
> Elapsed time: 249 s
> mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
> 
> <dedup>
> Elapsed time: 250 s
> mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038  3945792
> 
> There is no performance degradation and save 23% memory.
> 
> Test result #3 (Blockdev):
> copy android build output dir(out/host) to ext4 FS on zram
> 
> <no-dedup>
> Elapsed time: out/host: 88 s
> mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
> 
> <dedup>
> Elapsed time: out/host: 100 s
> mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
> 
> It shows performance degradation roughly 13% and save 24% memory. Maybe,
> it is due to overhead of calculating checksum and comparison.
> 
> Test result #4 (Blockdev):
> copy android build output dir(out/target/common) to ext4 FS on zram
> 
> <no-dedup>
> Elapsed time: out/host: 203 s
> mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
> 
> <dedup>
> Elapsed time: out/host: 201 s
> mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
> 
> Memory is saved by 42% and performance is the same. Even if there is overhead
> of calculating checksum and comparison, large hit ratio compensate it since
> hit leads to less compression attempt.
> 
> I checked the detailed reason of savings on kernel build workload and
> there are some cases that deduplication happens.
> 
> 1) *.cmd
> Build command is usually similar in one directory so content of these file
> are very similar. In my system, more than 789 lines in fs/ext4/.namei.o.cmd
> and fs/ext4/.inode.o.cmd are the same in 944 and 938 lines of the file,
> respectively.
> 
> 2) intermediate object files
> built-in.o and temporary object file have the similar contents. More than
> 50% of fs/ext4/ext4.o is the same with fs/ext4/built-in.o.
> 
> 3) vmlinux
> .tmp_vmlinux1 and .tmp_vmlinux2 and arch/x86/boo/compressed/vmlinux.bin
> have the similar contents.
> 
> Android test has similar case that some of object files(.class
> and .so) are similar with another ones.
> (./host/linux-x86/lib/libartd.so and
> ./host/linux-x86-lib/libartd-comiler.so)
> 
> Anyway, benefit seems to be largely dependent on the workload so
> following patch will make this feature optional. However, this feature
> can help some usecases so is deserved to be merged.
> 
> [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> dl.acm.org/citation.cfm?id=2797023
> [2]: zswap: Optimize compressed pool memory utilization,
> lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@...as3p2
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@....com>
> ---
>  Documentation/blockdev/zram.txt |   2 +
>  drivers/block/zram/Makefile     |   2 +-
>  drivers/block/zram/zram_dedup.c | 222 ++++++++++++++++++++++++++++++++++++++++
>  drivers/block/zram/zram_dedup.h |  25 +++++
>  drivers/block/zram/zram_drv.c   |  62 ++++++++---
>  drivers/block/zram/zram_drv.h   |  18 ++++
>  6 files changed, 314 insertions(+), 17 deletions(-)
>  create mode 100644 drivers/block/zram/zram_dedup.c
>  create mode 100644 drivers/block/zram/zram_dedup.h
> 
> diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
> index 4fced8a..2cdc303 100644
> --- a/Documentation/blockdev/zram.txt
> +++ b/Documentation/blockdev/zram.txt
> @@ -217,6 +217,8 @@ line of text and contains the following stats separated by whitespace:
>   same_pages       the number of same element filled pages written to this disk.
>                    No memory is allocated for such pages.
>   pages_compacted  the number of pages freed during compaction
> + dup_data_size	  deduplicated data size
> + meta_data_size	  the amount of metadata allocated for deduplication feature
>  
>  9) Deactivate:
>  	swapoff /dev/zram0
> diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile
> index 9e2b79e..29cb008 100644
> --- a/drivers/block/zram/Makefile
> +++ b/drivers/block/zram/Makefile
> @@ -1,3 +1,3 @@
> -zram-y	:=	zcomp.o zram_drv.o
> +zram-y	:=	zcomp.o zram_drv.o zram_dedup.o
>  
>  obj-$(CONFIG_ZRAM)	+=	zram.o
> diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
> new file mode 100644
> index 0000000..d313fc8
> --- /dev/null
> +++ b/drivers/block/zram/zram_dedup.c
> @@ -0,0 +1,222 @@
> +/*
> + * Copyright (C) 2017 Joonsoo Kim.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/vmalloc.h>
> +#include <linux/jhash.h>
> +
> +#include "zram_drv.h"
> +
> +/* One slot will contain 128 pages theoretically */
> +#define ZRAM_HASH_SHIFT		7
> +#define ZRAM_HASH_SIZE_MIN	(1 << 10)
> +#define ZRAM_HASH_SIZE_MAX	(1 << 31)
> +
> +u64 zram_dedup_dup_size(struct zram *zram)
> +{
> +	return (u64)atomic64_read(&zram->stats.dup_data_size);
> +}
> +
> +u64 zram_dedup_meta_size(struct zram *zram)
> +{
> +	return (u64)atomic64_read(&zram->stats.meta_data_size);
> +}
> +
> +static u32 zram_dedup_checksum(unsigned char *mem)
> +{
> +	return jhash(mem, PAGE_SIZE, 0);
> +}
> +
> +void zram_dedup_insert(struct zram *zram, struct zram_entry *new,
> +				u32 checksum)
> +{
> +	struct zram_meta *meta = zram->meta;
> +	struct zram_hash *hash;
> +	struct rb_root *rb_root;
> +	struct rb_node **rb_node, *parent = NULL;
> +	struct zram_entry *entry;
> +
> +	new->checksum = checksum;
> +	hash = &meta->hash[checksum % meta->hash_size];
> +	rb_root = &hash->rb_root;
> +
> +	spin_lock(&hash->lock);
> +	rb_node = &rb_root->rb_node;
> +	while (*rb_node) {
> +		parent = *rb_node;
> +		entry = rb_entry(parent, struct zram_entry, rb_node);
> +		if (checksum < entry->checksum)
> +			rb_node = &parent->rb_left;
> +		else if (checksum > entry->checksum)
> +			rb_node = &parent->rb_right;
> +		else
> +			rb_node = &parent->rb_left;
> +	}
> +
> +	rb_link_node(&new->rb_node, parent, rb_node);
> +	rb_insert_color(&new->rb_node, rb_root);
> +	spin_unlock(&hash->lock);
> +}
> +
> +static bool zram_dedup_match(struct zram *zram, struct zram_entry *entry,
> +				unsigned char *mem)
> +{
> +	bool match = false;
> +	unsigned char *cmem;
> +	struct zram_meta *meta = zram->meta;
> +	struct zcomp_strm *zstrm;
> +
> +	cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> +	if (entry->len == PAGE_SIZE) {
> +		match = !memcmp(mem, cmem, PAGE_SIZE);
> +	} else {
> +		zstrm = zcomp_stream_get(zram->comp);
> +		if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
> +			match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> +		zcomp_stream_put(zram->comp);
> +	}
> +	zs_unmap_object(meta->mem_pool, entry->handle);
> +
> +	return match;
> +}
> +
> +static unsigned long zram_dedup_put(struct zram *zram, struct zram_meta *meta,
> +				struct zram_entry *entry)
> +{
> +	struct zram_hash *hash;
> +	u32 checksum;
> +	unsigned long refcount;
> +
> +	checksum = entry->checksum;
> +	hash = &meta->hash[checksum % meta->hash_size];
> +
> +	spin_lock(&hash->lock);
> +	entry->refcount--;
> +	refcount = entry->refcount;
> +	if (!entry->refcount) {
> +		rb_erase(&entry->rb_node, &hash->rb_root);
> +		RB_CLEAR_NODE(&entry->rb_node);

What's the purpose of this RB_CLEAR_NODE?
I think we don't need it.

> +	} else if (zram) {
> +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> +	}
> +	spin_unlock(&hash->lock);
> +
> +	return refcount;
> +}
> +
> +static struct zram_entry *zram_dedup_get(struct zram *zram,
> +				unsigned char *mem, u32 checksum)
> +{
> +	struct zram_meta *meta = zram->meta;
> +	struct zram_hash *hash;
> +	struct zram_entry *entry;
> +	struct rb_node *rb_node;
> +
> +	hash = &meta->hash[checksum % meta->hash_size];
> +
> +	spin_lock(&hash->lock);
> +	rb_node = hash->rb_root.rb_node;
> +	while (rb_node) {
> +		entry = rb_entry(rb_node, struct zram_entry, rb_node);
> +		if (checksum == entry->checksum) {
> +			entry->refcount++;
> +			atomic64_add(entry->len, &zram->stats.dup_data_size);
> +			spin_unlock(&hash->lock);
> +
> +			if (zram_dedup_match(zram, entry, mem))
> +				return entry;
> +
> +			zram_entry_free(zram, meta, entry);
> +
> +			return NULL;
> +		}
> +
> +		if (checksum < entry->checksum)
> +			rb_node = rb_node->rb_left;
> +		else
> +			rb_node = rb_node->rb_right;
> +	}
> +	spin_unlock(&hash->lock);
> +
> +	return NULL;
> +}
> +
> +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem,
> +				u32 *checksum)
> +{
> +	*checksum = zram_dedup_checksum(mem);
> +
> +	return zram_dedup_get(zram, mem, *checksum);
> +}
> +
> +struct zram_entry *zram_dedup_alloc(struct zram *zram,
> +				unsigned long handle, unsigned int len,
> +				gfp_t flags)
> +{
> +	struct zram_entry *entry;
> +
> +	entry = kzalloc(sizeof(*entry),
> +			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));

zram_entry_alloc can mask off GFP_HIGHMEM|__GFP_MOVABLE and
it can pass filtered gfp flags to zs_malloc and zram_dedup_alloc.

> +	if (!entry)
> +		return NULL;
> +
> +	entry->handle = handle;
> +	RB_CLEAR_NODE(&entry->rb_node);

Ditto. I think we don't need to clear rb_node.

> +	entry->refcount = 1;
> +	entry->len = len;
> +
> +	atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
> +
> +	return entry;
> +}
> +
> +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta,
> +			struct zram_entry *entry)
> +{
> +	unsigned long handle;
> +
> +	if (zram_dedup_put(zram, meta, entry))
> +		return 0;
> +
> +	handle = entry->handle;
> +	kfree(entry);
> +
> +	/* !zram happens when reset/fail and updating stat is useless */

With recent version, it's not true so your code would be more clean. :)

> +	if (zram)
> +		atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
> +
> +	return handle;
> +}
> +
> +int zram_dedup_init(struct zram_meta *meta, size_t num_pages)
> +{
> +	int i;
> +	struct zram_hash *hash;
> +
> +	meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
> +	meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
> +	meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
> +	meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
> +	if (!meta->hash) {
> +		pr_err("Error allocating zram entry hash\n");
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < meta->hash_size; i++) {
> +		hash = &meta->hash[i];
> +		spin_lock_init(&hash->lock);
> +		hash->rb_root = RB_ROOT;
> +	}
> +
> +	return 0;
> +}
> +
> +void zram_dedup_fini(struct zram_meta *meta)
> +{
> +	vfree(meta->hash);
> +}
> diff --git a/drivers/block/zram/zram_dedup.h b/drivers/block/zram/zram_dedup.h
> new file mode 100644
> index 0000000..7071f32
> --- /dev/null
> +++ b/drivers/block/zram/zram_dedup.h
> @@ -0,0 +1,25 @@
> +#ifndef _ZRAM_DEDUP_H_
> +#define _ZRAM_DEDUP_H_
> +
> +struct zram;
> +struct zram_meta;
> +struct zram_entry;
> +
> +u64 zram_dedup_dup_size(struct zram *zram);
> +u64 zram_dedup_meta_size(struct zram *zram);
> +
> +void zram_dedup_insert(struct zram *zram, struct zram_entry *new,
> +				u32 checksum);
> +struct zram_entry *zram_dedup_find(struct zram *zram, unsigned char *mem,
> +				u32 *checksum);
> +
> +struct zram_entry *zram_dedup_alloc(struct zram *zram,
> +			unsigned long handle, unsigned int len,
> +			gfp_t flags);
> +unsigned long zram_dedup_free(struct zram *zram, struct zram_meta *meta,
> +				struct zram_entry *entry);
> +
> +int zram_dedup_init(struct zram_meta *meta, size_t num_pages);
> +void zram_dedup_fini(struct zram_meta *meta);
> +
> +#endif /* _ZRAM_DEDUP_H_ */
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index f3949da..15cecd6 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -385,14 +385,16 @@ static ssize_t mm_stat_show(struct device *dev,
>  	max_used = atomic_long_read(&zram->stats.max_used_pages);
>  
>  	ret = scnprintf(buf, PAGE_SIZE,
> -			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> +			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
>  			orig_size << PAGE_SHIFT,
>  			(u64)atomic64_read(&zram->stats.compr_data_size),
>  			mem_used << PAGE_SHIFT,
>  			zram->limit_pages << PAGE_SHIFT,
>  			max_used << PAGE_SHIFT,
>  			(u64)atomic64_read(&zram->stats.same_pages),
> -			pool_stats.pages_compacted);
> +			pool_stats.pages_compacted,
> +			zram_dedup_dup_size(zram),
> +			zram_dedup_meta_size(zram));
>  	up_read(&zram->init_lock);
>  
>  	return ret;
> @@ -424,25 +426,29 @@ static struct zram_entry *zram_entry_alloc(struct zram *zram,
>  {
>  	struct zram_meta *meta = zram->meta;
>  	struct zram_entry *entry;
> +	unsigned long handle;
>  
> -	entry = kzalloc(sizeof(*entry), flags);
> -	if (!entry)
> +	handle = zs_malloc(meta->mem_pool, len, flags);
> +	if (!handle)
>  		return NULL;
>  
> -	entry->handle = zs_malloc(meta->mem_pool, len, flags);
> -	if (!entry->handle) {
> -		kfree(entry);
> +	entry = zram_dedup_alloc(zram, handle, len, flags);

I don't like to hide zram_entry allocation behind zram_dedup_alloc.
It's never wrong because zram_table allocation happens on only
with dedup feature but zram_table is owned by zram_drv so I want
for zram_drv to control zram_table alloc/free.

How about this?

zram_entry_alloc
        unsigned long handle;
        struct zram_entry *entry;

        handle = zs_malloc;
        if (zram_dedup_enabled())
                return handle;

        entry = kzalloc();
        zram_dedup_init(entry, handle, len);

        return entry;


> +	if (!entry) {
> +		zs_free(meta->mem_pool, handle);
>  		return NULL;
>  	}
>  
>  	return entry;
>  }
>  
> -static inline void zram_entry_free(struct zram_meta *meta,
> -			struct zram_entry *entry)
> +void zram_entry_free(struct zram *zram, struct zram_meta *meta,
> +				struct zram_entry *entry)
>  {
> -	zs_free(meta->mem_pool, entry->handle);
> -	kfree(entry);
> +	unsigned long handle;
> +
> +	handle = zram_dedup_free(zram, meta, entry);

Ditto.
        unsigned long handle = entry->handle;

        if (zram_dedup_put(zram, meta, entry))
                kfree(entry);

        zs_free(meta->mem_pool, handle);

With this, In [4/4], I want that zram_entry_handle
doesn't rely on zram_dedup_handle.

unsigned long zram_entry_handle(strucdt zram *zram, strucdt zram_entry *entry)
{
        unsigned long handle;

        if (!zram_dedup_enabled(zram->meta))
                handle = (unsigned long)entry;
        else
                handle = entry->handle;
        return handle;
}

> +	if (handle)
> +		zs_free(meta->mem_pool, handle);
>  }