linux-kernel - Re: [PATCHv5 1/8] zsmalloc: add to mm/

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5123BC4D.1010404@linux.vnet.ibm.com>
Date:	Tue, 19 Feb 2013 11:54:21 -0600
From:	Seth Jennings <sjenning@...ux.vnet.ibm.com>
To:	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Nitin Gupta <ngupta@...are.org>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Minchan Kim <minchan@...nel.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Dan Magenheimer <dan.magenheimer@...cle.com>,
	Robert Jennings <rcj@...ux.vnet.ibm.com>,
	Jenifer Hopper <jhopper@...ibm.com>,
	Mel Gorman <mgorman@...e.de>,
	Johannes Weiner <jweiner@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Joe Perches <joe@...ches.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, devel@...verdev.osuosl.org
Subject: Re: [PATCHv5 1/8] zsmalloc: add to mm/

On 02/19/2013 03:18 AM, Joonsoo Kim wrote:
> Hello, Seth.
> I'm not sure that this is right time to review, because I already have
> seen many effort of various people to promote zxxx series. I don't want to
> be a stopper to promote these. :)

Any time is good review time :)  Thanks for your review!

> 
> But, I read the code, now, and then some comments below.
> 
> On Wed, Feb 13, 2013 at 12:38:44PM -0600, Seth Jennings wrote:
>> =========
>> DO NOT MERGE, FOR REVIEW ONLY
>> This patch introduces zsmalloc as new code, however, it already
>> exists in drivers/staging.  In order to build successfully, you
>> must select EITHER to driver/staging version OR this version.
>> Once zsmalloc is reviewed in this format (and hopefully accepted),
>> I will create a new patchset that properly promotes zsmalloc from
>> staging.
>> =========
>>
>> This patchset introduces a new slab-based memory allocator,
>> zsmalloc, for storing compressed pages.  It is designed for
>> low fragmentation and high allocation success rate on
>> large object, but <= PAGE_SIZE allocations.
>>
>> zsmalloc differs from the kernel slab allocator in two primary
>> ways to achieve these design goals.
>>
>> zsmalloc never requires high order page allocations to back
>> slabs, or "size classes" in zsmalloc terms. Instead it allows
>> multiple single-order pages to be stitched together into a
>> "zspage" which backs the slab.  This allows for higher allocation
>> success rate under memory pressure.
>>
>> Also, zsmalloc allows objects to span page boundaries within the
>> zspage.  This allows for lower fragmentation than could be had
>> with the kernel slab allocator for objects between PAGE_SIZE/2
>> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
>> to 60% of it original size, the memory savings gained through
>> compression is lost in fragmentation because another object of
>> the same size can't be stored in the leftover space.
>>
>> This ability to span pages results in zsmalloc allocations not being
>> directly addressable by the user.  The user is given an
>> non-dereferencable handle in response to an allocation request.
>> That handle must be mapped, using zs_map_object(), which returns
>> a pointer to the mapped region that can be used.  The mapping is
>> necessary since the object data may reside in two different
>> noncontigious pages.
>>
>> zsmalloc fulfills the allocation needs for zram and zswap.
>>
>> Acked-by: Nitin Gupta <ngupta@...are.org>
>> Acked-by: Minchan Kim <minchan@...nel.org>
>> Signed-off-by: Seth Jennings <sjenning@...ux.vnet.ibm.com>
>> ---
>>  include/linux/zsmalloc.h |   49 ++
>>  mm/Kconfig               |   24 +
>>  mm/Makefile              |    1 +
>>  mm/zsmalloc.c            | 1124 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 1198 insertions(+)
>>  create mode 100644 include/linux/zsmalloc.h
>>  create mode 100644 mm/zsmalloc.c
>>
>> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
>> new file mode 100644
>> index 0000000..eb6efb6
>> --- /dev/null
>> +++ b/include/linux/zsmalloc.h
>> @@ -0,0 +1,49 @@
>> +/*
>> + * zsmalloc memory allocator
>> + *
>> + * Copyright (C) 2011  Nitin Gupta
>> + *
>> + * This code is released using a dual license strategy: BSD/GPL
>> + * You can choose the license that better fits your requirements.
>> + *
>> + * Released under the terms of 3-clause BSD License
>> + * Released under the terms of GNU General Public License Version 2.0
>> + */
>> +
>> +#ifndef _ZS_MALLOC_H_
>> +#define _ZS_MALLOC_H_
>> +
>> +#include <linux/types.h>
>> +#include <linux/mm_types.h>
>> +
>> +/*
>> + * zsmalloc mapping modes
>> + *
>> + * NOTE: These only make a difference when a mapped object spans pages
>> +*/
>> +enum zs_mapmode {
>> +	ZS_MM_RW, /* normal read-write mapping */
>> +	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
>> +	ZS_MM_WO /* write-only (no copy-in at map time) */
>> +};
> 
> 
> These makes no difference for PGTABLE_MAPPING.
> Please add some comment for this.

Yes. Will do.

> 
>> +struct zs_ops {
>> +	struct page * (*alloc)(gfp_t);
>> +	void (*free)(struct page *);
>> +};
>> +
>> +struct zs_pool;
>> +
>> +struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
>> +void zs_destroy_pool(struct zs_pool *pool);
>> +
>> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
>> +void zs_free(struct zs_pool *pool, unsigned long obj);
>> +
>> +void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>> +			enum zs_mapmode mm);
>> +void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
>> +
>> +u64 zs_get_total_size_bytes(struct zs_pool *pool);
>> +
>> +#endif
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 278e3ab..25b8f38 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -446,3 +446,27 @@ config FRONTSWAP
>>  	  and swap data is stored as normal on the matching swap device.
>>  
>>  	  If unsure, say Y to enable frontswap.
>> +
>> +config ZSMALLOC
>> +	tristate "Memory allocator for compressed pages"
>> +	default n
>> +	help
>> +	  zsmalloc is a slab-based memory allocator designed to store
>> +	  compressed RAM pages.  zsmalloc uses virtual memory mapping
>> +	  in order to reduce fragmentation.  However, this results in a
>> +	  non-standard allocator interface where a handle, not a pointer, is
>> +	  returned by an alloc().  This handle must be mapped in order to
>> +	  access the allocated space.
>> +
>> +config PGTABLE_MAPPING
>> +	bool "Use page table mapping to access object in zsmalloc"
>> +	depends on ZSMALLOC
>> +	help
>> +	  By default, zsmalloc uses a copy-based object mapping method to
>> +	  access allocations that span two pages. However, if a particular
>> +	  architecture (ex, ARM) performs VM mapping faster than copying,
>> +	  then you should select this. This causes zsmalloc to use page table
>> +	  mapping rather than copying for object mapping.
>> +
>> +	  You can check speed with zsmalloc benchmark[1].
>> +	  [1] https://github.com/spartacus06/zsmalloc
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 3a46287..0f6ef0a 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>> +obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> new file mode 100644
>> index 0000000..34378ef
>> --- /dev/null
>> +++ b/mm/zsmalloc.c
>> @@ -0,0 +1,1124 @@
>> +/*
>> + * zsmalloc memory allocator
>> + *
>> + * Copyright (C) 2011  Nitin Gupta
>> + *
>> + * This code is released using a dual license strategy: BSD/GPL
>> + * You can choose the license that better fits your requirements.
>> + *
>> + * Released under the terms of 3-clause BSD License
>> + * Released under the terms of GNU General Public License Version 2.0
>> + */
>> +
>> +
>> +/*
>> + * This allocator is designed for use with zcache and zram. Thus, the
>> + * allocator is supposed to work well under low memory conditions. In
>> + * particular, it never attempts higher order page allocation which is
>> + * very likely to fail under memory pressure. On the other hand, if we
>> + * just use single (0-order) pages, it would suffer from very high
>> + * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy
>> + * an entire page. This was one of the major issues with its predecessor
>> + * (xvmalloc).
>> + *
>> + * To overcome these issues, zsmalloc allocates a bunch of 0-order pages
>> + * and links them together using various 'struct page' fields. These linked
>> + * pages act as a single higher-order page i.e. an object can span 0-order
>> + * page boundaries. The code refers to these linked pages as a single entity
>> + * called zspage.
>> + *
>> + * For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
>> + * since this satisfies the requirements of all its current users (in the
>> + * worst case, page is incompressible and is thus stored "as-is" i.e. in
>> + * uncompressed form). For allocation requests larger than this size, failure
>> + * is returned (see zs_malloc).
>> + *
>> + * Additionally, zs_malloc() does not return a dereferenceable pointer.
>> + * Instead, it returns an opaque handle (unsigned long) which encodes actual
>> + * location of the allocated object. The reason for this indirection is that
>> + * zsmalloc does not keep zspages permanently mapped since that would cause
>> + * issues on 32-bit systems where the VA region for kernel space mappings
>> + * is very small. So, before using the allocating memory, the object has to
>> + * be mapped using zs_map_object() to get a usable pointer and subsequently
>> + * unmapped using zs_unmap_object().
>> + *
>> + * Following is how we use various fields and flags of underlying
>> + * struct page(s) to form a zspage.
>> + *
>> + * Usage of struct page fields:
>> + *	page->first_page: points to the first component (0-order) page
>> + *	page->index (union with page->freelist): offset of the first object
>> + *		starting in this page. For the first page, this is
>> + *		always 0, so we use this field (aka freelist) to point
>> + *		to the first free object in zspage.
>> + *	page->lru: links together all component pages (except the first page)
>> + *		of a zspage
>> + *
>> + *	For _first_ page only:
>> + *
>> + *	page->private (union with page->first_page): refers to the
>> + *		component page after the first page
>> + *	page->freelist: points to the first free object in zspage.
>> + *		Free objects are linked together using in-place
>> + *		metadata.
>> + *	page->objects: maximum number of objects we can store in this
>> + *		zspage (class->zspage_order * PAGE_SIZE / class->size)
> 
> How about just embedding maximum number of objects to size_class?
> For the SLUB, each slab can have difference number of objects.
> But, for the zsmalloc, it is not possible, so there is no reason
> to maintain it within metadata of zspage. Just to embed it to size_class
> is sufficient.

Yes, a little code massaging and this can go away.

However, there might be some value in having variable sized zspages in
the same size_class.  It could improve allocation success rate at the
expense of efficiency by not failing in alloc_zspage() if we can't
allocate the optimal number of pages.  As long as we can allocate the
first page, then we can proceed.

Nitin care to weigh in?

> 
> 
>> + *	page->lru: links together first pages of various zspages.
>> + *		Basically forming list of zspages in a fullness group.
>> + *	page->mapping: class index and fullness group of the zspage
>> + *
>> + * Usage of struct page flags:
>> + *	PG_private: identifies the first component page
>> + *	PG_private2: identifies the last component page
>> + *
>> + */
>> +
>> +#ifdef CONFIG_ZSMALLOC_DEBUG
>> +#define DEBUG
>> +#endif
> 
> Is this obsolete?

Yes, I'll remove it.

> 
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/bitops.h>
>> +#include <linux/errno.h>
>> +#include <linux/highmem.h>
>> +#include <linux/init.h>
>> +#include <linux/string.h>
>> +#include <linux/slab.h>
>> +#include <asm/tlbflush.h>
>> +#include <asm/pgtable.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/cpu.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/hardirq.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/types.h>
>> +
>> +#include <linux/zsmalloc.h>
>> +
>> +/*
>> + * This must be power of 2 and greater than of equal to sizeof(link_free).
>> + * These two conditions ensure that any 'struct link_free' itself doesn't
>> + * span more than 1 page which avoids complex case of mapping 2 pages simply
>> + * to restore link_free pointer values.
>> + */
>> +#define ZS_ALIGN		8
>> +
>> +/*
>> + * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
>> + * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
>> + */
>> +#define ZS_MAX_ZSPAGE_ORDER 2
>> +#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
>> +
>> +/*
>> + * Object location (<PFN>, <obj_idx>) is encoded as
>> + * as single (unsigned long) handle value.
>> + *
>> + * Note that object index <obj_idx> is relative to system
>> + * page <PFN> it is stored in, so for each sub-page belonging
>> + * to a zspage, obj_idx starts with 0.
>> + *
>> + * This is made more complicated by various memory models and PAE.
>> + */
>> +
>> +#ifndef MAX_PHYSMEM_BITS
>> +#ifdef CONFIG_HIGHMEM64G
>> +#define MAX_PHYSMEM_BITS 36
>> +#else /* !CONFIG_HIGHMEM64G */
>> +/*
>> + * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>> + * be PAGE_SHIFT
>> + */
>> +#define MAX_PHYSMEM_BITS BITS_PER_LONG
>> +#endif
>> +#endif
>> +#define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> +#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS)
>> +#define OBJ_INDEX_MASK	((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
>> +
>> +#define MAX(a, b) ((a) >= (b) ? (a) : (b))
>> +/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
>> +#define ZS_MIN_ALLOC_SIZE \
>> +	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
>> +#define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
>> +
>> +/*
>> + * On systems with 4K page size, this gives 254 size classes! There is a
>> + * trader-off here:
>> + *  - Large number of size classes is potentially wasteful as free page are
>> + *    spread across these classes
>> + *  - Small number of size classes causes large internal fragmentation
>> + *  - Probably its better to use specific size classes (empirically
>> + *    determined). NOTE: all those class sizes must be set as multiple of
>> + *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
>> + *
>> + *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
>> + *  (reason above)
>> + */
>> +#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> 8)
>> +#define ZS_SIZE_CLASSES		((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \
>> +					ZS_SIZE_CLASS_DELTA + 1)
>> +
>> +/*
>> + * We do not maintain any list for completely empty or full pages
>> + */
>> +enum fullness_group {
>> +	ZS_ALMOST_FULL,
>> +	ZS_ALMOST_EMPTY,
>> +	_ZS_NR_FULLNESS_GROUPS,
>> +
>> +	ZS_EMPTY,
>> +	ZS_FULL
>> +};
>> +
>> +/*
>> + * We assign a page to ZS_ALMOST_EMPTY fullness group when:
>> + *	n <= N / f, where
>> + * n = number of allocated objects
>> + * N = total number of objects zspage can store
>> + * f = 1/fullness_threshold_frac
>> + *
>> + * Similarly, we assign zspage to:
>> + *	ZS_ALMOST_FULL	when n > N / f
>> + *	ZS_EMPTY	when n == 0
>> + *	ZS_FULL		when n == N
>> + *
>> + * (see: fix_fullness_group())
>> + */
>> +static const int fullness_threshold_frac = 4;
>> +
>> +struct size_class {
>> +	/*
>> +	 * Size of objects stored in this class. Must be multiple
>> +	 * of ZS_ALIGN.
>> +	 */
>> +	int size;
>> +	unsigned int index;
>> +
>> +	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
>> +	int pages_per_zspage;
>> +
>> +	spinlock_t lock;
>> +
>> +	/* stats */
>> +	u64 pages_allocated;
>> +
>> +	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
>> +};
> 
> Instead of simple pointer, how about using list_head?
> With this, fullness_list management is easily consolidated to
> set_zspage_mapping() and we can remove remove_zspage(), insert_zspage().

Makes sense to me.  Nitin what do you think?

> And how about maintaining FULL, EMPTY list?
> There is not much memory waste and it can be used for debugging and
> implementing other functionality.

The EMPTY list would always be empty.  There might be some merit to
maintaining a FULL list though just so zsmalloc always has a handle on
every zspage.

> 
>> +
>> +/*
>> + * Placed within free objects to form a singly linked list.
>> + * For every zspage, first_page->freelist gives head of this list.
>> + *
>> + * This must be power of 2 and less than or equal to ZS_ALIGN
>> + */
>> +struct link_free {
>> +	/* Handle of next free chunk (encodes <PFN, obj_idx>) */
>> +	void *next;
>> +};
>> +
>> +struct zs_pool {
>> +	struct size_class size_class[ZS_SIZE_CLASSES];
>> +
>> +	struct zs_ops *ops;
>> +};
>> +
>> +/*
>> + * A zspage's class index and fullness group
>> + * are encoded in its (first)page->mapping
>> + */
>> +#define CLASS_IDX_BITS	28
>> +#define FULLNESS_BITS	4
>> +#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
>> +#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
>> +
>> +struct mapping_area {
>> +#ifdef CONFIG_PGTABLE_MAPPING
>> +	struct vm_struct *vm; /* vm area for mapping object that span pages */
>> +#else
>> +	char *vm_buf; /* copy buffer for objects that span pages */
>> +#endif
>> +	char *vm_addr; /* address of kmap_atomic()'ed pages */
>> +	enum zs_mapmode vm_mm; /* mapping mode */
>> +};
>> +
>> +/* default page alloc/free ops */
>> +struct page *zs_alloc_page(gfp_t flags)
>> +{
>> +	return alloc_page(flags);
>> +}
>> +
>> +void zs_free_page(struct page *page)
>> +{
>> +	__free_page(page);
>> +}
>> +
>> +struct zs_ops zs_default_ops = {
>> +	.alloc = zs_alloc_page,
>> +	.free = zs_free_page
>> +};
>> +
>> +/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
>> +static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
>> +
>> +static int is_first_page(struct page *page)
>> +{
>> +	return PagePrivate(page);
>> +}
>> +
>> +static int is_last_page(struct page *page)
>> +{
>> +	return PagePrivate2(page);
>> +}
>> +
>> +static void get_zspage_mapping(struct page *page, unsigned int *class_idx,
>> +				enum fullness_group *fullness)
>> +{
>> +	unsigned long m;
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	m = (unsigned long)page->mapping;
>> +	*fullness = m & FULLNESS_MASK;
>> +	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
>> +}
>> +
>> +static void set_zspage_mapping(struct page *page, unsigned int class_idx,
>> +				enum fullness_group fullness)
>> +{
>> +	unsigned long m;
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
>> +			(fullness & FULLNESS_MASK);
>> +	page->mapping = (struct address_space *)m;
>> +}
>> +
>> +/*
>> + * zsmalloc divides the pool into various size classes where each
>> + * class maintains a list of zspages where each zspage is divided
>> + * into equal sized chunks. Each allocation falls into one of these
>> + * classes depending on its size. This function returns index of the
>> + * size class which has chunk size big enough to hold the give size.
>> + */
>> +static int get_size_class_index(int size)
>> +{
>> +	int idx = 0;
>> +
>> +	if (likely(size > ZS_MIN_ALLOC_SIZE))
>> +		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
>> +				ZS_SIZE_CLASS_DELTA);
>> +
>> +	return idx;
>> +}
>> +
>> +/*
>> + * For each size class, zspages are divided into different groups
>> + * depending on how "full" they are. This was done so that we could
>> + * easily find empty or nearly empty zspages when we try to shrink
>> + * the pool (not yet implemented). This function returns fullness
>> + * status of the given page.
>> + */
>> +static enum fullness_group get_fullness_group(struct page *page)
>> +{
>> +	int inuse, max_objects;
>> +	enum fullness_group fg;
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	inuse = page->inuse;
>> +	max_objects = page->objects;
>> +
>> +	if (inuse == 0)
>> +		fg = ZS_EMPTY;
>> +	else if (inuse == max_objects)
>> +		fg = ZS_FULL;
>> +	else if (inuse <= max_objects / fullness_threshold_frac)
>> +		fg = ZS_ALMOST_EMPTY;
>> +	else
>> +		fg = ZS_ALMOST_FULL;
>> +
>> +	return fg;
>> +}
>> +
>> +/*
>> + * Each size class maintains various freelists and zspages are assigned
>> + * to one of these freelists based on the number of live objects they
>> + * have. This functions inserts the given zspage into the freelist
>> + * identified by <class, fullness_group>.
>> + */
>> +static void insert_zspage(struct page *page, struct size_class *class,
>> +				enum fullness_group fullness)
>> +{
>> +	struct page **head;
>> +
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
>> +		return;
>> +
>> +	head = &class->fullness_list[fullness];
>> +	if (*head)
>> +		list_add_tail(&page->lru, &(*head)->lru);
>> +
>> +	*head = page;
>> +}
>> +
>> +/*
>> + * This function removes the given zspage from the freelist identified
>> + * by <class, fullness_group>.
>> + */
>> +static void remove_zspage(struct page *page, struct size_class *class,
>> +				enum fullness_group fullness)
>> +{
>> +	struct page **head;
>> +
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
>> +		return;
>> +
>> +	head = &class->fullness_list[fullness];
>> +	BUG_ON(!*head);
>> +	if (list_empty(&(*head)->lru))
>> +		*head = NULL;
>> +	else if (*head == page)
>> +		*head = (struct page *)list_entry((*head)->lru.next,
>> +					struct page, lru);
>> +
>> +	list_del_init(&page->lru);
>> +}
>> +
>> +/*
>> + * Each size class maintains zspages in different fullness groups depending
>> + * on the number of live objects they contain. When allocating or freeing
>> + * objects, the fullness status of the page can change, say, from ALMOST_FULL
>> + * to ALMOST_EMPTY when freeing an object. This function checks if such
>> + * a status change has occurred for the given page and accordingly moves the
>> + * page from the freelist of the old fullness group to that of the new
>> + * fullness group.
>> + */
>> +static enum fullness_group fix_fullness_group(struct zs_pool *pool,
>> +						struct page *page)
>> +{
>> +	int class_idx;
>> +	struct size_class *class;
>> +	enum fullness_group currfg, newfg;
>> +
>> +	BUG_ON(!is_first_page(page));
>> +
>> +	get_zspage_mapping(page, &class_idx, &currfg);
>> +	newfg = get_fullness_group(page);
>> +	if (newfg == currfg)
>> +		goto out;
>> +
>> +	class = &pool->size_class[class_idx];
>> +	remove_zspage(page, class, currfg);
>> +	insert_zspage(page, class, newfg);
>> +	set_zspage_mapping(page, class_idx, newfg);
>> +
>> +out:
>> +	return newfg;
>> +}
>> +
>> +/*
>> + * We have to decide on how many pages to link together
>> + * to form a zspage for each size class. This is important
>> + * to reduce wastage due to unusable space left at end of
>> + * each zspage which is given as:
>> + *	wastage = Zp - Zp % size_class
>> + * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ...
>> + *
>> + * For example, for size class of 3/8 * PAGE_SIZE, we should
>> + * link together 3 PAGE_SIZE sized pages to form a zspage
>> + * since then we can perfectly fit in 8 such objects.
>> + */
>> +static int get_pages_per_zspage(int class_size)
>> +{
>> +	int i, max_usedpc = 0;
>> +	/* zspage order which gives maximum used size per KB */
>> +	int max_usedpc_order = 1;
>> +
>> +	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
>> +		int zspage_size;
>> +		int waste, usedpc;
>> +
>> +		zspage_size = i * PAGE_SIZE;
>> +		waste = zspage_size % class_size;
>> +		usedpc = (zspage_size - waste) * 100 / zspage_size;
>> +
>> +		if (usedpc > max_usedpc) {
>> +			max_usedpc = usedpc;
>> +			max_usedpc_order = i;
>> +		}
>> +	}
>> +
>> +	return max_usedpc_order;
>> +}
>> +
>> +/*
>> + * A single 'zspage' is composed of many system pages which are
>> + * linked together using fields in struct page. This function finds
>> + * the first/head page, given any component page of a zspage.
>> + */
>> +static struct page *get_first_page(struct page *page)
>> +{
>> +	if (is_first_page(page))
>> +		return page;
>> +	else
>> +		return page->first_page;
>> +}
>> +
>> +static struct page *get_next_page(struct page *page)
>> +{
>> +	struct page *next;
>> +
>> +	if (is_last_page(page))
>> +		next = NULL;
>> +	else if (is_first_page(page))
>> +		next = (struct page *)page->private;
>> +	else
>> +		next = list_entry(page->lru.next, struct page, lru);
>> +
>> +	return next;
>> +}
>> +
>> +/* Encode <page, obj_idx> as a single handle value */
>> +static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
>> +{
>> +	unsigned long handle;
>> +
>> +	if (!page) {
>> +		BUG_ON(obj_idx);
>> +		return NULL;
>> +	}
>> +
>> +	handle = page_to_pfn(page) << OBJ_INDEX_BITS;
>> +	handle |= (obj_idx & OBJ_INDEX_MASK);
>> +
>> +	return (void *)handle;
>> +}
>> +
>> +/* Decode <page, obj_idx> pair from the given object handle */
>> +static void obj_handle_to_location(unsigned long handle, struct page **page,
>> +				unsigned long *obj_idx)
>> +{
>> +	*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
>> +	*obj_idx = handle & OBJ_INDEX_MASK;
>> +}
>> +
>> +static unsigned long obj_idx_to_offset(struct page *page,
>> +				unsigned long obj_idx, int class_size)
>> +{
>> +	unsigned long off = 0;
>> +
>> +	if (!is_first_page(page))
>> +		off = page->index;
>> +
>> +	return off + obj_idx * class_size;
>> +}
>> +
>> +static void reset_page(struct page *page)
>> +{
>> +	clear_bit(PG_private, &page->flags);
>> +	clear_bit(PG_private_2, &page->flags);
>> +	set_page_private(page, 0);
>> +	page->mapping = NULL;
>> +	page->freelist = NULL;
>> +	reset_page_mapcount(page);
>> +}
>> +
>> +static void free_zspage(struct zs_ops *ops, struct page *first_page)
>> +{
>> +	struct page *nextp, *tmp, *head_extra;
>> +
>> +	BUG_ON(!is_first_page(first_page));
>> +	BUG_ON(first_page->inuse);
>> +
>> +	head_extra = (struct page *)page_private(first_page);
>> +
>> +	reset_page(first_page);
>> +	ops->free(first_page);
>> +
>> +	/* zspage with only 1 system page */
>> +	if (!head_extra)
>> +		return;
>> +
>> +	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
>> +		list_del(&nextp->lru);
>> +		reset_page(nextp);
>> +		ops->free(nextp);
>> +	}
>> +	reset_page(head_extra);
>> +	ops->free(head_extra);
>> +}
>> +
>> +/* Initialize a newly allocated zspage */
>> +static void init_zspage(struct page *first_page, struct size_class *class)
>> +{
>> +	unsigned long off = 0;
>> +	struct page *page = first_page;
>> +
>> +	BUG_ON(!is_first_page(first_page));
>> +	while (page) {
>> +		struct page *next_page;
>> +		struct link_free *link;
>> +		unsigned int i, objs_on_page;
>> +
>> +		/*
>> +		 * page->index stores offset of first object starting
>> +		 * in the page. For the first page, this is always 0,
>> +		 * so we use first_page->index (aka ->freelist) to store
>> +		 * head of corresponding zspage's freelist.
>> +		 */
>> +		if (page != first_page)
>> +			page->index = off;
>> +
>> +		link = (struct link_free *)kmap_atomic(page) +
>> +						off / sizeof(*link);
>> +		objs_on_page = (PAGE_SIZE - off) / class->size;
>> +
>> +		for (i = 1; i <= objs_on_page; i++) {
>> +			off += class->size;
>> +			if (off < PAGE_SIZE) {
>> +				link->next = obj_location_to_handle(page, i);
>> +				link += class->size / sizeof(*link);
>> +			}
>> +		}
>> +
>> +		/*
>> +		 * We now come to the last (full or partial) object on this
>> +		 * page, which must point to the first object on the next
>> +		 * page (if present)
>> +		 */
>> +		next_page = get_next_page(page);
>> +		link->next = obj_location_to_handle(next_page, 0);
>> +		kunmap_atomic(link);
>> +		page = next_page;
>> +		off = (off + class->size) % PAGE_SIZE;
>> +	}
>> +}
>> +
>> +/*
>> + * Allocate a zspage for the given size class
>> + */
>> +static struct page *alloc_zspage(struct zs_ops *ops, struct size_class *class,
>> +				gfp_t flags)
>> +{
>> +	int i, error;
>> +	struct page *first_page = NULL, *uninitialized_var(prev_page);
>> +
>> +	/*
>> +	 * Allocate individual pages and link them together as:
>> +	 * 1. first page->private = first sub-page
>> +	 * 2. all sub-pages are linked together using page->lru
>> +	 * 3. each sub-page is linked to the first page using page->first_page
>> +	 *
>> +	 * For each size class, First/Head pages are linked together using
>> +	 * page->lru. Also, we set PG_private to identify the first page
>> +	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
>> +	 * identify the last page.
>> +	 */
>> +	error = -ENOMEM;
>> +	for (i = 0; i < class->pages_per_zspage; i++) {
>> +		struct page *page;
>> +
>> +		page = ops->alloc(flags);
>> +		if (!page)
>> +			goto cleanup;
>> +
>> +		INIT_LIST_HEAD(&page->lru);
>> +		if (i == 0) {	/* first page */
>> +			SetPagePrivate(page);
>> +			set_page_private(page, 0);
>> +			first_page = page;
>> +			first_page->inuse = 0;
>> +		}
>> +		if (i == 1)
>> +			first_page->private = (unsigned long)page;
>> +		if (i >= 1)
>> +			page->first_page = first_page;
>> +		if (i >= 2)
>> +			list_add(&page->lru, &prev_page->lru);
>> +		if (i == class->pages_per_zspage - 1)	/* last page */
>> +			SetPagePrivate2(page);
>> +		prev_page = page;
>> +	}
>> +
>> +	init_zspage(first_page, class);
>> +
>> +	first_page->freelist = obj_location_to_handle(first_page, 0);
>> +	/* Maximum number of objects we can store in this zspage */
>> +	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
>> +
>> +	error = 0; /* Success */
>> +
>> +cleanup:
>> +	if (unlikely(error) && first_page) {
>> +		free_zspage(ops, first_page);
>> +		first_page = NULL;
>> +	}
>> +
>> +	return first_page;
>> +}
>> +
>> +static struct page *find_get_zspage(struct size_class *class)
>> +{
>> +	int i;
>> +	struct page *page;
>> +
>> +	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
>> +		page = class->fullness_list[i];
>> +		if (page)
>> +			break;
>> +	}
>> +
>> +	return page;
>> +}
>> +
>> +#ifdef CONFIG_PGTABLE_MAPPING
>> +static inline int __zs_cpu_up(struct mapping_area *area)
>> +{
>> +	/*
>> +	 * Make sure we don't leak memory if a cpu UP notification
>> +	 * and zs_init() race and both call zs_cpu_up() on the same cpu
>> +	 */
>> +	if (area->vm)
>> +		return 0;
>> +	area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL);
>> +	if (!area->vm)
>> +		return -ENOMEM;
>> +	return 0;
>> +}
>> +
>> +static inline void __zs_cpu_down(struct mapping_area *area)
>> +{
>> +	if (area->vm)
>> +		free_vm_area(area->vm);
>> +	area->vm = NULL;
>> +}
>> +
>> +static inline void *__zs_map_object(struct mapping_area *area,
>> +				struct page *pages[2], int off, int size)
>> +{
>> +	BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages));
>> +	area->vm_addr = area->vm->addr;
>> +	return area->vm_addr + off;
>> +}
>> +
>> +static inline void __zs_unmap_object(struct mapping_area *area,
>> +				struct page *pages[2], int off, int size)
>> +{
>> +	unsigned long addr = (unsigned long)area->vm_addr;
>> +	unsigned long end = addr + (PAGE_SIZE * 2);
>> +
>> +	flush_cache_vunmap(addr, end);
>> +	unmap_kernel_range_noflush(addr, PAGE_SIZE * 2);
>> +	flush_tlb_kernel_range(addr, end);
>> +}
>> +
>> +#else /* CONFIG_PGTABLE_MAPPING*/
>> +
>> +static inline int __zs_cpu_up(struct mapping_area *area)
>> +{
>> +	/*
>> +	 * Make sure we don't leak memory if a cpu UP notification
>> +	 * and zs_init() race and both call zs_cpu_up() on the same cpu
>> +	 */
>> +	if (area->vm_buf)
>> +		return 0;
>> +	area->vm_buf = (char *)__get_free_page(GFP_KERNEL);
>> +	if (!area->vm_buf)
>> +		return -ENOMEM;
>> +	return 0;
>> +}
>> +
>> +static inline void __zs_cpu_down(struct mapping_area *area)
>> +{
>> +	if (area->vm_buf)
>> +		free_page((unsigned long)area->vm_buf);
>> +	area->vm_buf = NULL;
>> +}
>> +
>> +static void *__zs_map_object(struct mapping_area *area,
>> +			struct page *pages[2], int off, int size)
>> +{
>> +	int sizes[2];
>> +	void *addr;
>> +	char *buf = area->vm_buf;
>> +
>> +	/* disable page faults to match kmap_atomic() return conditions */
>> +	pagefault_disable();
>> +
>> +	/* no read fastpath */
>> +	if (area->vm_mm == ZS_MM_WO)
>> +		goto out;
> 
> Current implementation of 'ZS_MM_WO' is not safe.
> For example, think about a situation like as mapping 512 bytes and writing
> only 8 bytes. When we unmap_object, remaining area will be filled with
> dummy bytes, but not original values.

I guess I should comment on the cavet with ZS_MM_WO.  The idea is that
the user is planning to initialize the entire allocation region.  So you
wouldn't use ZS_MM_WO to do a partial write to the region.  You'd have
to use ZS_MM_RW in that case so that the existing region is maintained.

Worthy of a comment. I'd add one.

> 
> If above comments are not important for now, feel free to ignore them. :)

Great comments!  The insight they contain demonstrate that you have a
great understanding of the code, which I find encouraging (i.e. that the
code is not too complex to be understood by reviewers).

Thanks,
Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/