Message-ID: <20250407155409.GB827@cmpxchg.org>
Date: Mon, 7 Apr 2025 11:54:09 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Vitaly Wool <vitaly.wool@...sulko.se>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, Nhat Pham <nphamcs@...il.com>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Igor Belousov <igor.b@...dev.am>
Subject: Re: [PATCH v2] mm: add zblock allocator
On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
> @@ -0,0 +1,24 @@
> +. SPDX-License-Identifier: GPL-2.0
> +
> +======
> +zblock
> +======
> +
> +zblock is a special purpose allocator for storing compressed pages.
> +It stores integer number of compressed objects per its block. These
> +blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> +
> +With zblock, it is possible to densely arrange objects of various sizes
> +resulting in low internal fragmentation. Also this allocator tries to
> +fill incomplete blocks instead of adding new ones, in many cases
> +providing a compression ratio substantially higher than z3fold and zbud
> +(though lower than zmalloc's).
This last part seems to be incorrect based on Igor's measurements.
> +zblock does not require MMU to operate and also is superior to zsmalloc
> +with regard to average performance and worst execution times, thus
> +allowing for better response time and real-time characteristics of the
> +whole system.
Please delete the MMU comment here and in the changelog. As Nhat said,
this is a pointless advantage of a swap backend allocator, as swapping
itself requires an MMU.
Please also add that highmem is not supported.
Please also add that migration is not supported, which has a negative
impact on the kernel's ability to produce higher-order pages. This is
particularly unfortunate because zblock itself wants higher-order pages.
> +E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
> +5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
> +
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d5dfb9186962..19283e6a08eb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
> F: drivers/net/hamradio/*scc.c
> F: drivers/net/hamradio/z8530.h
>
> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
> +M: Vitaly Wool <vitaly.wool@...sulko.se>
> +L: linux-mm@...ck.org
> +S: Maintained
> +F: Documentation/mm/zblock.rst
> +F: mm/zblock.[ch]
> +
> ZBUD COMPRESSED PAGE ALLOCATOR
> M: Seth Jennings <sjenning@...hat.com>
> M: Dan Streetman <ddstreet@...e.org>
It looks like you're using an old tree. Can you please rebase on top
of the mm tree? There have been some significant changes to how zswap
is calling into zpool since the other allocators have been deleted.
> @@ -0,0 +1,418 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * zblock.c
> + *
> + * Author: Vitaly Wool <vitaly.wool@...sulko.se>
> + * Based on the work from Ananda Badmaev <a.badmaev@...cknet.pro>
> + * Copyright (C) 2022-2025, Konsulko AB.
> + *
> + * Zblock is a small object allocator with the intention to serve as a
> + * zpool backend. It operates on page blocks which consist of number
> + * of physical pages being a power of 2 and store integer number of
> + * compressed pages per block which results in determinism and simplicity.
> + *
> + * zblock doesn't export any API and is meant to be used via zpool API.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/atomic.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/preempt.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/zpool.h>
> +#include "zblock.h"
> +
> +static struct rb_root block_desc_tree = RB_ROOT;
> +
> +/* add a new block to the list */
> +static inline void add_block(struct zblock_block *block,
> + struct block_list *block_list)
> +{
> + list_add(&block->link, &block_list->list);
> +}
> +
> +/*
> + * Find a block with at least one free slot and claim it.
> + * We make sure that the first block, if exists, will always work.
> + */
> +static inline struct zblock_block *find_block(struct block_list *block_list)
> +{
> + struct list_head *l = &block_list->list;
> +
> + if (!list_empty(l)) {
> + struct zblock_block *z = list_first_entry(l, typeof(*z), link);
> +
> + if (z->free_slots > 0) {
> + if (--z->free_slots == 0)
> + list_move_tail(&z->link, l);
> + return z;
> + }
> + }
> + return NULL;
> +}
> +
> +/* remove block from the list */
> +static inline void remove_block(struct zblock_block *block)
> +{
> + list_del_init(&block->link);
> +}
> +
> +/* Encodes the handle of a particular slot in the pool using metadata */
> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
> + unsigned int block_type, unsigned int slot)
> +{
> + return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
> +}
> +
> +/* Returns block, block type and slot in the pool corresponding to handle */
> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
> + unsigned int *block_type, unsigned int *slot)
> +{
> + *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
> + *slot = handle & SLOT_MASK;
> + return (struct zblock_block *)(handle & PAGE_MASK);
> +}
> +
> +
> +/*
> + * allocate new block and add it to corresponding block list
> + */
> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
> + int block_type, gfp_t gfp,
> + unsigned long *handle)
> +{
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
This gfp comes from zswap, which currently does:
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
So you might get highmem here that cannot be dereferenced and crash.
__GFP_MOVABLE is also wrong, as you're not implementing migration.
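If you don't want to depend on what zswap passes in, one option would
be to mask those flags off in alloc_block() before the allocation.
Just a sketch (the proper fix may well be on the zswap side):

	/*
	 * Sketch only: zblock can't access highmem blocks and doesn't
	 * implement migration, so don't let callers ask for either.
	 */
	gfp &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);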
> + if (!block)
> + return NULL;
> +
> + block_list = &pool->block_lists[block_type];
> +
> + /* init block data */
> + block->free_slots = block_desc[block_type].slots_per_block - 1;
> + memset(&block->slot_info, 0, sizeof(block->slot_info));
> + set_bit(0, block->slot_info);
> + *handle = metadata_to_handle(block, block_type, 0);
> +
> + spin_lock(&block_list->lock);
> + add_block(block, block_list);
Use list_add() directly here.
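I.e. with the add_block() wrapper dropped, the locked section would
just read:

	spin_lock(&block_list->lock);
	list_add(&block->link, &block_list->list);
	block_list->block_count++;
	spin_unlock(&block_list->lock);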
> + block_list->block_count++;
> + spin_unlock(&block_list->lock);
> + return block;
> +}
> +
> +/*****************
> + * API Functions
> + *****************/
> +/**
> + * zblock_create_pool() - create a new zblock pool
> + * @gfp: gfp flags when allocating the zblock pool structure
> + * @ops: user-defined operations for the zblock pool
> + *
> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
> + * failed.
> + */
> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
> +{
> + struct zblock_pool *pool;
> + struct block_list *block_list;
> + int i;
> +
> + pool = kmalloc(sizeof(struct zblock_pool), gfp);
> + if (!pool)
> + return NULL;
> +
> + /* init each block list */
> + for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
> + block_list = &pool->block_lists[i];
> + spin_lock_init(&block_list->lock);
> + INIT_LIST_HEAD(&block_list->list);
> + block_list->block_count = 0;
> + }
> + return pool;
> +}
> +
> +/**
> + * zblock_destroy_pool() - destroys an existing zblock pool
> + * @pool: the zblock pool to be destroyed
> + *
> + */
> +static void zblock_destroy_pool(struct zblock_pool *pool)
> +{
> + kfree(pool);
> +}
> +
> +
> +/**
> + * zblock_alloc() - allocates a slot of appropriate size
> + * @pool: zblock pool from which to allocate
> + * @size: size in bytes of the desired allocation
> + * @gfp: gfp flags used if the pool needs to grow
> + * @handle: handle of the new allocation
> + *
> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
> + * a new slot.
> + */
> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
> + unsigned long *handle)
> +{
> + int block_type = -1;
> + unsigned int slot;
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + if (!size)
> + return -EINVAL;
> +
> + if (size > PAGE_SIZE)
> + return -ENOSPC;
> +
> + /* find basic block type with suitable slot size */
> + if (size < block_desc[0].slot_size)
> + block_type = 0;
> + else {
> + struct block_desc_node *block_node;
> + struct rb_node *node = block_desc_tree.rb_node;
> +
> + while (node) {
> + block_node = container_of(node, typeof(*block_node), node);
> + if (size < block_node->this_slot_size)
> + node = node->rb_left;
> + else if (size >= block_node->next_slot_size)
> + node = node->rb_right;
> + else {
> + block_type = block_node->block_idx + 1;
> + break;
> + }
> + }
> + }
> + if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
> + return -EINVAL;
> +
> + block_list = &pool->block_lists[block_type];
> +
> + spin_lock(&block_list->lock);
> + block = find_block(block_list);
> + spin_unlock(&block_list->lock);
> + if (block)
> + goto found;
> + /* not found block with free slots try to allocate new empty block */
> + block = alloc_block(pool, block_type, gfp, handle);
> + return block ? 0 : -ENOMEM;
> +
> +found:
> + /* find the first free slot in block */
> + for (slot = find_first_zero_bit(block->slot_info,
> + block_desc[block_type].slots_per_block);
> + slot < block_desc[block_type].slots_per_block;
> + slot = find_next_zero_bit(block->slot_info,
> + block_desc[block_type].slots_per_block,
> + slot)) {
> + if (!test_and_set_bit(slot, block->slot_info))
> + break;
> + barrier();
Ah, so find_block() reserves a slot in block->free_slots. You can race
with another allocation for the exact bit, but there is one bit
guaranteed to remain. Clever. This could use a comment.
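Something along these lines maybe (wording is just a suggestion):

	/*
	 * find_block() has already reserved a slot for us by
	 * decrementing block->free_slots under the list lock. We can
	 * race with other allocations for this particular bit, but at
	 * least one clear bit is guaranteed to remain, so the loop
	 * terminates.
	 */
	if (!test_and_set_bit(slot, block->slot_info))
		break;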
> + }
> + BUG_ON(slot >= block_desc[block_type].slots_per_block);
> + *handle = metadata_to_handle(block, block_type, slot);
> + return 0;
> +}
> +
> +/**
> + * zblock_free() - frees the allocation associated with the given handle
> + * @pool: pool in which the allocation resided
> + * @handle: handle associated with the allocation returned by zblock_alloc()
> + *
> + */
> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
> +{
> + unsigned int slot, block_type;
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + block = handle_to_metadata(handle, &block_type, &slot);
> + block_list = &pool->block_lists[block_type];
> +
> + spin_lock(&block_list->lock);
> + /* if all slots in block are empty delete whole block */
> + if (++block->free_slots == block_desc[block_type].slots_per_block) {
> + block_list->block_count--;
> + remove_block(block);
> + spin_unlock(&block_list->lock);
> + free_pages((unsigned long)block, block_desc[block_type].order);
> + return;
> + }
> + spin_unlock(&block_list->lock);
> +
> + clear_bit(slot, block->slot_info);
The list handling here is unusual. find_block() only checks the first
block in the list. Filling a block puts it at the tail, but partially
freeing a block doesn't rotate it back to the head. AFAICS this can
lead to an ugly edge case where you have an unbounded amount of free
memory trapped in partial blocks, but you keep allocating new blocks:
You start with [new block]. Once full, list_move_tail() is a nop.
You allocate a new one, [partial]->[full].
Keep going, you can have [partial]->[full]->[full]->[full]->[full]
until it's hundreds of gigabytes.
The workload now unmaps randomly, partially draining many blocks. So
you have: [partial]->[full]->[partial]->... with many
gigabytes of free slots in partially used blocks.
Once the first partial block fills up, it rotates to the tail:
[full]->[partial]->...[full]
find_block() will fail. You allocate a new block, fill it, move it to
the tail. The old [full] is back at the head. Rinse, repeat.
You could make find_block() keep going, but it might have to go
through a million [full] blocks before finding a partial one.
This is the reason why other allocators, such as zsmalloc, keep
separate partial and full lists.
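Roughly something like the below - untested sketch, and I'm guessing
at the block_list layout since zblock.h isn't quoted here:

	struct block_list {
		spinlock_t lock;
		struct list_head partial;	/* blocks with free slots */
		struct list_head full;		/* completely filled blocks */
		unsigned long block_count;
	};

	static inline struct zblock_block *find_block(struct block_list *block_list)
	{
		struct zblock_block *z;

		if (list_empty(&block_list->partial))
			return NULL;

		z = list_first_entry(&block_list->partial, typeof(*z), link);
		if (--z->free_slots == 0)
			list_move(&z->link, &block_list->full);
		return z;
	}

	/*
	 * In zblock_free(): a block that just regained its first free
	 * slot becomes allocatable again; the existing check for a
	 * completely empty block (free_slots == slots_per_block, free
	 * the pages) follows after this.
	 */
	block->free_slots++;
	if (block->free_slots == 1)
		list_move(&block->link, &block_list->partial);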