Message-ID: <2cfcfdde-0d87-4195-b5a9-73636a94e89a@konsulko.se>
Date: Mon, 7 Apr 2025 20:26:44 +0200
From: Vitaly Wool <vitaly.wool@...sulko.se>
To: Johannes Weiner <hannes@...xchg.org>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
 linux-kernel@...r.kernel.org, Nhat Pham <nphamcs@...il.com>,
 Shakeel Butt <shakeel.butt@...ux.dev>, Igor Belousov <igor.b@...dev.am>
Subject: Re: [PATCH v2] mm: add zblock allocator

Hi Johannes,

Thanks for the review; some comments follow below.

On 4/7/25 17:54, Johannes Weiner wrote:
> On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
>> @@ -0,0 +1,24 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +======
>> +zblock
>> +======
>> +
>> +zblock is a special purpose allocator for storing compressed pages.
>> +It stores integer number of compressed objects per its block. These
>> +blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
>> +
>> +With zblock, it is possible to densely arrange objects of various sizes
>> +resulting in low internal fragmentation. Also this allocator tries to
>> +fill incomplete blocks instead of adding new ones,  in many cases
>> +providing a compression ratio substantially higher than z3fold and zbud
>> +(though lower than zmalloc's).
> 
> This last part seems to be incorrect based on Igor's measurements.

Well, I surely don't mind deleting that.

>> +zblock does not require MMU to operate and also is superior to zsmalloc
>> +with regard to average performance and worst execution times, thus
>> +allowing for better response time and real-time characteristics of the
>> +whole system.
> 
> Please delete the MMU comment here and in the changelog. As Nhat said,
> this is a pointless advantage of a swap backend allocator, as swapping
> itself requires an MMU.

That's correct indeed, but I still believe zblock may be a good match for 
ZRAM, even better than zsmalloc, and that is where MMU independence 
comes into play.

> Please also add that highmem is not supported.
> 
> Please also add that migration is not supported, which has a negative
> impact on the kernel's ability to produce higher-order pages. This is
> particularly unfortunate because zblock itself wants higher-order pages.

Right. We plan to add migration support in the near future though.

>> +E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
>> +5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
>> +
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d5dfb9186962..19283e6a08eb 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -26556,6 +26556,13 @@ F:	Documentation/networking/device_drivers/hamradio/z8530drv.rst
>>   F:	drivers/net/hamradio/*scc.c
>>   F:	drivers/net/hamradio/z8530.h
>>   
>> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
>> +M:	Vitaly Wool <vitaly.wool@...sulko.se>
>> +L:	linux-mm@...ck.org
>> +S:	Maintained
>> +F:	Documentation/mm/zblock.rst
>> +F:	mm/zblock.[ch]
>> +
>>   ZBUD COMPRESSED PAGE ALLOCATOR
>>   M:	Seth Jennings <sjenning@...hat.com>
>>   M:	Dan Streetman <ddstreet@...e.org>
> 
> It looks like you're using an old tree. Can you please rebase on top
> of the mm tree? There have been some significant changes to how zswap
> is calling into zpool since the other allocators have been deleted.
> 
>> @@ -0,0 +1,418 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * zblock.c
>> + *
>> + * Author: Vitaly Wool <vitaly.wool@...sulko.se>
>> + * Based on the work from Ananda Badmaev <a.badmaev@...cknet.pro>
>> + * Copyright (C) 2022-2025, Konsulko AB.
>> + *
>> + * Zblock is a small object allocator with the intention to serve as a
>> + * zpool backend. It operates on page blocks which consist of number
>> + * of physical pages being a power of 2 and store integer number of
>> + * compressed pages per block which results in determinism and simplicity.
>> + *
>> + * zblock doesn't export any API and is meant to be used via zpool API.
>> + */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/list.h>
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +#include <linux/preempt.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/zpool.h>
>> +#include "zblock.h"
>> +
>> +static struct rb_root block_desc_tree = RB_ROOT;
>> +
>> +/* add a new block to the list */
>> +static inline void add_block(struct zblock_block *block,
>> +				struct block_list *block_list)
>> +{
>> +	list_add(&block->link, &block_list->list);
>> +}
>> +
>> +/*
>> + * Find a block with at least one free slot and claim it.
>> + * We make sure that the first block, if exists, will always work.
>> + */
>> +static inline struct zblock_block *find_block(struct block_list *block_list)
>> +{
>> +	struct list_head *l = &block_list->list;
>> +
>> +	if (!list_empty(l)) {
>> +		struct zblock_block *z = list_first_entry(l, typeof(*z), link);
>> +
>> +		if (z->free_slots > 0) {
>> +			if (--z->free_slots == 0)
>> +				list_move_tail(&z->link, l);
>> +			return z;
>> +		}
>> +	}
>> +	return NULL;
>> +}
>> +
>> +/* remove block from the list */
>> +static inline void remove_block(struct zblock_block *block)
>> +{
>> +	list_del_init(&block->link);
>> +}
>> +
>> +/* Encodes the handle of a particular slot in the pool using metadata */
>> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
>> +				unsigned int block_type, unsigned int slot)
>> +{
>> +	return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
>> +}
>> +
>> +/* Returns block, block type and slot in the pool corresponding to handle */
>> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
>> +				unsigned int *block_type, unsigned int *slot)
>> +{
>> +	*block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
>> +	*slot = handle & SLOT_MASK;
>> +	return (struct zblock_block *)(handle & PAGE_MASK);
>> +}
>> +
>> +
>> +/*
>> + * allocate new block and add it to corresponding block list
>> + */
>> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
>> +					int block_type, gfp_t gfp,
>> +					unsigned long *handle)
>> +{
>> +	struct zblock_block *block;
>> +	struct block_list *block_list;
>> +
>> +	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
> 
> This gfp comes from zswap, which currently does:
> 
>          gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> 
> So you might get highmem here that cannot be dereferenced and crash.
> 
> __GFP_MOVABLE is also wrong, as you're not implementing migration.

Ack.
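
For v3 I'm thinking of stripping the offending flags right in 
alloc_block(), roughly along these lines (just a sketch, the exact 
placement may still change):

	/*
	 * The block is dereferenced directly, so highmem is a no-go, and
	 * migration is not implemented, so don't allow movable pages either.
	 */
	gfp &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);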

>> +	if (!block)
>> +		return NULL;
>> +
>> +	block_list = &pool->block_lists[block_type];
>> +
>> +	/* init block data  */
>> +	block->free_slots = block_desc[block_type].slots_per_block - 1;
>> +	memset(&block->slot_info, 0, sizeof(block->slot_info));
>> +	set_bit(0, block->slot_info);
>> +	*handle = metadata_to_handle(block, block_type, 0);
>> +
>> +	spin_lock(&block_list->lock);
>> +	add_block(block, block_list);
> 
> Use list_add() directly here.

This is probably too small a thing to argue about, but I would rather 
keep add_block() and move the block_count increment into it. Would you agree?
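
Roughly what I mean (a sketch, names unchanged):

	/* add a new block to the list and account for it */
	static inline void add_block(struct zblock_block *block,
					struct block_list *block_list)
	{
		list_add(&block->link, &block_list->list);
		block_list->block_count++;
	}

so that callers only ever touch the list and the counter through 
add_block() (and, symmetrically, remove_block()) under the list lock.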

>> +	block_list->block_count++;
>> +	spin_unlock(&block_list->lock);
>> +	return block;
>> +}
>> +
>> +/*****************
>> + * API Functions
>> + *****************/
>> +/**
>> + * zblock_create_pool() - create a new zblock pool
>> + * @gfp:	gfp flags when allocating the zblock pool structure
>> + * @ops:	user-defined operations for the zblock pool
>> + *
>> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
>> + * failed.
>> + */
>> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
>> +{
>> +	struct zblock_pool *pool;
>> +	struct block_list *block_list;
>> +	int i;
>> +
>> +	pool = kmalloc(sizeof(struct zblock_pool), gfp);
>> +	if (!pool)
>> +		return NULL;
>> +
>> +	/* init each block list */
>> +	for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
>> +		block_list = &pool->block_lists[i];
>> +		spin_lock_init(&block_list->lock);
>> +		INIT_LIST_HEAD(&block_list->list);
>> +		block_list->block_count = 0;
>> +	}
>> +	return pool;
>> +}
>> +
>> +/**
>> + * zblock_destroy_pool() - destroys an existing zblock pool
>> + * @pool:	the zblock pool to be destroyed
>> + *
>> + */
>> +static void zblock_destroy_pool(struct zblock_pool *pool)
>> +{
>> +	kfree(pool);
>> +}
>> +
>> +
>> +/**
>> + * zblock_alloc() - allocates a slot of appropriate size
>> + * @pool:	zblock pool from which to allocate
>> + * @size:	size in bytes of the desired allocation
>> + * @gfp:	gfp flags used if the pool needs to grow
>> + * @handle:	handle of the new allocation
>> + *
>> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
>> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
>> + * a new slot.
>> + */
>> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
>> +			unsigned long *handle)
>> +{
>> +	int block_type = -1;
>> +	unsigned int slot;
>> +	struct zblock_block *block;
>> +	struct block_list *block_list;
>> +
>> +	if (!size)
>> +		return -EINVAL;
>> +
>> +	if (size > PAGE_SIZE)
>> +		return -ENOSPC;
>> +
>> +	/* find basic block type with suitable slot size */
>> +	if (size < block_desc[0].slot_size)
>> +		block_type = 0;
>> +	else {
>> +		struct block_desc_node *block_node;
>> +		struct rb_node *node = block_desc_tree.rb_node;
>> +
>> +		while (node) {
>> +			block_node = container_of(node, typeof(*block_node), node);
>> +			if (size < block_node->this_slot_size)
>> +				node = node->rb_left;
>> +			else if (size >= block_node->next_slot_size)
>> +				node = node->rb_right;
>> +			else {
>> +				block_type = block_node->block_idx + 1;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
>> +		return -EINVAL;
>> +
>> +	block_list = &pool->block_lists[block_type];
>> +
>> +	spin_lock(&block_list->lock);
>> +	block = find_block(block_list);
>> +	spin_unlock(&block_list->lock);
>> +	if (block)
>> +		goto found;
>> +	/* not found block with free slots try to allocate new empty block */
>> +	block = alloc_block(pool, block_type, gfp, handle);
>> +	return block ? 0 : -ENOMEM;
>> +
>> +found:
>> +	/* find the first free slot in block */
>> +	for (slot = find_first_zero_bit(block->slot_info,
>> +					block_desc[block_type].slots_per_block);
>> +	     slot < block_desc[block_type].slots_per_block;
>> +	     slot = find_next_zero_bit(block->slot_info,
>> +					block_desc[block_type].slots_per_block,
>> +					slot)) {
>> +		if (!test_and_set_bit(slot, block->slot_info))
>> +			break;
>> +		barrier();
> 
> Ah, so find_block() reserves a slot in block->free_slots. You can race
> with another allocation for the exact bit, but there is one bit
> guaranteed to remain. Clever. This could use a comment.

Will do.
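
Something along these lines above the slot-scan loop, if that reads OK 
to you:

	/*
	 * find_block() has already reserved a slot for us by decrementing
	 * block->free_slots under the list lock, so even if we race with
	 * another allocation for a particular bit, at least one zero bit
	 * is guaranteed to remain and this loop will terminate.
	 */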

>> +	}
>> +	BUG_ON(slot >= block_desc[block_type].slots_per_block);
>> +	*handle = metadata_to_handle(block, block_type, slot);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * zblock_free() - frees the allocation associated with the given handle
>> + * @pool:	pool in which the allocation resided
>> + * @handle:	handle associated with the allocation returned by zblock_alloc()
>> + *
>> + */
>> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
>> +{
>> +	unsigned int slot, block_type;
>> +	struct zblock_block *block;
>> +	struct block_list *block_list;
>> +
>> +	block = handle_to_metadata(handle, &block_type, &slot);
>> +	block_list = &pool->block_lists[block_type];
>> +
>> +	spin_lock(&block_list->lock);
>> +	/* if all slots in block are empty delete whole block */
>> +	if (++block->free_slots == block_desc[block_type].slots_per_block) {
>> +		block_list->block_count--;
>> +		remove_block(block);
>> +		spin_unlock(&block_list->lock);
>> +		free_pages((unsigned long)block, block_desc[block_type].order);
>> +		return;
>> +	}
>> +	spin_unlock(&block_list->lock);
>> +
>> +	clear_bit(slot, block->slot_info);
> 
> The list handling here is unusual. find_block() only checks the first
> block in the list. Filling a block puts it at the tail, but partially
> freeing a block doesn't rotate it back to the head. AFAICS this can
> lead to an ugly edge case where you have an unbounded amount of free
> memory trapped in partial blocks, but you keep allocating new blocks:
> 
> You start with [new block]. Once full, list_move_tail() is a nop.
> 
> You allocate a new one, [partial]->[full].
> 
> Keep going, you can have [partial]->[full]->[full]->[full]->[full]
> until it's hundreds of gigabytes.
> 
> The workload now unmaps randomly and thus partially draining many
> blocks. So you have: [partial]->[full]->[partial]->... with many
> gigabytes of free slots in partially used blocks.
> 
> Once the first partial block fills up, it rotates to the tail:
> [full]->[partial]->...[full]
> 
> find_block() will fail. You allocate a new block, fill it, move it to
> the tail. The old [full] is back at the head. Rinse, repeat.
> 
> You could make find_block() keep going, but it might have to go
> through a million [full] blocks before finding a partial one.
> 
> This is the reason why other allocator schemes, like zsmalloc e.g.,
> have separate partial and full lists.

The initial variant of zblock_free() had list_move() for that very 
purpose, but we hadn't hit this corner case in our testing so far, so I 
decided to remove it. As an interesting coincidence, I seem to have hit 
it minutes before I received your response, so I've restored it and am 
now checking whether it is enough.
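
For reference, the variant under test looks roughly like this in 
zblock_free() (sketch only, may still change):

	spin_lock(&block_list->lock);
	/* if all slots in block are empty delete whole block */
	if (++block->free_slots == block_desc[block_type].slots_per_block) {
		block_list->block_count--;
		remove_block(block);
		spin_unlock(&block_list->lock);
		free_pages((unsigned long)block, block_desc[block_type].order);
		return;
	}
	/*
	 * The block has a free slot again: rotate it to the head of the
	 * list so that find_block() picks it up instead of allocating
	 * a brand new block.
	 */
	list_move(&block->link, &block_list->list);
	spin_unlock(&block_list->lock);

	clear_bit(slot, block->slot_info);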

~Vitaly
