Message-ID: <2cfcfdde-0d87-4195-b5a9-73636a94e89a@konsulko.se>
Date: Mon, 7 Apr 2025 20:26:44 +0200
From: Vitaly Wool <vitaly.wool@...sulko.se>
To: Johannes Weiner <hannes@...xchg.org>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, Nhat Pham <nphamcs@...il.com>,
Shakeel Butt <shakeel.butt@...ux.dev>, Igor Belousov <igor.b@...dev.am>
Subject: Re: [PATCH v2] mm: add zblock allocator
Hi Johannes,
thanks for the review; my comments are inline below.
On 4/7/25 17:54, Johannes Weiner wrote:
> On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
>> @@ -0,0 +1,24 @@
>> +. SPDX-License-Identifier: GPL-2.0
>> +
>> +======
>> +zblock
>> +======
>> +
>> +zblock is a special purpose allocator for storing compressed pages.
>> +It stores integer number of compressed objects per its block. These
>> +blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
>> +
>> +With zblock, it is possible to densely arrange objects of various sizes
>> +resulting in low internal fragmentation. Also this allocator tries to
>> +fill incomplete blocks instead of adding new ones, in many cases
>> +providing a compression ratio substantially higher than z3fold and zbud
>> +(though lower than zmalloc's).
>
> This last part seems to be incorrect based on Igor's measurements.
Well, I surely don't mind deleting that.
>> +zblock does not require MMU to operate and also is superior to zsmalloc
>> +with regard to average performance and worst execution times, thus
>> +allowing for better response time and real-time characteristics of the
>> +whole system.
>
> Please delete the MMU comment here and in the changelog. As Nhat said,
> this is a pointless advantage of a swap backend allocator, as swapping
> itself requires an MMU.
That's true indeed, but I still believe zblock may be a good match for
ZRAM, possibly an even better one than zsmalloc, and that is where MMU
independence comes into play.
> Please also add that highmem is not supported.
>
> Please also add that migration is not supported, which has a negative
> impact on the kernel's ability to produce higher-order pages. This is
> particularly unfortunate because zblock itself wants higher-order pages.
Right. We do plan to add migration support in the near future, though.
>> +E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
>> +5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
>> +
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d5dfb9186962..19283e6a08eb 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
>> F: drivers/net/hamradio/*scc.c
>> F: drivers/net/hamradio/z8530.h
>>
>> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
>> +M: Vitaly Wool <vitaly.wool@...sulko.se>
>> +L: linux-mm@...ck.org
>> +S: Maintained
>> +F: Documentation/mm/zblock.rst
>> +F: mm/zblock.[ch]
>> +
>> ZBUD COMPRESSED PAGE ALLOCATOR
>> M: Seth Jennings <sjenning@...hat.com>
>> M: Dan Streetman <ddstreet@...e.org>
>
> It looks like you're using an old tree. Can you please rebase on top
> of the mm tree? There have been some significant changes to how zswap
> is calling into zpool since the other allocators have been deleted.
>
>> @@ -0,0 +1,418 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * zblock.c
>> + *
>> + * Author: Vitaly Wool <vitaly.wool@...sulko.se>
>> + * Based on the work from Ananda Badmaev <a.badmaev@...cknet.pro>
>> + * Copyright (C) 2022-2025, Konsulko AB.
>> + *
>> + * Zblock is a small object allocator with the intention to serve as a
>> + * zpool backend. It operates on page blocks which consist of number
>> + * of physical pages being a power of 2 and store integer number of
>> + * compressed pages per block which results in determinism and simplicity.
>> + *
>> + * zblock doesn't export any API and is meant to be used via zpool API.
>> + */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/list.h>
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +#include <linux/preempt.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/zpool.h>
>> +#include "zblock.h"
>> +
>> +static struct rb_root block_desc_tree = RB_ROOT;
>> +
>> +/* add a new block to the list */
>> +static inline void add_block(struct zblock_block *block,
>> + struct block_list *block_list)
>> +{
>> + list_add(&block->link, &block_list->list);
>> +}
>> +
>> +/*
>> + * Find a block with at least one free slot and claim it.
>> + * We make sure that the first block, if exists, will always work.
>> + */
>> +static inline struct zblock_block *find_block(struct block_list *block_list)
>> +{
>> + struct list_head *l = &block_list->list;
>> +
>> + if (!list_empty(l)) {
>> + struct zblock_block *z = list_first_entry(l, typeof(*z), link);
>> +
>> + if (z->free_slots > 0) {
>> + if (--z->free_slots == 0)
>> + list_move_tail(&z->link, l);
>> + return z;
>> + }
>> + }
>> + return NULL;
>> +}
>> +
>> +/* remove block from the list */
>> +static inline void remove_block(struct zblock_block *block)
>> +{
>> + list_del_init(&block->link);
>> +}
>> +
>> +/* Encodes the handle of a particular slot in the pool using metadata */
>> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
>> + unsigned int block_type, unsigned int slot)
>> +{
>> + return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
>> +}
>> +
>> +/* Returns block, block type and slot in the pool corresponding to handle */
>> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
>> + unsigned int *block_type, unsigned int *slot)
>> +{
>> + *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
>> + *slot = handle & SLOT_MASK;
>> + return (struct zblock_block *)(handle & PAGE_MASK);
>> +}
>> +
>> +
>> +/*
>> + * allocate new block and add it to corresponding block list
>> + */
>> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
>> + int block_type, gfp_t gfp,
>> + unsigned long *handle)
>> +{
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
>
> This gfp comes from zswap, which currently does:
>
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
>
> So you might get highmem here that cannot be dereferenced and crash.
>
> __GFP_MOVABLE is also wrong, as you're not implementing migration.
Ack.
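What I have in mind is something along these lines (just a sketch, the
helper name is made up and the final form may differ once I rebase on
the mm tree):

/*
 * Sketch only: drop the flags zblock cannot honor before calling into
 * the page allocator. Blocks are dereferenced directly, so a highmem
 * page would crash, and __GFP_MOVABLE is wrong for as long as
 * migration is not implemented.
 */
static inline gfp_t zblock_sanitize_gfp(gfp_t gfp)
{
	return gfp & ~(__GFP_HIGHMEM | __GFP_MOVABLE);
}

with alloc_block() then calling
__get_free_pages(zblock_sanitize_gfp(gfp), block_desc[block_type].order).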
>> + if (!block)
>> + return NULL;
>> +
>> + block_list = &pool->block_lists[block_type];
>> +
>> + /* init block data */
>> + block->free_slots = block_desc[block_type].slots_per_block - 1;
>> + memset(&block->slot_info, 0, sizeof(block->slot_info));
>> + set_bit(0, block->slot_info);
>> + *handle = metadata_to_handle(block, block_type, 0);
>> +
>> + spin_lock(&block_list->lock);
>> + add_block(block, block_list);
>
> Use list_add() directly here.
This is probably too small a thing to argue about, but I would rather
keep add_block() and move the block_count increment into it. Would you
agree?
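Roughly what I mean, as a sketch against the current naming:

/* add a new block to the list and account for it */
static inline void add_block(struct zblock_block *block,
			     struct block_list *block_list)
{
	list_add(&block->link, &block_list->list);
	block_list->block_count++;
}

so that alloc_block() only takes the list lock, calls add_block() and
unlocks.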
>> + block_list->block_count++;
>> + spin_unlock(&block_list->lock);
>> + return block;
>> +}
>> +
>> +/*****************
>> + * API Functions
>> + *****************/
>> +/**
>> + * zblock_create_pool() - create a new zblock pool
>> + * @gfp: gfp flags when allocating the zblock pool structure
>> + * @ops: user-defined operations for the zblock pool
>> + *
>> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
>> + * failed.
>> + */
>> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
>> +{
>> + struct zblock_pool *pool;
>> + struct block_list *block_list;
>> + int i;
>> +
>> + pool = kmalloc(sizeof(struct zblock_pool), gfp);
>> + if (!pool)
>> + return NULL;
>> +
>> + /* init each block list */
>> + for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
>> + block_list = &pool->block_lists[i];
>> + spin_lock_init(&block_list->lock);
>> + INIT_LIST_HEAD(&block_list->list);
>> + block_list->block_count = 0;
>> + }
>> + return pool;
>> +}
>> +
>> +/**
>> + * zblock_destroy_pool() - destroys an existing zblock pool
>> + * @pool: the zblock pool to be destroyed
>> + *
>> + */
>> +static void zblock_destroy_pool(struct zblock_pool *pool)
>> +{
>> + kfree(pool);
>> +}
>> +
>> +
>> +/**
>> + * zblock_alloc() - allocates a slot of appropriate size
>> + * @pool: zblock pool from which to allocate
>> + * @size: size in bytes of the desired allocation
>> + * @gfp: gfp flags used if the pool needs to grow
>> + * @handle: handle of the new allocation
>> + *
>> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
>> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
>> + * a new slot.
>> + */
>> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
>> + unsigned long *handle)
>> +{
>> + int block_type = -1;
>> + unsigned int slot;
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + if (!size)
>> + return -EINVAL;
>> +
>> + if (size > PAGE_SIZE)
>> + return -ENOSPC;
>> +
>> + /* find basic block type with suitable slot size */
>> + if (size < block_desc[0].slot_size)
>> + block_type = 0;
>> + else {
>> + struct block_desc_node *block_node;
>> + struct rb_node *node = block_desc_tree.rb_node;
>> +
>> + while (node) {
>> + block_node = container_of(node, typeof(*block_node), node);
>> + if (size < block_node->this_slot_size)
>> + node = node->rb_left;
>> + else if (size >= block_node->next_slot_size)
>> + node = node->rb_right;
>> + else {
>> + block_type = block_node->block_idx + 1;
>> + break;
>> + }
>> + }
>> + }
>> + if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
>> + return -EINVAL;
>> +
>> + block_list = &pool->block_lists[block_type];
>> +
>> + spin_lock(&block_list->lock);
>> + block = find_block(block_list);
>> + spin_unlock(&block_list->lock);
>> + if (block)
>> + goto found;
>> + /* not found block with free slots try to allocate new empty block */
>> + block = alloc_block(pool, block_type, gfp, handle);
>> + return block ? 0 : -ENOMEM;
>> +
>> +found:
>> + /* find the first free slot in block */
>> + for (slot = find_first_zero_bit(block->slot_info,
>> + block_desc[block_type].slots_per_block);
>> + slot < block_desc[block_type].slots_per_block;
>> + slot = find_next_zero_bit(block->slot_info,
>> + block_desc[block_type].slots_per_block,
>> + slot)) {
>> + if (!test_and_set_bit(slot, block->slot_info))
>> + break;
>> + barrier();
>
> Ah, so find_block() reserves a slot in block->free_slots. You can race
> with another allocation for the exact bit, but there is one bit
> guaranteed to remain. Clever. This could use a comment.
Will do.
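Something along these lines, to go right above the slot search loop
(wording to be polished):

/*
 * find_block() has already reserved a slot for us by decrementing
 * block->free_slots under the list lock. We may still race with a
 * concurrent zblock_alloc() for a particular bit, but since each
 * reservation corresponds to exactly one zero bit, at least one bit
 * is guaranteed to remain free for us and this loop will terminate.
 */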
>> + }
>> + BUG_ON(slot >= block_desc[block_type].slots_per_block);
>> + *handle = metadata_to_handle(block, block_type, slot);
>> + return 0;
>> +}
>> +
>> +/**
>> + * zblock_free() - frees the allocation associated with the given handle
>> + * @pool: pool in which the allocation resided
>> + * @handle: handle associated with the allocation returned by zblock_alloc()
>> + *
>> + */
>> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
>> +{
>> + unsigned int slot, block_type;
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + block = handle_to_metadata(handle, &block_type, &slot);
>> + block_list = &pool->block_lists[block_type];
>> +
>> + spin_lock(&block_list->lock);
>> + /* if all slots in block are empty delete whole block */
>> + if (++block->free_slots == block_desc[block_type].slots_per_block) {
>> + block_list->block_count--;
>> + remove_block(block);
>> + spin_unlock(&block_list->lock);
>> + free_pages((unsigned long)block, block_desc[block_type].order);
>> + return;
>> + }
>> + spin_unlock(&block_list->lock);
>> +
>> + clear_bit(slot, block->slot_info);
>
> The list handling here is unusual. find_block() only checks the first
> block in the list. Filling a block puts it at the tail, but partially
> freeing a block doesn't rotate it back to the head. AFAICS this can
> lead to an ugly edge case where you have an unbounded amount of free
> memory trapped in partial blocks, but you keep allocating new blocks:
>
> You start with [new block]. Once full, list_move_tail() is a nop.
>
> You allocate a new one, [partial]->[full].
>
> Keep going, you can have [partial]->[full]->[full]->[full]->[full]
> until it's hundreds of gigabytes.
>
> The workload now unmaps randomly and thus partially draining many
> blocks. So you have: [partial]->[full]->[partial]->... with many
> gigabytes of free slots in partially used blocks.
>
> Once the first partial block fills up, it rotates to the tail:
> [full]->[partial]->...[full]
>
> find_block() will fail. You allocate a new block, fill it, move it to
> the tail. The old [full] is back at the head. Rinse, repeat.
>
> You could make find_block() keep going, but it might have to go
> through a million [full] blocks before finding a partial one.
>
> This is the reason why other allocator schemes, like zsmalloc e.g.,
> have separate partial and full lists.
The initial variant of zblock_free() had a list_move() for that very
purpose, but since we hadn't hit this corner case in our testing so
far, I decided to remove it. By an interesting coincidence, I seem to
have hit it just minutes before I received your response, so I have
restored the list_move() and am now checking whether that is enough.
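For reference, this is roughly how the tail of zblock_free() looks with
the list_move() restored (a sketch; whether it is sufficient for the
scenario you describe is exactly what I'm checking now):

	spin_lock(&block_list->lock);
	/* if all slots in block are empty delete whole block */
	if (++block->free_slots == block_desc[block_type].slots_per_block) {
		block_list->block_count--;
		remove_block(block);
		spin_unlock(&block_list->lock);
		free_pages((unsigned long)block, block_desc[block_type].order);
		return;
	}
	/* block is partial again, rotate it to the head for find_block() */
	list_move(&block->link, &block_list->list);
	spin_unlock(&block_list->lock);

	clear_bit(slot, block->slot_info);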
~Vitaly