linux-kernel - Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251006141444.GN3360665@nvidia.com>
Date: Mon, 6 Oct 2025 11:14:44 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jason Miu <jasonmiu@...gle.com>
Cc: Alexander Graf <graf@...zon.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
	David Matlack <dmatlack@...gle.com>,
	David Rientjes <rientjes@...gle.com>,
	Mike Rapoport <rppt@...nel.org>,
	Pasha Tatashin <pasha.tatashin@...een.com>,
	Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures

On Tue, Sep 30, 2025 at 06:19:39PM -0700, Jason Miu wrote:
> @@ -29,7 +30,7 @@
>  #include "kexec_internal.h"
>  
>  #define KHO_FDT_COMPATIBLE "kho-v1"

We don't bump this?

> -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
>  #define PROP_SUB_FDT "fdt"

I'de really like to see all of these sorts of definitions in some
structured ABI header not open coded all over the place..

>  /*
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * |     Level 6       | (struct kho_radix_tree)
> + * +-------------------+
> + *   |
> + *   v
> + * +-------------------+
> + * |     Level 5       | (struct kho_radix_tree)
> + * +-------------------+
> + *   |
> + *   | ... (intermediate levels)
> + *   |
> + *   v
> + * +-------------------+
> + * |      Level 1      | (struct kho_bitmap_table)
> + * +-------------------+
> + *
> + * The following diagram illustrates how the encoded value is split into
> + * indices for the tree levels:
>   *
> + *      63:60   59:51    50:42    41:33    32:24    23:15         14:0
> + * +---------+--------+--------+--------+--------+--------+-----------------+
> + * |    0    |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
> + * +---------+--------+--------+--------+--------+--------+-----------------+
>   *
> + * Each `kho_radix_tree` (Levels 2-6) and `kho_bitmap_table` (Level 1) is
> + * PAGE_SIZE. Each entry in a `kho_radix_tree` is a descriptor (a physical
> + * address) pointing to the next level node. For Level 2 `kho_radix_tree`
> + * nodes, these descriptors point to a `kho_bitmap_table`. The final
> + * `kho_bitmap_table` is a bitmap where each set bit represents a single
> + * preserved page.

Maybe a note that this is example is for PAGE_SIZE=4k.


>   */
> +struct kho_radix_tree {
> +	unsigned long table[PAGE_SIZE / sizeof(unsigned long)];

This should be phys_addr_t.

> +};

You dropped the macros so now we don't know these are actually
pointers to 'struct kho_radix_tree'

> +/*
> + * `kho_radix_tree_root` points to a page thats serves as the root of the
> + * KHO radix tree. This page is allocated during KHO module initialization.
> + * Its physical address is written to the FDT and passed to the next kernel
> + * during kexec.
> + */
> +static struct kho_radix_tree *kho_radix_tree_root;
> +static DECLARE_RWSEM(kho_radix_tree_root_sem);
> +
> +static int kho_radix_tree_max_depth(void)
> +{
> +	int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> +	int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> +	int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> +	int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> +	int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> +					   bitmap_bit_num + order_bit_num,
> +					   table_bit_num);

All should be unsigned int. Below I suggest to put it in an enum and
use different names.. And since the function is constant it can just
be an enum TOP_LEVEL too.

> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + *   <-- Higher Bits ------------------------------------ Lower Bits -->
> + *  +--------------------------+-----------------------------------------+
> + *  |        Order Bits        |               Page Offset               |
> + *  +--------------------------+-----------------------------------------+
> + *  | ... 0 0 1 0 0 ...        | pa >> (PAGE_SHIFT + order)              |
> + *  +--------------------------+-----------------------------------------+
> + *            ^
> + *            |
> + *  This single '1' bit's position
> + *  uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51 without
> + * conflicting with the page offset bits.
> + *
> + * This scheme ensures that the single order bit is always in a higher
> + * position than any bit used by the page offset for that same order,
> + * preventing collisions.

Should explain why it is like this:

This scheme allows storing all the multi-order page sizes in a single
6 level table with a good sharing of lower tables levels for 0 top
address bits. A single algorithm can efficiently process everything.

> + */
> +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)

pa is phys_addr_t in the kernel, never unsigned long.

If you want to make it all dynamic then this should be phys_addr_t

> +{
> +	unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);

And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
may not exist.

Use an enum ORDER_0_LG2 maybe

> +	unsigned long l = pa >> (PAGE_SHIFT + order);
>  
> +	return h | l;
> +}
>  
> +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)

Returns phys_addr_t

>  {
> -	void *elm, *res;
> +	unsigned long order_bit = fls64(encoded);

unsigned int

> +	unsigned long pa;

phys_addr_t

> +	*order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;

ORDER_0_LG2

> +	pa = encoded << (PAGE_SHIFT + *order);

I'd add a comment that the shift always discards order.

> +	return pa;
> +}
>  
> +static unsigned long kho_radix_get_index(unsigned long encoded, int level)

unsigned int level

> +{
> +	int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> +	int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);

Stick all the constants that kho_radix_tree_max_depth() are computing
in an enum instead of recomputing them..

> +	unsigned long mask;
> +	int s;

unsigned for all of these.

> +
> +	if (level == 1) {

I think the math is easier if level 0 == bitmap..

> +		s = 0;
> +		mask = (1UL << bitmap_bit_num) - 1;
> +	} else {
> +		s = ((level - 2) * table_bit_num) + bitmap_bit_num;

eg here you are doing level-2 which is a bit weird only because of the
arbitary choice to make level=1 be the bitmap.

I'd also use some different names

table_bit_num == TABLE_SIZE_LG2
BITMAP_BIT_NUM = BITMAP_SIZE_LG2

Log2 designates the value is 1<<LG2

> +		mask = (1UL << table_bit_num) - 1;
>  	}
>  
> -	return elm;
> +	return (encoded >> s) & mask;

It is just:

return encoded % (1 << BITMAP_SIZE_LG2);
return (encoded >> s) % (1 << TABLE_SIZE_LG2);

The compiler is smart enough to choose bit logic if that is the
fastest option and the above is more readable.

> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
>  {
> +	if (!bit_tlb ||
> +	    offset >= PAGE_SIZE * BITS_PER_BYTE)
> +		return -EINVAL;
>  
> +	set_bit(offset, bit_tlb->bitmaps);

set_bit is an atomic, you want __set_bit()

> +	return 0;
> +}
>  
> +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)

phys_addr_t

> +{
> +	unsigned long encoded = kho_radix_encode(pa, order);
> +	int num_tree_level = kho_radix_tree_max_depth();

kho_radix_tree_max_depth() is constant, stick it in an enum with the
rest of them.

> +	struct kho_radix_tree *current_tree, *new_tree;
> +	struct kho_bitmap_table *bitmap_table;
> +	int err = 0;
> +	int i, idx;

various unsigned int.

>  
> +	down_write(&kho_radix_tree_root_sem);
>  
> +	current_tree = kho_radix_tree_root;
>  
> +	/* Go from high levels to low levels */
> +	for (i = num_tree_level; i >= 1; i--) {
> +		idx = kho_radix_get_index(encoded, i);
> +
> +		if (i == 1) {
> +			bitmap_table = (struct kho_bitmap_table *)current_tree;
> +			err = kho_radix_set_bitmap(bitmap_table, idx);
> +			goto out;
> +		}
> +
> +		if (!current_tree->table[idx]) {
> +			new_tree = kho_alloc_radix_tree();
> +			if (!new_tree) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +
> +			current_tree->table[idx] =
> +				(unsigned long)virt_to_phys(new_tree);

current_tree = new_tree
> +		}

else

> +
> +		current_tree = (struct kho_radix_tree *)
> +			phys_to_virt(current_tree->table[idx]);
>  	}
> +
> +out:
> +	up_write(&kho_radix_tree_root_sem);
> +	return err;
>  }
>  
> +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> +				  unsigned long offset,

phys_addr_t

> +				  kho_radix_tree_walk_callback_t cb)
>  {
> +	unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> +	unsigned long *bitmap = (unsigned long *)bit_tlb;
> +	int err = 0;
> +	int i;
>  
> +	for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> +		err = cb(encoded | i);
> +		if (err)
> +			return err;
> +	}
>  
> +	return 0;
> +}
>  
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,

unsigned int

> +				unsigned long offset,

phys_addr_t.  I would call this start not offset..

> +				kho_radix_tree_walk_callback_t cb)
> +{
> +	int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> +	struct kho_radix_tree *next_tree;
> +	unsigned long encoded, i;
> +	int err = 0;
>  
> +	if (level == 1) {
> +		encoded = offset;
> +		return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> +					      encoded, cb);

Better to do this in the caller  a few lines below

> +	}

>  
> +	for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> +		if (root->table[i]) {
> +			encoded = offset << level_shift | i;

This doesn't seem right..

The argument to the walker should be the starting encoded of the table
it is about to walk.

Since everything always starts at 0 it should always be
  start | (i << level_shift)

?

> +			next_tree = (struct kho_radix_tree *)
> +				phys_to_virt(root->table[i]);
> +			err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
>  			if (err)
>  				return err;
>  		}
>  	}
>  
> +	return 0;
> +}
>  
> +static int kho_memblock_reserve(phys_addr_t pa, int order)
> +{
> +	int sz = 1 << (order + PAGE_SHIFT);
> +	struct page *page = phys_to_page(pa);
> +
> +	memblock_reserve(pa, sz);
> +	memblock_reserved_mark_noinit(pa, sz);
> +	page->private = order;
>  
>  	return 0;
>  }
>  
> +static int kho_radix_walk_trees_callback(unsigned long encoded)
> +{
> +	unsigned int order;
> +	unsigned long pa;
> +
> +	pa = kho_radix_decode(encoded, &order);
> +
> +	return kho_memblock_reserve(pa, order);
> +}
> +
> +struct kho_serialization {
> +	struct page *fdt;
> +	struct list_head fdt_list;
> +	struct dentry *sub_fdt_dir;
> +};
> +
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> +	unsigned long pa = PFN_PHYS(pfn);

phys_addr_t

Jason