[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251006141444.GN3360665@nvidia.com>
Date: Mon, 6 Oct 2025 11:14:44 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jason Miu <jasonmiu@...gle.com>
Cc: Alexander Graf <graf@...zon.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
David Matlack <dmatlack@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Mike Rapoport <rppt@...nel.org>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
On Tue, Sep 30, 2025 at 06:19:39PM -0700, Jason Miu wrote:
> @@ -29,7 +30,7 @@
> #include "kexec_internal.h"
>
> #define KHO_FDT_COMPATIBLE "kho-v1"
We don't bump this?
> -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> #define PROP_SUB_FDT "fdt"
I'de really like to see all of these sorts of definitions in some
structured ABI header not open coded all over the place..
> /*
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * | Level 6 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * v
> + * +-------------------+
> + * | Level 5 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * | ... (intermediate levels)
> + * |
> + * v
> + * +-------------------+
> + * | Level 1 | (struct kho_bitmap_table)
> + * +-------------------+
> + *
> + * The following diagram illustrates how the encoded value is split into
> + * indices for the tree levels:
> *
> + * 63:60 59:51 50:42 41:33 32:24 23:15 14:0
> + * +---------+--------+--------+--------+--------+--------+-----------------+
> + * | 0 | Lv 6 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 (bitmap) |
> + * +---------+--------+--------+--------+--------+--------+-----------------+
> *
> + * Each `kho_radix_tree` (Levels 2-6) and `kho_bitmap_table` (Level 1) is
> + * PAGE_SIZE. Each entry in a `kho_radix_tree` is a descriptor (a physical
> + * address) pointing to the next level node. For Level 2 `kho_radix_tree`
> + * nodes, these descriptors point to a `kho_bitmap_table`. The final
> + * `kho_bitmap_table` is a bitmap where each set bit represents a single
> + * preserved page.
Maybe a note that this is example is for PAGE_SIZE=4k.
> */
> +struct kho_radix_tree {
> + unsigned long table[PAGE_SIZE / sizeof(unsigned long)];
This should be phys_addr_t.
> +};
You dropped the macros so now we don't know these are actually
pointers to 'struct kho_radix_tree'
> +/*
> + * `kho_radix_tree_root` points to a page thats serves as the root of the
> + * KHO radix tree. This page is allocated during KHO module initialization.
> + * Its physical address is written to the FDT and passed to the next kernel
> + * during kexec.
> + */
> +static struct kho_radix_tree *kho_radix_tree_root;
> +static DECLARE_RWSEM(kho_radix_tree_root_sem);
> +
> +static int kho_radix_tree_max_depth(void)
> +{
> + int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> + int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> + bitmap_bit_num + order_bit_num,
> + table_bit_num);
All should be unsigned int. Below I suggest to put it in an enum and
use different names.. And since the function is constant it can just
be an enum TOP_LEVEL too.
> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + * <-- Higher Bits ------------------------------------ Lower Bits -->
> + * +--------------------------+-----------------------------------------+
> + * | Order Bits | Page Offset |
> + * +--------------------------+-----------------------------------------+
> + * | ... 0 0 1 0 0 ... | pa >> (PAGE_SHIFT + order) |
> + * +--------------------------+-----------------------------------------+
> + * ^
> + * |
> + * This single '1' bit's position
> + * uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51 without
> + * conflicting with the page offset bits.
> + *
> + * This scheme ensures that the single order bit is always in a higher
> + * position than any bit used by the page offset for that same order,
> + * preventing collisions.
Should explain why it is like this:
This scheme allows storing all the multi-order page sizes in a single
6 level table with a good sharing of lower tables levels for 0 top
address bits. A single algorithm can efficiently process everything.
> + */
> +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
pa is phys_addr_t in the kernel, never unsigned long.
If you want to make it all dynamic then this should be phys_addr_t
> +{
> + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
may not exist.
Use an enum ORDER_0_LG2 maybe
> + unsigned long l = pa >> (PAGE_SHIFT + order);
>
> + return h | l;
> +}
>
> +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)
Returns phys_addr_t
> {
> - void *elm, *res;
> + unsigned long order_bit = fls64(encoded);
unsigned int
> + unsigned long pa;
phys_addr_t
> + *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;
ORDER_0_LG2
> + pa = encoded << (PAGE_SHIFT + *order);
I'd add a comment that the shift always discards order.
> + return pa;
> +}
>
> +static unsigned long kho_radix_get_index(unsigned long encoded, int level)
unsigned int level
> +{
> + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
Stick all the constants that kho_radix_tree_max_depth() are computing
in an enum instead of recomputing them..
> + unsigned long mask;
> + int s;
unsigned for all of these.
> +
> + if (level == 1) {
I think the math is easier if level 0 == bitmap..
> + s = 0;
> + mask = (1UL << bitmap_bit_num) - 1;
> + } else {
> + s = ((level - 2) * table_bit_num) + bitmap_bit_num;
eg here you are doing level-2 which is a bit weird only because of the
arbitary choice to make level=1 be the bitmap.
I'd also use some different names
table_bit_num == TABLE_SIZE_LG2
BITMAP_BIT_NUM = BITMAP_SIZE_LG2
Log2 designates the value is 1<<LG2
> + mask = (1UL << table_bit_num) - 1;
> }
>
> - return elm;
> + return (encoded >> s) & mask;
It is just:
return encoded % (1 << BITMAP_SIZE_LG2);
return (encoded >> s) % (1 << TABLE_SIZE_LG2);
The compiler is smart enough to choose bit logic if that is the
fastest option and the above is more readable.
> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
> {
> + if (!bit_tlb ||
> + offset >= PAGE_SIZE * BITS_PER_BYTE)
> + return -EINVAL;
>
> + set_bit(offset, bit_tlb->bitmaps);
set_bit is an atomic, you want __set_bit()
> + return 0;
> +}
>
> +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)
phys_addr_t
> +{
> + unsigned long encoded = kho_radix_encode(pa, order);
> + int num_tree_level = kho_radix_tree_max_depth();
kho_radix_tree_max_depth() is constant, stick it in an enum with the
rest of them.
> + struct kho_radix_tree *current_tree, *new_tree;
> + struct kho_bitmap_table *bitmap_table;
> + int err = 0;
> + int i, idx;
various unsigned int.
>
> + down_write(&kho_radix_tree_root_sem);
>
> + current_tree = kho_radix_tree_root;
>
> + /* Go from high levels to low levels */
> + for (i = num_tree_level; i >= 1; i--) {
> + idx = kho_radix_get_index(encoded, i);
> +
> + if (i == 1) {
> + bitmap_table = (struct kho_bitmap_table *)current_tree;
> + err = kho_radix_set_bitmap(bitmap_table, idx);
> + goto out;
> + }
> +
> + if (!current_tree->table[idx]) {
> + new_tree = kho_alloc_radix_tree();
> + if (!new_tree) {
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + current_tree->table[idx] =
> + (unsigned long)virt_to_phys(new_tree);
current_tree = new_tree
> + }
else
> +
> + current_tree = (struct kho_radix_tree *)
> + phys_to_virt(current_tree->table[idx]);
> }
> +
> +out:
> + up_write(&kho_radix_tree_root_sem);
> + return err;
> }
>
> +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> + unsigned long offset,
phys_addr_t
> + kho_radix_tree_walk_callback_t cb)
> {
> + unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> + unsigned long *bitmap = (unsigned long *)bit_tlb;
> + int err = 0;
> + int i;
>
> + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> + err = cb(encoded | i);
> + if (err)
> + return err;
> + }
>
> + return 0;
> +}
>
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,
unsigned int
> + unsigned long offset,
phys_addr_t. I would call this start not offset..
> + kho_radix_tree_walk_callback_t cb)
> +{
> + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> + struct kho_radix_tree *next_tree;
> + unsigned long encoded, i;
> + int err = 0;
>
> + if (level == 1) {
> + encoded = offset;
> + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> + encoded, cb);
Better to do this in the caller a few lines below
> + }
>
> + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> + if (root->table[i]) {
> + encoded = offset << level_shift | i;
This doesn't seem right..
The argument to the walker should be the starting encoded of the table
it is about to walk.
Since everything always starts at 0 it should always be
start | (i << level_shift)
?
> + next_tree = (struct kho_radix_tree *)
> + phys_to_virt(root->table[i]);
> + err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
> if (err)
> return err;
> }
> }
>
> + return 0;
> +}
>
> +static int kho_memblock_reserve(phys_addr_t pa, int order)
> +{
> + int sz = 1 << (order + PAGE_SHIFT);
> + struct page *page = phys_to_page(pa);
> +
> + memblock_reserve(pa, sz);
> + memblock_reserved_mark_noinit(pa, sz);
> + page->private = order;
>
> return 0;
> }
>
> +static int kho_radix_walk_trees_callback(unsigned long encoded)
> +{
> + unsigned int order;
> + unsigned long pa;
> +
> + pa = kho_radix_decode(encoded, &order);
> +
> + return kho_memblock_reserve(pa, order);
> +}
> +
> +struct kho_serialization {
> + struct page *fdt;
> + struct list_head fdt_list;
> + struct dentry *sub_fdt_dir;
> +};
> +
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> + unsigned long pa = PFN_PHYS(pfn);
phys_addr_t
Jason
Powered by blists - more mailing lists