lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CA+CK2bBaQEVySFcNmyQnWayZ7K=z5FxU7wTyNtmO7ja7NNGo3Q@mail.gmail.com>
Date: Mon, 6 Oct 2025 13:26:57 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Jason Miu <jasonmiu@...gle.com>, Alexander Graf <graf@...zon.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Baoquan He <bhe@...hat.com>, 
	Changyuan Lyu <changyuanl@...gle.com>, David Matlack <dmatlack@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Mike Rapoport <rppt@...nel.org>, 
	Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures

On Mon, Oct 6, 2025 at 10:14 AM Jason Gunthorpe <jgg@...dia.com> wrote:
>
> On Tue, Sep 30, 2025 at 06:19:39PM -0700, Jason Miu wrote:
> > @@ -29,7 +30,7 @@
> >  #include "kexec_internal.h"
> >
> >  #define KHO_FDT_COMPATIBLE "kho-v1"
>
> We don't bump this?
>
> > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> >  #define PROP_SUB_FDT "fdt"
>
> I'de really like to see all of these sorts of definitions in some
> structured ABI header not open coded all over the place..
>
> >  /*
> > + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> > + * structure that starts with a single root `kho_radix_tree`. This single
> > + * tree stores pages of all orders.
> > + *
> > + * This is achieved by encoding the page's physical address and its order into
> > + * a single `unsigned long` value. This encoded value is then used to traverse
> > + * the tree.
> > + *
> > + * The tree hierarchy is shown below:
> > + *
> > + * kho_radix_tree_root
> > + * +-------------------+
> > + * |     Level 6       | (struct kho_radix_tree)
> > + * +-------------------+
> > + *   |
> > + *   v
> > + * +-------------------+
> > + * |     Level 5       | (struct kho_radix_tree)
> > + * +-------------------+
> > + *   |
> > + *   | ... (intermediate levels)
> > + *   |
> > + *   v
> > + * +-------------------+
> > + * |      Level 1      | (struct kho_bitmap_table)
> > + * +-------------------+
> > + *
> > + * The following diagram illustrates how the encoded value is split into
> > + * indices for the tree levels:
> >   *
> > + *      63:60   59:51    50:42    41:33    32:24    23:15         14:0
> > + * +---------+--------+--------+--------+--------+--------+-----------------+
> > + * |    0    |  Lv 6  |  Lv 5  |  Lv 4  |  Lv 3  |  Lv 2  |  Lv 1 (bitmap)  |
> > + * +---------+--------+--------+--------+--------+--------+-----------------+
> >   *
> > + * Each `kho_radix_tree` (Levels 2-6) and `kho_bitmap_table` (Level 1) is
> > + * PAGE_SIZE. Each entry in a `kho_radix_tree` is a descriptor (a physical
> > + * address) pointing to the next level node. For Level 2 `kho_radix_tree`
> > + * nodes, these descriptors point to a `kho_bitmap_table`. The final
> > + * `kho_bitmap_table` is a bitmap where each set bit represents a single
> > + * preserved page.
>
> Maybe a note that this is example is for PAGE_SIZE=4k.
>
>
> >   */
> > +struct kho_radix_tree {
> > +     unsigned long table[PAGE_SIZE / sizeof(unsigned long)];
>
> This should be phys_addr_t.

Maybe u64 ? This is a preserved data, I would specify the size, and
not care about 32-bit arches. Also, if we ever have to support larger
physical spaces, this radix tree version would need to be bumped
anyway.

>
> > +};
>
> You dropped the macros so now we don't know these are actually
> pointers to 'struct kho_radix_tree'
>
> > +/*
> > + * `kho_radix_tree_root` points to a page thats serves as the root of the
> > + * KHO radix tree. This page is allocated during KHO module initialization.
> > + * Its physical address is written to the FDT and passed to the next kernel
> > + * during kexec.
> > + */
> > +static struct kho_radix_tree *kho_radix_tree_root;
> > +static DECLARE_RWSEM(kho_radix_tree_root_sem);
> > +
> > +static int kho_radix_tree_max_depth(void)
> > +{
> > +     int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> > +     int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> > +     int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> > +     int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> > +                                        bitmap_bit_num + order_bit_num,
> > +                                        table_bit_num);
>
> All should be unsigned int. Below I suggest to put it in an enum and
> use different names.. And since the function is constant it can just
> be an enum TOP_LEVEL too.
>
> > +/*
> > + * The KHO radix tree tracks preserved pages by encoding a page's physical
> > + * address (pa) and its order into a single unsigned long value. This value
> > + * is then used to traverse the tree. The encoded value is composed of two
> > + * parts: the 'order bits' in the upper part and the 'page offset' in the
> > + * lower part.
> > + *
> > + *   <-- Higher Bits ------------------------------------ Lower Bits -->
> > + *  +--------------------------+-----------------------------------------+
> > + *  |        Order Bits        |               Page Offset               |
> > + *  +--------------------------+-----------------------------------------+
> > + *  | ... 0 0 1 0 0 ...        | pa >> (PAGE_SHIFT + order)              |
> > + *  +--------------------------+-----------------------------------------+
> > + *            ^
> > + *            |
> > + *  This single '1' bit's position
> > + *  uniquely identifies the 'order'.
> > + *
> > + *
> > + * Page Offset:
> > + * The 'page offset' is the physical address normalized for its order. It
> > + * effectively represents the page offset for the given order.
> > + *
> > + * Order Bits:
> > + * The 'order bits' encode the page order by setting a single bit at a
> > + * specific position. The position of this bit itself represents the order.
> > + *
> > + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> > + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> > + * offset occupies bits [0-51]. For order 0, the order bit is set at
> > + * position 52.
> > + *
> > + * As the order increases, the number of bits required for the 'page offset'
> > + * decreases. For example, order 1 requires one less bit for its page
> > + * offset. This allows its order bit to be set at position 51 without
> > + * conflicting with the page offset bits.
> > + *
> > + * This scheme ensures that the single order bit is always in a higher
> > + * position than any bit used by the page offset for that same order,
> > + * preventing collisions.
>
> Should explain why it is like this:
>
> This scheme allows storing all the multi-order page sizes in a single
> 6 level table with a good sharing of lower tables levels for 0 top
> address bits. A single algorithm can efficiently process everything.
>
> > + */
> > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
>
> pa is phys_addr_t in the kernel, never unsigned long.
>
> If you want to make it all dynamic then this should be phys_addr_t
>
> > +{
> > +     unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
>
> And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> may not exist.
>
> Use an enum ORDER_0_LG2 maybe
>
> > +     unsigned long l = pa >> (PAGE_SHIFT + order);
> >
> > +     return h | l;
> > +}
> >
> > +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)
>
> Returns phys_addr_t
>
> >  {
> > -     void *elm, *res;
> > +     unsigned long order_bit = fls64(encoded);
>
> unsigned int
>
> > +     unsigned long pa;
>
> phys_addr_t
>
> > +     *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;
>
> ORDER_0_LG2
>
> > +     pa = encoded << (PAGE_SHIFT + *order);
>
> I'd add a comment that the shift always discards order.
>
> > +     return pa;
> > +}
> >
> > +static unsigned long kho_radix_get_index(unsigned long encoded, int level)
>
> unsigned int level
>
> > +{
> > +     int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
>
> Stick all the constants that kho_radix_tree_max_depth() are computing
> in an enum instead of recomputing them..
>
> > +     unsigned long mask;
> > +     int s;
>
> unsigned for all of these.
>
> > +
> > +     if (level == 1) {
>
> I think the math is easier if level 0 == bitmap..
>
> > +             s = 0;
> > +             mask = (1UL << bitmap_bit_num) - 1;
> > +     } else {
> > +             s = ((level - 2) * table_bit_num) + bitmap_bit_num;
>
> eg here you are doing level-2 which is a bit weird only because of the
> arbitary choice to make level=1 be the bitmap.
>
> I'd also use some different names
>
> table_bit_num == TABLE_SIZE_LG2
> BITMAP_BIT_NUM = BITMAP_SIZE_LG2
>
> Log2 designates the value is 1<<LG2
>
> > +             mask = (1UL << table_bit_num) - 1;
> >       }
> >
> > -     return elm;
> > +     return (encoded >> s) & mask;
>
> It is just:
>
> return encoded % (1 << BITMAP_SIZE_LG2);
> return (encoded >> s) % (1 << TABLE_SIZE_LG2);
>
> The compiler is smart enough to choose bit logic if that is the
> fastest option and the above is more readable.
>
> > +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
> >  {
> > +     if (!bit_tlb ||
> > +         offset >= PAGE_SIZE * BITS_PER_BYTE)
> > +             return -EINVAL;
> >
> > +     set_bit(offset, bit_tlb->bitmaps);
>
> set_bit is an atomic, you want __set_bit()
>
> > +     return 0;
> > +}
> >
> > +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)
>
> phys_addr_t
>
> > +{
> > +     unsigned long encoded = kho_radix_encode(pa, order);
> > +     int num_tree_level = kho_radix_tree_max_depth();
>
> kho_radix_tree_max_depth() is constant, stick it in an enum with the
> rest of them.
>
> > +     struct kho_radix_tree *current_tree, *new_tree;
> > +     struct kho_bitmap_table *bitmap_table;
> > +     int err = 0;
> > +     int i, idx;
>
> various unsigned int.
>
> >
> > +     down_write(&kho_radix_tree_root_sem);
> >
> > +     current_tree = kho_radix_tree_root;
> >
> > +     /* Go from high levels to low levels */
> > +     for (i = num_tree_level; i >= 1; i--) {
> > +             idx = kho_radix_get_index(encoded, i);
> > +
> > +             if (i == 1) {
> > +                     bitmap_table = (struct kho_bitmap_table *)current_tree;
> > +                     err = kho_radix_set_bitmap(bitmap_table, idx);
> > +                     goto out;
> > +             }
> > +
> > +             if (!current_tree->table[idx]) {
> > +                     new_tree = kho_alloc_radix_tree();
> > +                     if (!new_tree) {
> > +                             err = -ENOMEM;
> > +                             goto out;
> > +                     }
> > +
> > +                     current_tree->table[idx] =
> > +                             (unsigned long)virt_to_phys(new_tree);
>
> current_tree = new_tree
> > +             }
>
> else
>
> > +
> > +             current_tree = (struct kho_radix_tree *)
> > +                     phys_to_virt(current_tree->table[idx]);
> >       }
> > +
> > +out:
> > +     up_write(&kho_radix_tree_root_sem);
> > +     return err;
> >  }
> >
> > +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> > +                               unsigned long offset,
>
> phys_addr_t
>
> > +                               kho_radix_tree_walk_callback_t cb)
> >  {
> > +     unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> > +     unsigned long *bitmap = (unsigned long *)bit_tlb;
> > +     int err = 0;
> > +     int i;
> >
> > +     for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> > +             err = cb(encoded | i);
> > +             if (err)
> > +                     return err;
> > +     }
> >
> > +     return 0;
> > +}
> >
> > +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,
>
> unsigned int
>
> > +                             unsigned long offset,
>
> phys_addr_t.  I would call this start not offset..
>
> > +                             kho_radix_tree_walk_callback_t cb)
> > +{
> > +     int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     struct kho_radix_tree *next_tree;
> > +     unsigned long encoded, i;
> > +     int err = 0;
> >
> > +     if (level == 1) {
> > +             encoded = offset;
> > +             return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > +                                           encoded, cb);
>
> Better to do this in the caller  a few lines below
>
> > +     }
>
> >
> > +     for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > +             if (root->table[i]) {
> > +                     encoded = offset << level_shift | i;
>
> This doesn't seem right..
>
> The argument to the walker should be the starting encoded of the table
> it is about to walk.
>
> Since everything always starts at 0 it should always be
>   start | (i << level_shift)
>
> ?
>
> > +                     next_tree = (struct kho_radix_tree *)
> > +                             phys_to_virt(root->table[i]);
> > +                     err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
> >                       if (err)
> >                               return err;
> >               }
> >       }
> >
> > +     return 0;
> > +}
> >
> > +static int kho_memblock_reserve(phys_addr_t pa, int order)
> > +{
> > +     int sz = 1 << (order + PAGE_SHIFT);
> > +     struct page *page = phys_to_page(pa);
> > +
> > +     memblock_reserve(pa, sz);
> > +     memblock_reserved_mark_noinit(pa, sz);
> > +     page->private = order;
> >
> >       return 0;
> >  }
> >
> > +static int kho_radix_walk_trees_callback(unsigned long encoded)
> > +{
> > +     unsigned int order;
> > +     unsigned long pa;
> > +
> > +     pa = kho_radix_decode(encoded, &order);
> > +
> > +     return kho_memblock_reserve(pa, order);
> > +}
> > +
> > +struct kho_serialization {
> > +     struct page *fdt;
> > +     struct list_head fdt_list;
> > +     struct dentry *sub_fdt_dir;
> > +};
> > +
> > +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> > +{
> > +     unsigned long pa = PFN_PHYS(pfn);
>
> phys_addr_t
>
> Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ