[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHN2nPKjg6=0QTcSoptxvQW9MpP0YwGUTx_gQDBxgCtSzxY5Pg@mail.gmail.com>
Date: Wed, 8 Oct 2025 19:07:18 -0700
From: Jason Miu <jasonmiu@...gle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Alexander Graf <graf@...zon.com>, Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>,
Mike Rapoport <rppt@...nel.org>, Pasha Tatashin <pasha.tatashin@...een.com>,
Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
Hi Jason,
Thank you very much for your feedback again.
On Mon, Oct 6, 2025 at 7:14 AM Jason Gunthorpe <jgg@...dia.com> wrote:
> > #define KHO_FDT_COMPATIBLE "kho-v1"
>
> We don't bump this?
Will do. Will be "kho-v2".
>
> > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> > #define PROP_SUB_FDT "fdt"
>
> I'de really like to see all of these sorts of definitions in some
> structured ABI header not open coded all over the place..
Do you think `include/linux/kexec_handover.h` is the appropriate
place, or would you prefer a new, dedicated ABI header (e.g., in
`include/uapi/linux/`) for all KHO-related FDT constants?
>
> > */
> > +struct kho_radix_tree {
> > + unsigned long table[PAGE_SIZE / sizeof(unsigned long)];
>
> This should be phys_addr_t.
>
> > +};
>
> You dropped the macros so now we don't know these are actually
> pointers to 'struct kho_radix_tree'
>
Agreed. Will change `u64` according to Pasha's comment. And we use
explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
believe this, along with the `u64` type, makes it clear that the table
stores physical addresses.
> > +static int kho_radix_tree_max_depth(void)
> > +{
> > + int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> > + int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> > + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> > + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > + int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> > + bitmap_bit_num + order_bit_num,
> > + table_bit_num);
>
> All should be unsigned int. Below I suggest to put it in an enum and
> use different names.. And since the function is constant it can just
> be an enum TOP_LEVEL too.
>
Yes I did think of returning a const for `kho_radix_tree_max_depth()`.
I think using enums is a better idea and I can place all above values
as enums.
> > + */
> > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
>
> pa is phys_addr_t in the kernel, never unsigned long.
>
> If you want to make it all dynamic then this should be phys_addr_t
Should this also be `u64`, or we stay with `phys_addr_t` for all page
addresses?
>
> > +{
> > + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
>
> And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> may not exist.
>
> Use an enum ORDER_0_LG2 maybe
I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
to understand, what do you think? This constant, along with others,
will be placed in the enum.
>
> > + unsigned long l = pa >> (PAGE_SHIFT + order);
> >
> > + return h | l;
> > +}
> >
> > +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)
>
> Returns phys_addr_t
>
> > {
> > - void *elm, *res;
> > + unsigned long order_bit = fls64(encoded);
>
> unsigned int
>
> > + unsigned long pa;
>
> phys_addr_t
>
> > + *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;
>
> ORDER_0_LG2
>
> > + pa = encoded << (PAGE_SHIFT + *order);
>
> I'd add a comment that the shift always discards order.
>
> > + return pa;
> > +}
> >
> > +static unsigned long kho_radix_get_index(unsigned long encoded, int level)
>
> unsigned int level
>
> > +{
> > + int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > + int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
>
> Stick all the constants that kho_radix_tree_max_depth() are computing
> in an enum instead of recomputing them..
>
> > + unsigned long mask;
> > + int s;
>
> unsigned for all of these.
>
> > +
> > + if (level == 1) {
>
> I think the math is easier if level 0 == bitmap..
>
> > + s = 0;
> > + mask = (1UL << bitmap_bit_num) - 1;
> > + } else {
> > + s = ((level - 2) * table_bit_num) + bitmap_bit_num;
>
> eg here you are doing level-2 which is a bit weird only because of the
> arbitary choice to make level=1 be the bitmap.
>
> I'd also use some different names
>
> table_bit_num == TABLE_SIZE_LG2
> BITMAP_BIT_NUM = BITMAP_SIZE_LG2
>
> Log2 designates the value is 1<<LG2
Good point on the level numbering, we'll switch to 0-based where level
0 is the bitmap. The modulo operations you suggested play nicely with
the 0-based numbering too, thanks. Will also update the names and put
them in enum.
>
> > + mask = (1UL << table_bit_num) - 1;
> > }
> >
> > - return elm;
> > + return (encoded >> s) & mask;
>
> It is just:
>
> return encoded % (1 << BITMAP_SIZE_LG2);
> return (encoded >> s) % (1 << TABLE_SIZE_LG2);
>
> The compiler is smart enough to choose bit logic if that is the
> fastest option and the above is more readable.
>
> > +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
> > {
> > + if (!bit_tlb ||
> > + offset >= PAGE_SIZE * BITS_PER_BYTE)
> > + return -EINVAL;
> >
> > + set_bit(offset, bit_tlb->bitmaps);
>
> set_bit is an atomic, you want __set_bit()
>
> > + return 0;
> > +}
> >
> > +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)
>
> phys_addr_t
>
> > +{
> > + unsigned long encoded = kho_radix_encode(pa, order);
> > + int num_tree_level = kho_radix_tree_max_depth();
>
> kho_radix_tree_max_depth() is constant, stick it in an enum with the
> rest of them.
>
> > + struct kho_radix_tree *current_tree, *new_tree;
> > + struct kho_bitmap_table *bitmap_table;
> > + int err = 0;
> > + int i, idx;
>
> various unsigned int.
>
> >
> > + down_write(&kho_radix_tree_root_sem);
> >
> > + current_tree = kho_radix_tree_root;
> >
> > + /* Go from high levels to low levels */
> > + for (i = num_tree_level; i >= 1; i--) {
> > + idx = kho_radix_get_index(encoded, i);
> > +
> > + if (i == 1) {
> > + bitmap_table = (struct kho_bitmap_table *)current_tree;
> > + err = kho_radix_set_bitmap(bitmap_table, idx);
> > + goto out;
> > + }
> > +
> > + if (!current_tree->table[idx]) {
> > + new_tree = kho_alloc_radix_tree();
> > + if (!new_tree) {
> > + err = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + current_tree->table[idx] =
> > + (unsigned long)virt_to_phys(new_tree);
>
> current_tree = new_tree
> > + }
>
> else
>
> > +
> > + current_tree = (struct kho_radix_tree *)
> > + phys_to_virt(current_tree->table[idx]);
> > }
> > +
> > +out:
> > + up_write(&kho_radix_tree_root_sem);
> > + return err;
> > }
> >
> > +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> > + unsigned long offset,
>
> phys_addr_t
>
> > + kho_radix_tree_walk_callback_t cb)
> > {
> > + unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> > + unsigned long *bitmap = (unsigned long *)bit_tlb;
> > + int err = 0;
> > + int i;
> >
> > + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> > + err = cb(encoded | i);
> > + if (err)
> > + return err;
> > + }
> >
> > + return 0;
> > +}
> >
> > +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,
>
> unsigned int
>
> > + unsigned long offset,
>
> phys_addr_t. I would call this start not offset..
>
> > + kho_radix_tree_walk_callback_t cb)
> > +{
> > + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > + struct kho_radix_tree *next_tree;
> > + unsigned long encoded, i;
> > + int err = 0;
> >
> > + if (level == 1) {
> > + encoded = offset;
> > + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > + encoded, cb);
>
> Better to do this in the caller a few lines below
But the caller is in a different tree level? Should we only walk the
bitmaps at the lowest level?
>
> > + }
>
> >
> > + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > + if (root->table[i]) {
> > + encoded = offset << level_shift | i;
>
> This doesn't seem right..
>
> The argument to the walker should be the starting encoded of the table
> it is about to walk.
>
> Since everything always starts at 0 it should always be
> start | (i << level_shift)
>
> ?
You're right that this line might not be immediately intuitive. The
var `level_shift` (which is constant value 9 here) is applied to the
*accumulated* `offset` from the parent level. Let's consider an
example of a preserved page at physical address `0x1000`, which
encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
for page 1).
If we were to use `start | (i << level_shift)` where `level_shift` is
a constant 9, and `start` is the value from the parent call:
- At Level 6, `start` is `0`. `i` is 2 as bit 51:59 = 2. Result: `0
| (2 << 9) = 0x400`. This is passed to Level 5.
- At Level 5, `start` is `0x400`, `i` is 0 as bit 50:42 = 0. Result:
`0x400 | (0 << 9) = 0x400`. This is passed to Level 4.
- At Level 4, `start` is `0x400`, `i` is 0 as bit 33:41 = 0. Result:
`0x400 | (0 << 9) = 0x400`. And so on.
As you can see, the encoded value is not correctly reconstructed. This
will work if the `level_shift` represents the total shift from the LSB
for each specific level, but it is not the case here.
I will, however, add a detailed comment to `kho_radix_walk_trees` to
clarify this logic. I also agree to change the name of `offset` to
make it more clearer, how about `base_encoded`, or do you still prefer
`start`?
>
> > + next_tree = (struct kho_radix_tree *)
> > + phys_to_virt(root->table[i]);
> > + err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
> > if (err)
> > return err;
> > }
> > }
> >
> > + return 0;
> > +}
> >
> > +static int kho_memblock_reserve(phys_addr_t pa, int order)
> > +{
> > + int sz = 1 << (order + PAGE_SHIFT);
> > + struct page *page = phys_to_page(pa);
> > +
> > + memblock_reserve(pa, sz);
> > + memblock_reserved_mark_noinit(pa, sz);
> > + page->private = order;
> >
> > return 0;
> > }
> >
> > +static int kho_radix_walk_trees_callback(unsigned long encoded)
> > +{
> > + unsigned int order;
> > + unsigned long pa;
> > +
> > + pa = kho_radix_decode(encoded, &order);
> > +
> > + return kho_memblock_reserve(pa, order);
> > +}
> > +
> > +struct kho_serialization {
> > + struct page *fdt;
> > + struct list_head fdt_list;
> > + struct dentry *sub_fdt_dir;
> > +};
> > +
> > +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> > +{
> > + unsigned long pa = PFN_PHYS(pfn);
>
> phys_addr_t
>
> Jason
Will do the update in the next patch version. Thanks again.
--
Jason Miu
Powered by blists - more mailing lists