lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHN2nPKjg6=0QTcSoptxvQW9MpP0YwGUTx_gQDBxgCtSzxY5Pg@mail.gmail.com>
Date: Wed, 8 Oct 2025 19:07:18 -0700
From: Jason Miu <jasonmiu@...gle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Alexander Graf <graf@...zon.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	Mike Rapoport <rppt@...nel.org>, Pasha Tatashin <pasha.tatashin@...een.com>, 
	Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures

Hi Jason,

Thank you very much for your feedback again.

On Mon, Oct 6, 2025 at 7:14 AM Jason Gunthorpe <jgg@...dia.com> wrote:
> >  #define KHO_FDT_COMPATIBLE "kho-v1"
>
> We don't bump this?
Will do. Will be "kho-v2".

>
> > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> >  #define PROP_SUB_FDT "fdt"
>
> I'de really like to see all of these sorts of definitions in some
> structured ABI header not open coded all over the place..

Do you think `include/linux/kexec_handover.h` is the appropriate
place, or would you prefer a new, dedicated ABI header (e.g., in
`include/uapi/linux/`) for all KHO-related FDT constants?


>
> >   */
> > +struct kho_radix_tree {
> > +     unsigned long table[PAGE_SIZE / sizeof(unsigned long)];
>
> This should be phys_addr_t.
>
> > +};
>
> You dropped the macros so now we don't know these are actually
> pointers to 'struct kho_radix_tree'
>

Agreed. Will change `u64` according to Pasha's comment. And we use
explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
believe this, along with the `u64` type, makes it clear that the table
stores physical addresses.


> > +static int kho_radix_tree_max_depth(void)
> > +{
> > +     int page_offset_bit_num = BITS_PER_LONG - PAGE_SHIFT;
> > +     int order_bit_num = ilog2(__roundup_pow_of_two(page_offset_bit_num));
> > +     int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
> > +     int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     int table_level_num = DIV_ROUND_UP(page_offset_bit_num -
> > +                                        bitmap_bit_num + order_bit_num,
> > +                                        table_bit_num);
>
> All should be unsigned int. Below I suggest to put it in an enum and
> use different names.. And since the function is constant it can just
> be an enum TOP_LEVEL too.
>

Yes I did think of returning a const for `kho_radix_tree_max_depth()`.
I think using enums is a better idea and I can place all above values
as enums.


> > + */
> > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
>
> pa is phys_addr_t in the kernel, never unsigned long.
>
> If you want to make it all dynamic then this should be phys_addr_t

Should this also be `u64`, or we stay with `phys_addr_t` for all page
addresses?

>
> > +{
> > +     unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
>
> And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> may not exist.
>
> Use an enum ORDER_0_LG2 maybe

I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
to understand, what do you think? This constant, along with others,
will be placed in the enum.

>
> > +     unsigned long l = pa >> (PAGE_SHIFT + order);
> >
> > +     return h | l;
> > +}
> >
> > +static unsigned long kho_radix_decode(unsigned long encoded, unsigned int *order)
>
> Returns phys_addr_t
>
> >  {
> > -     void *elm, *res;
> > +     unsigned long order_bit = fls64(encoded);
>
> unsigned int
>
> > +     unsigned long pa;
>
> phys_addr_t
>
> > +     *order = BITS_PER_LONG - PAGE_SHIFT - order_bit + 1;
>
> ORDER_0_LG2
>
> > +     pa = encoded << (PAGE_SHIFT + *order);
>
> I'd add a comment that the shift always discards order.
>
> > +     return pa;
> > +}
> >
> > +static unsigned long kho_radix_get_index(unsigned long encoded, int level)
>
> unsigned int level
>
> > +{
> > +     int table_bit_num = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     int bitmap_bit_num = PAGE_SHIFT + ilog2(BITS_PER_BYTE);
>
> Stick all the constants that kho_radix_tree_max_depth() are computing
> in an enum instead of recomputing them..
>
> > +     unsigned long mask;
> > +     int s;
>
> unsigned for all of these.
>
> > +
> > +     if (level == 1) {
>
> I think the math is easier if level 0 == bitmap..
>
> > +             s = 0;
> > +             mask = (1UL << bitmap_bit_num) - 1;
> > +     } else {
> > +             s = ((level - 2) * table_bit_num) + bitmap_bit_num;
>
> eg here you are doing level-2 which is a bit weird only because of the
> arbitary choice to make level=1 be the bitmap.
>
> I'd also use some different names
>
> table_bit_num == TABLE_SIZE_LG2
> BITMAP_BIT_NUM = BITMAP_SIZE_LG2
>
> Log2 designates the value is 1<<LG2

Good point on the level numbering, we'll switch to 0-based where level
0 is the bitmap. The modulo operations you suggested play nicely with
the 0-based numbering too, thanks. Will also update the names and put
them in enum.
>
> > +             mask = (1UL << table_bit_num) - 1;
> >       }
> >
> > -     return elm;
> > +     return (encoded >> s) & mask;
>
> It is just:
>
> return encoded % (1 << BITMAP_SIZE_LG2);
> return (encoded >> s) % (1 << TABLE_SIZE_LG2);
>
> The compiler is smart enough to choose bit logic if that is the
> fastest option and the above is more readable.
>
> > +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb, unsigned long offset)
> >  {
> > +     if (!bit_tlb ||
> > +         offset >= PAGE_SIZE * BITS_PER_BYTE)
> > +             return -EINVAL;
> >
> > +     set_bit(offset, bit_tlb->bitmaps);
>
> set_bit is an atomic, you want __set_bit()
>
> > +     return 0;
> > +}
> >
> > +static int kho_radix_preserve_page(unsigned long pa, unsigned int order)
>
> phys_addr_t
>
> > +{
> > +     unsigned long encoded = kho_radix_encode(pa, order);
> > +     int num_tree_level = kho_radix_tree_max_depth();
>
> kho_radix_tree_max_depth() is constant, stick it in an enum with the
> rest of them.
>
> > +     struct kho_radix_tree *current_tree, *new_tree;
> > +     struct kho_bitmap_table *bitmap_table;
> > +     int err = 0;
> > +     int i, idx;
>
> various unsigned int.
>
> >
> > +     down_write(&kho_radix_tree_root_sem);
> >
> > +     current_tree = kho_radix_tree_root;
> >
> > +     /* Go from high levels to low levels */
> > +     for (i = num_tree_level; i >= 1; i--) {
> > +             idx = kho_radix_get_index(encoded, i);
> > +
> > +             if (i == 1) {
> > +                     bitmap_table = (struct kho_bitmap_table *)current_tree;
> > +                     err = kho_radix_set_bitmap(bitmap_table, idx);
> > +                     goto out;
> > +             }
> > +
> > +             if (!current_tree->table[idx]) {
> > +                     new_tree = kho_alloc_radix_tree();
> > +                     if (!new_tree) {
> > +                             err = -ENOMEM;
> > +                             goto out;
> > +                     }
> > +
> > +                     current_tree->table[idx] =
> > +                             (unsigned long)virt_to_phys(new_tree);
>
> current_tree = new_tree
> > +             }
>
> else
>
> > +
> > +             current_tree = (struct kho_radix_tree *)
> > +                     phys_to_virt(current_tree->table[idx]);
> >       }
> > +
> > +out:
> > +     up_write(&kho_radix_tree_root_sem);
> > +     return err;
> >  }
> >
> > +static int kho_radix_walk_bitmaps(struct kho_bitmap_table *bit_tlb,
> > +                               unsigned long offset,
>
> phys_addr_t
>
> > +                               kho_radix_tree_walk_callback_t cb)
> >  {
> > +     unsigned long encoded = offset << (PAGE_SHIFT + ilog2(BITS_PER_BYTE));
> > +     unsigned long *bitmap = (unsigned long *)bit_tlb;
> > +     int err = 0;
> > +     int i;
> >
> > +     for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) {
> > +             err = cb(encoded | i);
> > +             if (err)
> > +                     return err;
> > +     }
> >
> > +     return 0;
> > +}
> >
> > +static int kho_radix_walk_trees(struct kho_radix_tree *root, int level,
>
> unsigned int
>
> > +                             unsigned long offset,
>
> phys_addr_t.  I would call this start not offset..
>
> > +                             kho_radix_tree_walk_callback_t cb)
> > +{
> > +     int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > +     struct kho_radix_tree *next_tree;
> > +     unsigned long encoded, i;
> > +     int err = 0;
> >
> > +     if (level == 1) {
> > +             encoded = offset;
> > +             return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > +                                           encoded, cb);
>
> Better to do this in the caller  a few lines below

But the caller is in a different tree level? Should we only walk the
bitmaps at the lowest level?

>
> > +     }
>
> >
> > +     for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > +             if (root->table[i]) {
> > +                     encoded = offset << level_shift | i;
>
> This doesn't seem right..
>
> The argument to the walker should be the starting encoded of the table
> it is about to walk.
>
> Since everything always starts at 0 it should always be
>   start | (i << level_shift)
>
> ?

You're right that this line might not be immediately intuitive. The
var `level_shift` (which is constant value 9 here) is applied to the
*accumulated* `offset` from the parent level. Let's consider an
example of a preserved page at physical address `0x1000`, which
encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
for page 1).

If we were to use `start | (i << level_shift)` where `level_shift` is
a constant 9, and `start` is the value from the parent call:
  - At Level 6, `start` is `0`. `i` is 2 as bit 51:59 = 2. Result: `0
| (2 << 9) = 0x400`. This is passed to Level 5.
  - At Level 5, `start` is `0x400`, `i` is 0 as bit 50:42 = 0. Result:
`0x400 | (0 << 9) = 0x400`. This is passed to Level 4.
  - At Level 4, `start` is `0x400`, `i` is 0 as bit 33:41 = 0. Result:
`0x400 | (0 << 9) = 0x400`. And so on.

As you can see, the encoded value is not correctly reconstructed. This
will work if the `level_shift` represents the total shift from the LSB
for each specific level, but it is not the case here.

I will, however, add a detailed comment to `kho_radix_walk_trees` to
clarify this logic. I also agree to change the name of `offset` to
make it more clearer, how about `base_encoded`, or do you still prefer
`start`?

>
> > +                     next_tree = (struct kho_radix_tree *)
> > +                             phys_to_virt(root->table[i]);
> > +                     err = kho_radix_walk_trees(next_tree, level - 1, encoded, cb);
> >                       if (err)
> >                               return err;
> >               }
> >       }
> >
> > +     return 0;
> > +}
> >
> > +static int kho_memblock_reserve(phys_addr_t pa, int order)
> > +{
> > +     int sz = 1 << (order + PAGE_SHIFT);
> > +     struct page *page = phys_to_page(pa);
> > +
> > +     memblock_reserve(pa, sz);
> > +     memblock_reserved_mark_noinit(pa, sz);
> > +     page->private = order;
> >
> >       return 0;
> >  }
> >
> > +static int kho_radix_walk_trees_callback(unsigned long encoded)
> > +{
> > +     unsigned int order;
> > +     unsigned long pa;
> > +
> > +     pa = kho_radix_decode(encoded, &order);
> > +
> > +     return kho_memblock_reserve(pa, order);
> > +}
> > +
> > +struct kho_serialization {
> > +     struct page *fdt;
> > +     struct list_head fdt_list;
> > +     struct dentry *sub_fdt_dir;
> > +};
> > +
> > +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> > +{
> > +     unsigned long pa = PFN_PHYS(pfn);
>
> phys_addr_t
>
> Jason

Will do the update in the next patch version. Thanks again.

--
Jason Miu

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ