lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHN2nPKbxM5Bc9SDHHyvHbUPvsatMa6W59tXuVwfytxF7v+DRg@mail.gmail.com>
Date: Tue, 28 Oct 2025 18:28:02 -0700
From: Jason Miu <jasonmiu@...gle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Alexander Graf <graf@...zon.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	Mike Rapoport <rppt@...nel.org>, Pasha Tatashin <pasha.tatashin@...een.com>, 
	Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 1/3] kho: Adopt KHO radix tree data structures

On Thu, Oct 23, 2025 at 10:43 AM Jason Gunthorpe <jgg@...dia.com> wrote:
>
> On Mon, Oct 20, 2025 at 03:03:04AM -0700, Jason Miu wrote:
> > - * Keep track of memory that is to be preserved across KHO.
> > + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> > + * structure that starts with a single root `kho_radix_tree`. This single
> > + * tree stores pages of all orders.
> > + *
> > + * This is achieved by encoding the page's physical address and its order into
> > + * a single `unsigned long` value. This encoded value is then used to traverse
> > + * the tree.
> > + *
> > + * The tree hierarchy is shown below:
> > + *
> > + * kho_radix_tree_root
> > + * +-------------------+
> > + * |     Level 5       | (struct kho_radix_tree)
> > + * +-------------------+
> > + *   |
> > + *   v
> > + * +-------------------+
> > + * |     Level 4       | (struct kho_radix_tree)
> > + * +-------------------+
> > + *   |
> > + *   | ... (intermediate levels)
> > + *   |
> > + *   v
> > + * +-------------------+
> > + * |      Level 0      | (struct kho_bitmap_table)
> > + * +-------------------+
> >   *
> > - * The serializing side uses two levels of xarrays to manage chunks of per-order
> > - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> > - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> > - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> > - * 512K of bitmap memory will be needed for order 0.
>
> I think it is valuable to preserve this justification for
> bitmaps. There was a lot of debate over bitmaps vs ranges and this is
> the advantage of bitmaps. It is a bit subtle..

Sure, I will update the description about using the bitmap, along with
the new values.


> > +/*
> > + * The KHO radix tree tracks preserved pages by encoding a page's physical
> > + * address (pa) and its order into a single unsigned long value. This value
> > + * is then used to traverse the tree. The encoded value is composed of two
> > + * parts: the 'order bits' in the upper part and the 'page offset' in the
> > + * lower part.
> > + *
> > + *   <-- Higher Bits ------------------------------------ Lower Bits -->
> > + *  +--------------------------+-----------------------------------------+
> > + *  |        Order Bits        |               Page Offset               |
> > + *  +--------------------------+-----------------------------------------+
> > + *  | ... 0 0 1 0 0 ...        | pa >> (PAGE_SHIFT + order)              |
> > + *  +--------------------------+-----------------------------------------+
> > + *            ^
> > + *            |
> > + *  This single '1' bit's position
> > + *  uniquely identifies the 'order'.
> > + *
> > + *
> > + * Page Offset:
> > + * The 'page offset' is the physical address normalized for its order. It
> > + * effectively represents the page offset for the given order.
> > + *
> > + * Order Bits:
> > + * The 'order bits' encode the page order by setting a single bit at a
> > + * specific position. The position of this bit itself represents the order.
> > + *
> > + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> > + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> > + * offset occupies bits [0-51]. For order 0, the order bit is set at
> > + * position 52.
> > + *
> > + * As the order increases, the number of bits required for the 'page offset'
> > + * decreases. For example, order 1 requires one less bit for its page
> > + * offset. This allows its order bit to be set at position 51,
> > + * i.e. shifting right, without conflicting with the page offset bits.
>
> This description is correct, but the diagram is misleading. Maybe like this:
>
>  *  |0| ... 0 0 0 1          | pa >> (PAGE_SHIFT + 0)              |
>  *  |1| ... 0 0 0 0 1        | pa >> (PAGE_SHIFT + 1)              |
>  *  |2| ... 0 0 0 0 0 1      | pa >> (PAGE_SHIFT + 2)              |
>  [..]
>

I see, giving new examples makes the order bit positions more clearer.

>
> > + *
> > + * This design stores pages of all sizes (orders) in a single 6-level table.  It
> > + * efficiently shares lower table levels, especially due to common zero top
> > + * address bits, allowing a single, efficient algorithm to manage all pages.
> > + */
> > +enum kho_radix_consts {
> > +     /* The bit position of a 0-order page, only supports 64bits system */
> > +     ORDER_0_LG2 = 64 - PAGE_SHIFT,
> > +     /* Bit number used for each level of tables */
> > +     TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> > +     /* Bit number used for lowest level of bitmaps */
> > +     BITMAP_SIZE_LG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
>
> "bit number" is a bit confusing when using a lg2 terms..
>
>         /* Size of the table in kho_radix_tree, in lg2 */
>         TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t))
>
>         /* Number of bits in the kho_bitmap_table, in lg2 */
>         BITMAP_SIZE_LG2 = const_ilog2(BITS_PER_BYTE * PAGE_SIZE)
>
> Then use the constants in the structs:
>
> struct kho_radix_tree {
>        phys_addr_t table[1 << TABLE_SIZE_LG2];
> };
> struct kho_bitmap_table {
>        unsigned long bitmaps[(1 << BITMAP_SIZE_LG2) / BITS_PER_LONG];
> };
>
> > -struct khoser_mem_chunk;
> > +static inline phys_addr_t kho_radix_tree_desc(struct kho_radix_tree *va)
> > +{
> > +     return (phys_addr_t)virt_to_phys(va);
> > +}
>
> virt_to_phys already returns phys_addr_t ?
>
> > +static inline struct kho_radix_tree *kho_radix_tree(phys_addr_t desc)
> > +{
> > +     return (struct kho_radix_tree *)(phys_to_virt(desc));
> > +}
>
> phys_to_virt returns void *, no need for a cast
>

Yup, will update the types in both functions.

> > +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb,
> > +                             unsigned long offset)
> > +{
> > +     if (!bit_tlb ||
> > +         offset >= PAGE_SIZE * BITS_PER_BYTE)
> > +             return -EINVAL;
>
> WARN_ON ? These are assertions aren't they?

Yes, will add an assertion here. Or should we use "BUG_ON"? As if this
error happened something is quite wrong and I do not think it is
recoverable.

>
> > +static int kho_radix_walk_trees(struct kho_radix_tree *root, unsigned int level,
> > +                             unsigned long start,
> > +                             kho_radix_tree_walk_callback_t cb)
> > +{
> > +     struct kho_radix_tree *next_tree;
> > +     struct kho_bitmap_table *bitmap_table;
> > +     unsigned long encoded, i;
> > +     unsigned int level_shift;
> > +     int err = 0;
> > +
> > +     for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
> > +             if (root->table[i]) {
>
>
> if (!root->table[i])
>    continue
>
> Remove a level of indentation here
>
> > +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> > +{
> > +     unsigned long pa = PFN_PHYS(pfn);
> > +
> > +     might_sleep();
> > +
> > +     return kho_radix_preserve_page(pa, order);
>
> might_sleep should be in kho_radix_preserve_page() ? The might should
> be placed around if statements that might avoid a sleep, and that is
> not in this function.
>

I got another feedback that we can merge the kho_radix_preserve_page()
__kho_preserve_order(), so the might_sleep() will be in the if
statements.

> >  static void __init kho_mem_deserialize(const void *fdt)
> >  {
> > +     struct kho_radix_tree *tree_root;
> >       const phys_addr_t *mem;
> >       int len;
> >
> > +     /* Retrieve the KHO radix tree from passed-in FDT. */
> > +     mem = fdt_getprop(fdt, 0, PROP_PRESERVED_PAGE_RADIX_TREE, &len);
> >
> >       if (!mem || len != sizeof(*mem)) {
> > +             pr_err("failed to get preserved KHO memory tree\n");
> >               return;
> >       }
> >
> > +     tree_root = *mem ?
> > +             (struct kho_radix_tree *)phys_to_virt(*mem) :
> > +             NULL;
> >
> > +     if (!tree_root)
> > +             return;
>
> Seems weird?
>
> if (!*mem)
>    return;
>
> tree_root = phys_to_virt(*mem)
> kho_radix_walk_trees(tree_root, TREE_MAX_DEPTH - 1, 0,
>                              kho_radix_walk_trees_memblock_callback);
>

Yup, this looks more natural, my mind was thinking to check tree_root.

>
> This is prettty good now, IMHO
>
> Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ