[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251023174317.GO262900@nvidia.com>
Date: Thu, 23 Oct 2025 14:43:17 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jason Miu <jasonmiu@...gle.com>
Cc: Alexander Graf <graf@...zon.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
David Matlack <dmatlack@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Mike Rapoport <rppt@...nel.org>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 1/3] kho: Adopt KHO radix tree data structures
On Mon, Oct 20, 2025 at 03:03:04AM -0700, Jason Miu wrote:
> - * Keep track of memory that is to be preserved across KHO.
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * | Level 5 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * v
> + * +-------------------+
> + * | Level 4 | (struct kho_radix_tree)
> + * +-------------------+
> + * |
> + * | ... (intermediate levels)
> + * |
> + * v
> + * +-------------------+
> + * | Level 0 | (struct kho_bitmap_table)
> + * +-------------------+
> *
> - * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.
I think it is valuable to preserve this justification for
bitmaps. There was a lot of debate over bitmaps vs ranges and this is
the advantage of bitmaps. It is a bit subtle..
> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + * <-- Higher Bits ------------------------------------ Lower Bits -->
> + * +--------------------------+-----------------------------------------+
> + * | Order Bits | Page Offset |
> + * +--------------------------+-----------------------------------------+
> + * | ... 0 0 1 0 0 ... | pa >> (PAGE_SHIFT + order) |
> + * +--------------------------+-----------------------------------------+
> + * ^
> + * |
> + * This single '1' bit's position
> + * uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51,
> + * i.e. shifting right, without conflicting with the page offset bits.
This description is correct, but the diagram is misleading. Maybe like this:
* |0| ... 0 0 0 1 | pa >> (PAGE_SHIFT + 0) |
* |1| ... 0 0 0 0 1 | pa >> (PAGE_SHIFT + 1) |
* |2| ... 0 0 0 0 0 1 | pa >> (PAGE_SHIFT + 2) |
[..]
> + *
> + * This design stores pages of all sizes (orders) in a single 6-level table. It
> + * efficiently shares lower table levels, especially due to common zero top
> + * address bits, allowing a single, efficient algorithm to manage all pages.
> + */
> +enum kho_radix_consts {
> + /* The bit position of a 0-order page, only supports 64bits system */
> + ORDER_0_LG2 = 64 - PAGE_SHIFT,
> + /* Bit number used for each level of tables */
> + TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> + /* Bit number used for lowest level of bitmaps */
> + BITMAP_SIZE_LG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),
"bit number" is a bit confusing when using a lg2 terms..
/* Size of the table in kho_radix_tree, in lg2 */
TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t))
/* Number of bits in the kho_bitmap_table, in lg2 */
BITMAP_SIZE_LG2 = const_ilog2(BITS_PER_BYTE * PAGE_SIZE)
Then use the constants in the structs:
struct kho_radix_tree {
phys_addr_t table[1 << TABLE_SIZE_LG2];
};
struct kho_bitmap_table {
unsigned long bitmaps[(1 << BITMAP_SIZE_LG2) / BITS_PER_LONG];
};
> -struct khoser_mem_chunk;
> +static inline phys_addr_t kho_radix_tree_desc(struct kho_radix_tree *va)
> +{
> + return (phys_addr_t)virt_to_phys(va);
> +}
virt_to_phys already returns phys_addr_t ?
> +static inline struct kho_radix_tree *kho_radix_tree(phys_addr_t desc)
> +{
> + return (struct kho_radix_tree *)(phys_to_virt(desc));
> +}
phys_to_virt returns void *, no need for a cast
> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb,
> + unsigned long offset)
> +{
> + if (!bit_tlb ||
> + offset >= PAGE_SIZE * BITS_PER_BYTE)
> + return -EINVAL;
WARN_ON ? These are assertions aren't they?
> +static int kho_radix_walk_trees(struct kho_radix_tree *root, unsigned int level,
> + unsigned long start,
> + kho_radix_tree_walk_callback_t cb)
> +{
> + struct kho_radix_tree *next_tree;
> + struct kho_bitmap_table *bitmap_table;
> + unsigned long encoded, i;
> + unsigned int level_shift;
> + int err = 0;
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
> + if (root->table[i]) {
if (!root->table[i])
continue
Remove a level of indentation here
> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> + unsigned long pa = PFN_PHYS(pfn);
> +
> + might_sleep();
> +
> + return kho_radix_preserve_page(pa, order);
might_sleep should be in kho_radix_preserve_page() ? The might should
be placed around if statements that might avoid a sleep, and that is
not in this function.
> static void __init kho_mem_deserialize(const void *fdt)
> {
> + struct kho_radix_tree *tree_root;
> const phys_addr_t *mem;
> int len;
>
> + /* Retrieve the KHO radix tree from passed-in FDT. */
> + mem = fdt_getprop(fdt, 0, PROP_PRESERVED_PAGE_RADIX_TREE, &len);
>
> if (!mem || len != sizeof(*mem)) {
> + pr_err("failed to get preserved KHO memory tree\n");
> return;
> }
>
> + tree_root = *mem ?
> + (struct kho_radix_tree *)phys_to_virt(*mem) :
> + NULL;
>
> + if (!tree_root)
> + return;
Seems weird?
if (!*mem)
return;
tree_root = phys_to_virt(*mem)
kho_radix_walk_trees(tree_root, TREE_MAX_DEPTH - 1, 0,
kho_radix_walk_trees_memblock_callback);
This is prettty good now, IMHO
Jason
Powered by blists - more mailing lists