lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251023174317.GO262900@nvidia.com>
Date: Thu, 23 Oct 2025 14:43:17 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jason Miu <jasonmiu@...gle.com>
Cc: Alexander Graf <graf@...zon.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
	David Matlack <dmatlack@...gle.com>,
	David Rientjes <rientjes@...gle.com>,
	Mike Rapoport <rppt@...nel.org>,
	Pasha Tatashin <pasha.tatashin@...een.com>,
	Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 1/3] kho: Adopt KHO radix tree data structures

On Mon, Oct 20, 2025 at 03:03:04AM -0700, Jason Miu wrote:
> - * Keep track of memory that is to be preserved across KHO.
> + * The KHO radix tree tracks preserved memory pages. It is a hierarchical
> + * structure that starts with a single root `kho_radix_tree`. This single
> + * tree stores pages of all orders.
> + *
> + * This is achieved by encoding the page's physical address and its order into
> + * a single `unsigned long` value. This encoded value is then used to traverse
> + * the tree.
> + *
> + * The tree hierarchy is shown below:
> + *
> + * kho_radix_tree_root
> + * +-------------------+
> + * |     Level 5       | (struct kho_radix_tree)
> + * +-------------------+
> + *   |
> + *   v
> + * +-------------------+
> + * |     Level 4       | (struct kho_radix_tree)
> + * +-------------------+
> + *   |
> + *   | ... (intermediate levels)
> + *   |
> + *   v
> + * +-------------------+
> + * |      Level 0      | (struct kho_bitmap_table)
> + * +-------------------+
>   *
> - * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.

I think it is valuable to preserve this justification for
bitmaps. There was a lot of debate over bitmaps vs ranges and this is
the advantage of bitmaps. It is a bit subtle..

> +/*
> + * The KHO radix tree tracks preserved pages by encoding a page's physical
> + * address (pa) and its order into a single unsigned long value. This value
> + * is then used to traverse the tree. The encoded value is composed of two
> + * parts: the 'order bits' in the upper part and the 'page offset' in the
> + * lower part.
> + *
> + *   <-- Higher Bits ------------------------------------ Lower Bits -->
> + *  +--------------------------+-----------------------------------------+
> + *  |        Order Bits        |               Page Offset               |
> + *  +--------------------------+-----------------------------------------+
> + *  | ... 0 0 1 0 0 ...        | pa >> (PAGE_SHIFT + order)              |
> + *  +--------------------------+-----------------------------------------+
> + *            ^
> + *            |
> + *  This single '1' bit's position
> + *  uniquely identifies the 'order'.
> + *
> + *
> + * Page Offset:
> + * The 'page offset' is the physical address normalized for its order. It
> + * effectively represents the page offset for the given order.
> + *
> + * Order Bits:
> + * The 'order bits' encode the page order by setting a single bit at a
> + * specific position. The position of this bit itself represents the order.
> + *
> + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the
> + * maximum range for a page offset (for order 0) is 52 bits (64 - 12). This
> + * offset occupies bits [0-51]. For order 0, the order bit is set at
> + * position 52.
> + *
> + * As the order increases, the number of bits required for the 'page offset'
> + * decreases. For example, order 1 requires one less bit for its page
> + * offset. This allows its order bit to be set at position 51,
> + * i.e. shifting right, without conflicting with the page offset bits.

This description is correct, but the diagram is misleading. Maybe like this:

 *  |0| ... 0 0 0 1          | pa >> (PAGE_SHIFT + 0)              |
 *  |1| ... 0 0 0 0 1        | pa >> (PAGE_SHIFT + 1)              |
 *  |2| ... 0 0 0 0 0 1      | pa >> (PAGE_SHIFT + 2)              |
 [..]


> + *
> + * This design stores pages of all sizes (orders) in a single 6-level table.  It
> + * efficiently shares lower table levels, especially due to common zero top
> + * address bits, allowing a single, efficient algorithm to manage all pages.
> + */
> +enum kho_radix_consts {
> +	/* The bit position of a 0-order page, only supports 64bits system */
> +	ORDER_0_LG2 = 64 - PAGE_SHIFT,
> +	/* Bit number used for each level of tables */
> +	TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)),
> +	/* Bit number used for lowest level of bitmaps */
> +	BITMAP_SIZE_LG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE),

"bit number" is a bit confusing when using a lg2 terms..

        /* Size of the table in kho_radix_tree, in lg2 */
	TABLE_SIZE_LG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t))

	/* Number of bits in the kho_bitmap_table, in lg2 */
	BITMAP_SIZE_LG2 = const_ilog2(BITS_PER_BYTE * PAGE_SIZE)

Then use the constants in the structs:

struct kho_radix_tree {
       phys_addr_t table[1 << TABLE_SIZE_LG2];
};
struct kho_bitmap_table {
       unsigned long bitmaps[(1 << BITMAP_SIZE_LG2) / BITS_PER_LONG];
};

> -struct khoser_mem_chunk;
> +static inline phys_addr_t kho_radix_tree_desc(struct kho_radix_tree *va)
> +{
> +	return (phys_addr_t)virt_to_phys(va);
> +}

virt_to_phys already returns phys_addr_t ?

> +static inline struct kho_radix_tree *kho_radix_tree(phys_addr_t desc)
> +{
> +	return (struct kho_radix_tree *)(phys_to_virt(desc));
> +}

phys_to_virt returns void *, no need for a cast

> +static int kho_radix_set_bitmap(struct kho_bitmap_table *bit_tlb,
> +				unsigned long offset)
> +{
> +	if (!bit_tlb ||
> +	    offset >= PAGE_SIZE * BITS_PER_BYTE)
> +		return -EINVAL;

WARN_ON ? These are assertions aren't they?

> +static int kho_radix_walk_trees(struct kho_radix_tree *root, unsigned int level,
> +				unsigned long start,
> +				kho_radix_tree_walk_callback_t cb)
> +{
> +	struct kho_radix_tree *next_tree;
> +	struct kho_bitmap_table *bitmap_table;
> +	unsigned long encoded, i;
> +	unsigned int level_shift;
> +	int err = 0;
> +
> +	for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) {
> +		if (root->table[i]) {


if (!root->table[i])
   continue

Remove a level of indentation here

> +static int __kho_preserve_order(unsigned long pfn, unsigned int order)
> +{
> +	unsigned long pa = PFN_PHYS(pfn);
> +
> +	might_sleep();
> +
> +	return kho_radix_preserve_page(pa, order);

might_sleep should be in kho_radix_preserve_page() ? The might should
be placed around if statements that might avoid a sleep, and that is
not in this function.

>  static void __init kho_mem_deserialize(const void *fdt)
>  {
> +	struct kho_radix_tree *tree_root;
>  	const phys_addr_t *mem;
>  	int len;
>  
> +	/* Retrieve the KHO radix tree from passed-in FDT. */
> +	mem = fdt_getprop(fdt, 0, PROP_PRESERVED_PAGE_RADIX_TREE, &len);
>  
>  	if (!mem || len != sizeof(*mem)) {
> +		pr_err("failed to get preserved KHO memory tree\n");
>  		return;
>  	}
>  
> +	tree_root = *mem ?
> +		(struct kho_radix_tree *)phys_to_virt(*mem) :
> +		NULL;
>  
> +	if (!tree_root)
> +		return;

Seems weird?

if (!*mem)
   return;

tree_root = phys_to_virt(*mem)
kho_radix_walk_trees(tree_root, TREE_MAX_DEPTH - 1, 0,
			     kho_radix_walk_trees_memblock_callback);


This is prettty good now, IMHO

Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ