[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20251009175205.GB3899236@nvidia.com>
Date: Thu, 9 Oct 2025 14:52:05 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jason Miu <jasonmiu@...gle.com>
Cc: Alexander Graf <graf@...zon.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
David Matlack <dmatlack@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Mike Rapoport <rppt@...nel.org>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
> > > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> > > #define PROP_SUB_FDT "fdt"
> >
> > I'de really like to see all of these sorts of definitions in some
> > structured ABI header not open coded all over the place..
>
> Do you think `include/linux/kexec_handover.h` is the appropriate
> place, or would you prefer a new, dedicated ABI header (e.g., in
> `include/uapi/linux/`) for all KHO-related FDT constants?
I would avoid uapi, but maybe Pasha has some
idea.
include/linux/live_update/abi/ ?
> Agreed. Will change `u64` according to Pasha's comment. And we use
> explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
> kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
> believe this, along with the `u64` type, makes it clear that the table
> stores physical addresses.
Well, the macros were intended to automate this and avoid mistakes
from opencoding.. Just keep using them?
> > > + */
> > > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
> >
> > pa is phys_addr_t in the kernel, never unsigned long.
> >
> > If you want to make it all dynamic then this should be phys_addr_t
>
> Should this also be `u64`, or we stay with `phys_addr_t` for all page
> addresses?
you should use phys_addr_t for everything that originates from a
phys_addr_t, and u64 for all the ABI
> > > +{
> > > + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
> >
> > And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> > may not exist.
> >
> > Use an enum ORDER_0_LG2 maybe
>
> I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
> PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
> to understand, what do you think? This constant, along with others,
> will be placed in the enum.
Sure, though I prefer LG2 to BIT_POS
BIT_POS to me implies it is being used as bit wise operation, while
log2 is a mathematical concept
X_lg2 = ilog2(X) && X == 1 << X_lg2
> > > + kho_radix_tree_walk_callback_t cb)
> > > +{
> > > + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > > + struct kho_radix_tree *next_tree;
> > > + unsigned long encoded, i;
> > > + int err = 0;
> > >
> > > + if (level == 1) {
> > > + encoded = offset;
> > > + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > > + encoded, cb);
> >
> > Better to do this in the caller a few lines below
>
> But the caller is in a different tree level? Should we only walk the
> bitmaps at the lowest level?
I mean just have the caller do
if (level-1 ==0)
kho_radix_walk_bitmaps()
else
..
Avoids a function call
> > > + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > > + if (root->table[i]) {
> > > + encoded = offset << level_shift | i;
> >
> > This doesn't seem right..
> >
> > The argument to the walker should be the starting encoded of the table
> > it is about to walk.
> >
> > Since everything always starts at 0 it should always be
> > start | (i << level_shift)
> >
> > ?
>
> You're right that this line might not be immediately intuitive. The
> var `level_shift` (which is constant value 9 here) is applied to the
> *accumulated* `offset` from the parent level. Let's consider an
> example of a preserved page at physical address `0x1000`, which
> encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
> for page 1).
Oh, weird, too weird maybe. I'd just keep all the values as fully
shifted, level_shift should be adjusted to have the full shift for
this level. Easier to understand.
Also, I think the order bits might have become a bit confused, I think
I explained it wrong.
My idea was to try to share the radix levels to save space eg if we
have like this patch does:
Order phys
00001 abcd
00010 0bcd
00100 00cd
01000 000d
Then we don't get too much page level sharing, the middle ends up with
0 indexes in tables that cannot be shared.
What I was going for was to push all the shared pages to the left
00001 abcd
00000 1bcd
00000 01cd
00000 001d
Here the first radix level has index 0 or 1 and is fully shared. So eg
Order 4 and 5 will have all the same 0 index table levels. This also
reduces the max height of the tree because only +1 bit is needed to
store order.
Jason
Powered by blists - more mailing lists