[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAHN2nPJbXeSzLuznWcV+vo80rtk5odd+2GDW_NVDUGPG1KO-Gg@mail.gmail.com>
Date: Tue, 21 Oct 2025 17:59:29 -0700
From: Jason Miu <jasonmiu@...gle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Alexander Graf <graf@...zon.com>, Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Changyuan Lyu <changyuanl@...gle.com>,
David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>,
Mike Rapoport <rppt@...nel.org>, Pasha Tatashin <pasha.tatashin@...een.com>,
Pratyush Yadav <pratyush@...nel.org>, kexec@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 1/3] kho: Adopt KHO radix tree data structures
Thanks Jason, I uploaded the patch v2 according to your feedback.
On Thu, Oct 9, 2025 at 10:52 AM Jason Gunthorpe <jgg@...dia.com> wrote:
>
> > > > -#define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
> > > > +#define PROP_PRESERVED_PAGE_RADIX_TREE "preserved-page-radix-tree"
> > > > #define PROP_SUB_FDT "fdt"
> > >
> > > I'de really like to see all of these sorts of definitions in some
> > > structured ABI header not open coded all over the place..
> >
> > Do you think `include/linux/kexec_handover.h` is the appropriate
> > place, or would you prefer a new, dedicated ABI header (e.g., in
> > `include/uapi/linux/`) for all KHO-related FDT constants?
>
> I would avoid uapi, but maybe Pasha has some
> idea.
>
> include/linux/live_update/abi/ ?
Yes, moved to include/linux/live_update/abi/.
>
> > Agreed. Will change `u64` according to Pasha's comment. And we use
> > explicit casts like `(u64)virt_to_phys(new_tree)` and `(struct
> > kho_radix_tree *)phys_to_virt(table_entry)` in the current series. I
> > believe this, along with the `u64` type, makes it clear that the table
> > stores physical addresses.
>
> Well, the macros were intended to automate this and avoid mistakes
> from opencoding.. Just keep using them?
>
Sure, added two inline functions `kho_radix_tree_desc()` and
`kho_radix_tree()` back for converting.
> > > > + */
> > > > +static unsigned long kho_radix_encode(unsigned long pa, unsigned int order)
> > >
> > > pa is phys_addr_t in the kernel, never unsigned long.
> > >
> > > If you want to make it all dynamic then this should be phys_addr_t
> >
> > Should this also be `u64`, or we stay with `phys_addr_t` for all page
> > addresses?
>
> you should use phys_addr_t for everything that originates from a
> phys_addr_t, and u64 for all the ABI
>
done
> > > > +{
> > > > + unsigned long h = 1UL << (BITS_PER_LONG - PAGE_SHIFT - order);
> > >
> > > And this BITS_PER_LONG is confused, it is BITS_PER_PHYS_ADDR_T which
> > > may not exist.
> > >
> > > Use an enum ORDER_0_LG2 maybe
> >
> > I prefer `KHO_RADIX_ORDER_0_BIT_POS` (defined as `BITS_PER_LONG -
> > PAGE_SHIFT`) over `ORDER_0_LG2`, as I think the latter is a bit hard
> > to understand, what do you think? This constant, along with others,
> > will be placed in the enum.
>
> Sure, though I prefer LG2 to BIT_POS
Lets pick LG2. =)
>
> BIT_POS to me implies it is being used as bit wise operation, while
> log2 is a mathematical concept
>
> X_lg2 = ilog2(X) && X == 1 << X_lg2
>
> > > > + kho_radix_tree_walk_callback_t cb)
> > > > +{
> > > > + int level_shift = ilog2(PAGE_SIZE / sizeof(unsigned long));
> > > > + struct kho_radix_tree *next_tree;
> > > > + unsigned long encoded, i;
> > > > + int err = 0;
> > > >
> > > > + if (level == 1) {
> > > > + encoded = offset;
> > > > + return kho_radix_walk_bitmaps((struct kho_bitmap_table *)root,
> > > > + encoded, cb);
> > >
> > > Better to do this in the caller a few lines below
> >
> > But the caller is in a different tree level? Should we only walk the
> > bitmaps at the lowest level?
>
> I mean just have the caller do
>
> if (level-1 ==0)
> kho_radix_walk_bitmaps()
> else
> ..
>
> Avoids a function call
I see. Done.
>
> > > > + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
> > > > + if (root->table[i]) {
> > > > + encoded = offset << level_shift | i;
> > >
> > > This doesn't seem right..
> > >
> > > The argument to the walker should be the starting encoded of the table
> > > it is about to walk.
> > >
> > > Since everything always starts at 0 it should always be
> > > start | (i << level_shift)
> > >
> > > ?
> >
> > You're right that this line might not be immediately intuitive. The
> > var `level_shift` (which is constant value 9 here) is applied to the
> > *accumulated* `offset` from the parent level. Let's consider an
> > example of a preserved page at physical address `0x1000`, which
> > encodes to `0x10000000000001` (bit 52 is set for order 0, bit 0 is set
> > for page 1).
>
> Oh, weird, too weird maybe. I'd just keep all the values as fully
> shifted, level_shift should be adjusted to have the full shift for
> this level. Easier to understand.
>
> Also, I think the order bits might have become a bit confused, I think
> I explained it wrong.
>
> My idea was to try to share the radix levels to save space eg if we
> have like this patch does:
>
> Order phys
> 00001 abcd
> 00010 0bcd
> 00100 00cd
> 01000 000d
>
> Then we don't get too much page level sharing, the middle ends up with
> 0 indexes in tables that cannot be shared.
>
> What I was going for was to push all the shared pages to the left
>
> 00001 abcd
> 00000 1bcd
> 00000 01cd
> 00000 001d
>
> Here the first radix level has index 0 or 1 and is fully shared. So eg
> Order 4 and 5 will have all the same 0 index table levels. This also
> reduces the max height of the tree because only +1 bit is needed to
> store order.
>
> Jason
Thanks for the clarification. I updated the logic by keeping the
encoded value fully shifted and adjusting the `level_shift` according
to the current level.
And yes we are having the shared pages on the left side (zeros in the
encoded prefix) while having the order bits shift to right when the
page order increases. I hope the updated code makes this more clearer.
--
Jason Miu
Powered by blists - more mailing lists