[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mafs0tt75p2nx.fsf@amazon.de>
Date: Thu, 3 Apr 2025 17:37:06 +0000
From: Pratyush Yadav <ptyadav@...zon.de>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Changyuan Lyu <changyuanl@...gle.com>, <linux-kernel@...r.kernel.org>,
<graf@...zon.com>, <akpm@...ux-foundation.org>, <luto@...nel.org>,
<anthony.yznaga@...cle.com>, <arnd@...db.de>, <ashish.kalra@....com>,
<benh@...nel.crashing.org>, <bp@...en8.de>, <catalin.marinas@....com>,
<dave.hansen@...ux.intel.com>, <dwmw2@...radead.org>,
<ebiederm@...ssion.com>, <mingo@...hat.com>, <jgowans@...zon.com>,
<corbet@....net>, <krzk@...nel.org>, <rppt@...nel.org>,
<mark.rutland@....com>, <pbonzini@...hat.com>, <pasha.tatashin@...een.com>,
<hpa@...or.com>, <peterz@...radead.org>, <robh+dt@...nel.org>,
<robh@...nel.org>, <saravanak@...gle.com>,
<skinsburskii@...ux.microsoft.com>, <rostedt@...dmis.org>,
<tglx@...utronix.de>, <thomas.lendacky@....com>, <usama.arif@...edance.com>,
<will@...nel.org>, <devicetree@...r.kernel.org>, <kexec@...ts.infradead.org>,
<linux-arm-kernel@...ts.infradead.org>, <linux-doc@...r.kernel.org>,
<linux-mm@...ck.org>, <x86@...nel.org>
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation
On Thu, Apr 03 2025, Jason Gunthorpe wrote:
> On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
>
>> The patch currently has a limitation where it does not free any of the
>> empty tables after a unpreserve operation. But Changyuan's patch also
>> doesn't do it so at least it is not any worse off.
>
> We do we even have unpreserve? Just discard the entire KHO operation
> in a bulk.
Yeah, I guess that makes sense.
>
>> When working on this patch, I realized that kho_mem_deserialize() is
>> currently _very_ slow. It takes over 2 seconds to make memblock
>> reservations for 48 GiB of 0-order pages. I suppose this can later be
>> optimized by teaching memblock_free_all() to skip preserved pages
>> instead of making memblock reservations.
>
> Yes, this was my prior point of not having actual data to know what
> the actual hot spots are.. This saves a few ms on an operation that
> takes over 2 seconds :)
Yes, you're right. But for 2.5 days of work it isn't too shabby :-)
And I think this will help make the 2 seconds much smaller as well later
down the line since we can now find out if a given page is reserved in a
few operations, and do it in parallel.
>
>> +typedef unsigned long khomem_desc_t;
>
> This should be more like:
>
> union {
> void *table;
> phys_addr_t table_phys;
> };
>
> Since we are not using the low bits right now and it is alot cheaper
> to convert from va to phys only once during the final step. __va is
> not exactly fast.
The descriptor is used on _every_ level of the table, not just the top.
So if we use virtual addresses, at serialize time we would have to walk
the whole table and covert all addresses to physical. And __va() does
not seem to be doing too much. On x86, it expands to:
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
and on ARM64 to:
#define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
#define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
So only some addition and bitwise or. Should be fast enough I reckon.
Maybe walking the table once is faster than calculating va every time,
but that walking would happen in the blackout time, and need more data
on whether that optimization is worth it.
>
>> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
>> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
>> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
>> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
>> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
>> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
>> +#define KHOMEM_PFN_MASK PAGE_MASK
>
> This all works better if you just use GENMASK and FIELD_GET
I suppose yes. Though the masks need to be shifted by page order so need
to be careful. Will take a look.
>
>> +static int __khomem_table_alloc(khomem_desc_t *desc)
>> +{
>> + if (khomem_desc_none(*desc)) {
>
> Needs READ_ONCE
ACK, will add.
>
>> +struct kho_mem_track {
>> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
>> + struct xarray orders;
>> +};
>
> I think it would be easy to add a 5th level and just use bits 63:57 as
> a 6 bit order. Then you don't need all this stuff either.
I am guessing you mean to store the order in the table descriptor
itself, instead of having a different table for each order. I don't
think that would work since say you have a level 1 table spanning 128
MiB. You can have pages of different orders in that 128 MiB, and have no
way of knowing which is which. To have all orders in one table, we would
need more than one bit per page at the lowest level.
Though now that I think of it, it is probably much simpler to just use
khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray.
>
>> +int kho_preserve_folio(struct folio *folio)
>> +{
>> + unsigned long pfn = folio_pfn(folio);
>> + unsigned int order = folio_order(folio);
>> + int err;
>> +
>> + if (!kho_enable)
>> + return -EOPNOTSUPP;
>> +
>> + down_read(&kho_out.tree_lock);
>
> This lock still needs to go away
Agree. I hope Changyuan's next version fixes it. I didn't really touch
any of these functions.
>
>> +static void kho_mem_serialize(void)
>> +{
>> + struct kho_mem_track *tracker = &kho_mem_track;
>> + khomem_desc_t *desc;
>> + unsigned long order;
>> +
>> + xa_for_each(&tracker->orders, order, desc) {
>> + if (WARN_ON(order >= NR_PAGE_ORDERS))
>> + break;
>> + kho_out.mem_tables[order] = *desc;
>
> Missing the virt_to_phys?
Nope. This isn't storing the pointer to the descriptor, but the _value_
of the descriptor -- so it already contains the physical address of the
level 4 table.
>
>> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
>> + for (order = 0; order < nr_tables; order++)
>> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
>
> Missing phys_to_virt
Same as above. tables contains the _values_ of the descriptors which
already have physical addresses which we turn into virtual ones in
khomem_table().
>
> Please dont' remove the KHOSER stuff, and do use it with proper
> structs and types. It is part of keeping this stuff understandable.
I didn't see any need for KHOSER stuff here to be honest. The only time
we deal with KHO pointers is with table addresses, and that is already
well abstracted in khomem_mkdesc() (I suppose that can require a
khomem_desc_t * instead of a void *, but beyond that it is quite easy to
understand IMO).
--
Regards,
Pratyush Yadav
Powered by blists - more mailing lists