Message-ID: <20250403161001.GG342109@nvidia.com>
Date: Thu, 3 Apr 2025 13:10:01 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Pratyush Yadav <ptyadav@...zon.de>
Cc: Changyuan Lyu <changyuanl@...gle.com>, linux-kernel@...r.kernel.org,
graf@...zon.com, akpm@...ux-foundation.org, luto@...nel.org,
anthony.yznaga@...cle.com, arnd@...db.de, ashish.kalra@....com,
benh@...nel.crashing.org, bp@...en8.de, catalin.marinas@....com,
dave.hansen@...ux.intel.com, dwmw2@...radead.org,
ebiederm@...ssion.com, mingo@...hat.com, jgowans@...zon.com,
corbet@....net, krzk@...nel.org, rppt@...nel.org,
mark.rutland@....com, pbonzini@...hat.com,
pasha.tatashin@...een.com, hpa@...or.com, peterz@...radead.org,
robh+dt@...nel.org, robh@...nel.org, saravanak@...gle.com,
skinsburskii@...ux.microsoft.com, rostedt@...dmis.org,
tglx@...utronix.de, thomas.lendacky@....com,
usama.arif@...edance.com, will@...nel.org,
devicetree@...r.kernel.org, kexec@...ts.infradead.org,
linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory
preservation
On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote:
> The patch currently has a limitation where it does not free any of
> the empty tables after an unpreserve operation. But Changyuan's patch
> also doesn't do it, so at least it is not any worse off.
Why do we even have unpreserve? Just discard the entire KHO operation
in bulk.
> When working on this patch, I realized that kho_mem_deserialize() is
> currently _very_ slow. It takes over 2 seconds to make memblock
> reservations for 48 GiB of 0-order pages. I suppose this can later be
> optimized by teaching memblock_free_all() to skip preserved pages
> instead of making memblock reservations.
Yes, this was my prior point about not having actual data to know
where the actual hot spots are... This saves a few ms on an operation
that takes over 2 seconds :)
> +typedef unsigned long khomem_desc_t;
This should be more like:
union {
        void *table;
        phys_addr_t table_phys;
};
Since we are not using the low bits right now, and it is a lot cheaper
to convert from va to phys only once during the final step. __va() is
not exactly fast.
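Untested sketch of what I mean, keeping your khomem_desc_t name
(khomem_desc_seal() is a made-up helper name):

typedef union {
        void *table;            /* kernel pointer while building */
        phys_addr_t table_phys; /* valid once serialized */
} khomem_desc_t;

/* The single va->phys conversion per table, done at the final step */
static inline void khomem_desc_seal(khomem_desc_t *desc)
{
        desc->table_phys = virt_to_phys(desc->table);
}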
> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long))
> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE)
> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1)
> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT)
> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS))
> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL))
> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL))
> +#define KHOMEM_PFN_MASK PAGE_MASK
This all works better if you just use GENMASK and FIELD_GET
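Untested sketch; the bit positions below assume 4k pages (a 15 bit L1
bitmap index, 9 bit table indexes above that) and are only
illustrative:

#include <linux/bits.h>
#include <linux/bitfield.h>

#define KHOMEM_L1       GENMASK_ULL(26, 12)     /* bit in L1 bitmap */
#define KHOMEM_L2       GENMASK_ULL(35, 27)     /* L2 table index */
#define KHOMEM_L3       GENMASK_ULL(44, 36)     /* L3 table index */
#define KHOMEM_L4       GENMASK_ULL(53, 45)     /* L4 table index */

static inline unsigned int khomem_l3_index(phys_addr_t phys)
{
        return FIELD_GET(KHOMEM_L3, phys);
}

Then all the shift/mask arithmetic collapses into the mask
definitions.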
> +static int __khomem_table_alloc(khomem_desc_t *desc)
> +{
> + if (khomem_desc_none(*desc)) {
Needs READ_ONCE
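i.e. the unlocked test should load the descriptor exactly once:

        if (khomem_desc_none(READ_ONCE(*desc))) {
                /* ... allocate and publish the new table ... */
        }

Otherwise the compiler is free to tear or re-read the racy load.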
> +struct kho_mem_track {
> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */
> + struct xarray orders;
> +};
I think it would be easy to add a 5th level and just use bits 63:57 as
a 7 bit order. Then you don't need all this stuff either.
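Untested sketch, continuing the GENMASK style from above (the field
placement is illustrative, and khomem_key() is a made-up name):

#define KHOMEM_ORDER    GENMASK_ULL(63, 57)

static inline u64 khomem_key(phys_addr_t phys, unsigned int order)
{
        return phys | FIELD_PREP(KHOMEM_ORDER, (u64)order);
}

A single 5 level tree keyed like this replaces the per-order xarray.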
> +int kho_preserve_folio(struct folio *folio)
> +{
> + unsigned long pfn = folio_pfn(folio);
> + unsigned int order = folio_order(folio);
> + int err;
> +
> + if (!kho_enable)
> + return -EOPNOTSUPP;
> +
> + down_read(&kho_out.tree_lock);
This lock still needs to go away
> +static void kho_mem_serialize(void)
> +{
> + struct kho_mem_track *tracker = &kho_mem_track;
> + khomem_desc_t *desc;
> + unsigned long order;
> +
> + xa_for_each(&tracker->orders, order, desc) {
> + if (WARN_ON(order >= NR_PAGE_ORDERS))
> + break;
> + kho_out.mem_tables[order] = *desc;
Missing the virt_to_phys?
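i.e. something like:

        kho_out.mem_tables[order] = virt_to_phys((void *)*desc);

Or, with the union earlier, seal the descriptor once and store
table_phys.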
> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS);
> + for (order = 0; order < nr_tables; order++)
> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order,
Missing phys_to_virt
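With your current typedef, something like (sketch, arguments elided as
in your hunk):

        khomem_desc_t desc = (unsigned long)phys_to_virt(tables[order]);

        khomem_walk_preserved(&desc, order, ...);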
Please don't remove the KHOSER stuff, and do use it with proper
structs and types. It is part of keeping this stuff understandable.
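Roughly this kind of pattern (simplified sketch of what the earlier
revisions carried):

#define KHOSER_PTR(type)                \
        union {                         \
                phys_addr_t phys;       \
                type ptr;               \
        }

#define KHOSER_STORE_PTR(dest, val)     \
        ((dest).phys = virt_to_phys(val))

#define KHOSER_LOAD_PTR(src)            \
        ((typeof((src).ptr))((src).phys ? \
                phys_to_virt((src).phys) : NULL))

The typed union makes it obvious in the struct layout which fields are
serialized physical pointers.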
Jason