linux-kernel - Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation

Message-ID: <20250323190758.743798-1-changyuanl@google.com>
Date: Sun, 23 Mar 2025 12:07:58 -0700
From: Changyuan Lyu <changyuanl@...gle.com>
To: jgg@...dia.com
Cc: akpm@...ux-foundation.org, anthony.yznaga@...cle.com, arnd@...db.de, 
	ashish.kalra@....com, benh@...nel.crashing.org, bp@...en8.de, 
	catalin.marinas@....com, changyuanl@...gle.com, corbet@....net, 
	dave.hansen@...ux.intel.com, devicetree@...r.kernel.org, dwmw2@...radead.org, 
	ebiederm@...ssion.com, graf@...zon.com, hpa@...or.com, jgowans@...zon.com, 
	kexec@...ts.infradead.org, krzk@...nel.org, 
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, luto@...nel.org, 
	mark.rutland@....com, mingo@...hat.com, pasha.tatashin@...een.com, 
	pbonzini@...hat.com, peterz@...radead.org, ptyadav@...zon.de, 
	robh+dt@...nel.org, robh@...nel.org, rostedt@...dmis.org, rppt@...nel.org, 
	saravanak@...gle.com, skinsburskii@...ux.microsoft.com, tglx@...utronix.de, 
	thomas.lendacky@....com, will@...nel.org, x86@...nel.org
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory preservation

On Fri, Mar 21, 2025 at 10:46:29 -0300, Jason Gunthorpe <jgg@...dia.com> wrote:
> On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote:
> > +/**
> > + * kho_preserve_folio - preserve a folio across KHO.
> > + * @folio: folio to preserve
> > + *
> > + * Records that the entire folio is preserved across KHO. The order
> > + * will be preserved as well.
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int kho_preserve_folio(struct folio *folio)
> > +{
> > +	unsigned long pfn = folio_pfn(folio);
> > +	unsigned int order = folio_order(folio);
> > +	int err;
> > +
> > +	if (!kho_enable)
> > +		return -EOPNOTSUPP;
> > +
> > +	down_read(&kho_out.tree_lock);
> > +	if (kho_out.fdt) {
>
> What is the lock and fdt test for?

It is to avoid a race between the following 2 operations:
- converting the hashtables and mem tracker to FDT,
- adding new data to the hashtables/mem tracker.
Please also see function kho_finalize() in the previous patch
"kexec: add Kexec HandOver (KHO) generation helpers" [1].

The function kho_finalize() iterates over all the hashtables and
the mem tracker. We want to make sure that during the iterations,
no new data is added to the hashtables and mem tracker.

Also, once the FDT has been generated, the mem tracker has already been
serialized to linked pages, so we return -EBUSY to prevent more data
from being added to the mem tracker.

> I'm getting the feeling that probably kho_preserve_folio() and the
> like should accept some kind of
> 'struct kho_serialization *' and then we don't need this to prove we
> are within a valid serialization window. It could pass the pointer
> through the notifiers

If we use notifiers, callbacks have to be done serially.

> The global variables in this series are sort of ugly..
>
> We want this to be fast, so try hard to avoid a lock..

In most cases we only need a read lock. Different KHO users can add
data into their own subnodes in parallel.
We only need a write lock if
- 2 KHO users register subnodes to the KHO root node at the same time, or
- the KHO root tree is about to be converted to FDT.

> > +void *kho_restore_phys(phys_addr_t phys, size_t size)
> > +{
> > +	unsigned long start_pfn, end_pfn, pfn;
> > +	void *va = __va(phys);
> > +
> > +	start_pfn = PFN_DOWN(phys);
> > +	end_pfn = PFN_UP(phys + size);
> > +
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > +		struct page *page = pfn_to_online_page(pfn);
> > +
> > +		if (!page)
> > +			return NULL;
> > +		kho_restore_page(page);
> > +	}
> > +
> > +	return va;
> > +}
> > +EXPORT_SYMBOL_GPL(kho_restore_phys);
>
> What do you imagine this is used for? I'm not sure what value there is
> in returning a void *? How does the caller "free" this?

This function is also from Mike :)

I suppose some KHO users may still
preserve memory using memory ranges (instead of folios). In the restoring
stage they need a helper to set up the pages of the preserved memory ranges.
A void * is returned so the KHO user can access the memory
contents through the virtual address.
I guess the caller can free the ranges by free_pages()?

It makes sense to return nothing and let the caller call `__va`
if they want. Then the function signature looks more symmetric to
`kho_preserve_phys`.
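
As a side note on the range handling, the PFN arithmetic in
kho_restore_phys() rounds outward to whole pages, so an unaligned range
touches every page it overlaps. A small userspace sketch of that
arithmetic follows; it assumes 4 KiB pages, and the `MOCK_*` / `mock_*`
names are made up for illustration (they mirror PFN_DOWN/PFN_UP but are
not the kernel macros).

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define MOCK_PAGE_SIZE  (UINT64_C(1) << MOCK_PAGE_SHIFT)

/* Round an address down / up to a page frame number. */
#define MOCK_PFN_DOWN(x) ((uint64_t)(x) >> MOCK_PAGE_SHIFT)
#define MOCK_PFN_UP(x) \
	(((uint64_t)(x) + MOCK_PAGE_SIZE - 1) >> MOCK_PAGE_SHIFT)

/* Number of pages the restore loop would visit for [phys, phys + size). */
size_t mock_restore_nr_pages(uint64_t phys, size_t size)
{
	uint64_t start_pfn = MOCK_PFN_DOWN(phys);
	uint64_t end_pfn = MOCK_PFN_UP(phys + size);

	return (size_t)(end_pfn - start_pfn);
}
```

For instance, a 2-byte range that straddles a page boundary covers two
pages under this rounding, which is worth keeping in mind for users
that preserve byte ranges rather than folios.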

> > +#define KHOSER_PTR(type)          \
> > +	union {                   \
> > +		phys_addr_t phys; \
> > +		type ptr;         \
> > +	}
> > +#define KHOSER_STORE_PTR(dest, val)                 \
> > +	({                                          \
> > +		(dest).phys = virt_to_phys(val);    \
> > +		typecheck(typeof((dest).ptr), val); \
> > +	})
> > +#define KHOSER_LOAD_PTR(src) \
> > +	((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL)
>
> I had imagined these macros would be in a header and usably by drivers
> that also want to use structs to carry information.
>

OK I will move them to the header file in the next version.

> > [...]
> > @@ -829,6 +1305,10 @@ static __init int kho_init(void)
> >
> >  	kho_out.root.name = "";
>
> ?

Set the root node name to an empty string since fdt_begin_node
calls strlen on the node name.

It is equivalent to `err = fdt_begin_node(fdt, "")` in kho_serialize()
of Mike's V4 patch [2].

> >  	err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
> > +	err |= kho_add_prop(&kho_out.preserved_memory, "metadata",
> > +			    &kho_out.first_chunk_phys, sizeof(phys_addr_t));
>
> metedata doesn't fee like a great a better name..
>
> Please also document all the FDT schema thoroughly!
>
> There should be yaml files just like in the normal DT case defining
> all of this. This level of documentation and stability was one of the
> selling reasons why FDT is being used here!

YAML files were dropped because we think it may take a while for our
schema to be near stable. So we start from some simple plain text. We
can add some prop and node docs (that are considered stable at this point)
back to YAML in the next version.

[1] https://lore.kernel.org/all/20250320015551.2157511-8-changyuanl@google.com/
[2] https://lore.kernel.org/all/20250206132754.2596694-6-rppt@kernel.org/

Best,
Changyuan