Message-ID: <CAF8kJuMmiBEoaAL=XYcj6Y1qfAVd=Q_s9iT0wi70LJYn6ht42w@mail.gmail.com>
Date: Mon, 28 Apr 2025 16:01:21 -0700
From: Chris Li <chrisl@...nel.org>
To: Andrey Ryabinin <arbn@...dex-team.com>
Cc: linux-kernel@...r.kernel.org, Alexander Graf <graf@...zon.com>, 
	James Gowans <jgowans@...zon.com>, Mike Rapoport <rppt@...nel.org>, 
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, 
	Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H . Peter Anvin" <hpa@...or.com>, Eric Biederman <ebiederm@...ssion.com>, kexec@...ts.infradead.org, 
	Pratyush Yadav <ptyadav@...zon.de>, Jason Gunthorpe <jgg@...dia.com>, 
	Pasha Tatashin <pasha.tatashin@...een.com>, David Rientjes <rientjes@...gle.com>
Subject: Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the
 kernel state across kexec

Hi Andrey,

I am working on the PCI portion of live update and looking at using
KSTATE as an alternative to the FDT. Here is some high-level feedback.

On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@...dex-team.com> wrote:
>
>  Main changes from v1 [1]:
>   - Get rid of abusing crashkernel and implement a proper way to pass memory to the new kernel
>   - Lots of misc cleanups/refactorings.
>
> kstate (kernel state) is a mechanism to describe some part of the internal
> kernel state, save it into memory, and restore the state in the new kernel
> after kexec.
>
> The end goal here and the main use case for this is to be able to
> update the host kernel under VMs with VFIO pass-through devices running
> on that host. Since we are still pretty far from that end goal, this
> only establishes some basic infrastructure to describe and migrate complex
> in-kernel state.
>
> The idea behind KSTATE resembles QEMU's migration framework [1], which
> solves a quite similar problem: migrating the state of a VM and its
> emulated devices across different versions of QEMU.
>
> This is an alternative to Kexec Hand Over (KHO [3]).
>
> So, why not KHO?
>

KHO does more than just serialize/deserialize. It also has scratch
areas etc. so that early allocations can be performed safely without
stepping on the preserved memory. I see KSTATE as an alternative to
libFDT as a way of serializing the preserved memory, not as a
replacement for KHO.

With that in mind, it would be great to see KSTATE built on top of
the current version of KHO. The KHO V6 series uses a recursive FDT
object. I can see a recursive FDT mapping nicely to a C struct
description similar to the KSTATE field description. However, that
will require considerable changes in KSTATE to embrace KHO V6. For
example, KSTATE uses one contiguous stream buffer while KHO V6 uses
many recursive physical-address object pointers for different
objects. Maybe a KSTATE V3? A rough sketch of the mapping I have in
mind follows below.
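Something along these lines; the KSTATE_STRUCT macro and the exact
nesting syntax are my guesses, not the actual patch API, and I have
omitted .id/.version_id/.state_list for brevity:

struct inner {
	int j;
};

struct outer {
	int i;
	struct inner in;
};

/* child node, e.g. /outer/inner in the FDT view */
struct kstate_description inner_state = {
	.name = "inner",
	.fields = (const struct kstate_field[]) {
		KSTATE_BASE_TYPE(j, struct inner, int),
		KSTATE_END_OF_LIST()
	},
};

/* parent node, e.g. /outer in the FDT view */
struct kstate_description outer_state = {
	.name = "outer",
	.fields = (const struct kstate_field[]) {
		KSTATE_BASE_TYPE(i, struct outer, int),
		/* nested struct, assumed KS_STRUCT-style helper */
		KSTATE_STRUCT(in, struct outer, &inner_state),
		KSTATE_END_OF_LIST()
	},
};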


>  - The main reason is that KHO doesn't provide a simple and convenient internal
>     API for drivers/subsystems to preserve internal data.
>     E.g. let's consider that we have some variable of type 'struct a'
>     that needs to be preserved:
>         struct a {
>                 int i;
>                 unsigned long *p_ulong;
>                 char s[10];
>                 struct page *page;
>         };
>
>      The KHO way requires the driver/subsystem to have a bunch of code
>      dealing with FDT stuff, something like:
>
>      a_kho_write()
>      {
>              ...
>              fdt_property(fdt, "i", &a.i, sizeof(a.i));
>              fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
>              fdt_property(fdt, "s", &a.s, sizeof(a.s));
>              if (err)
>              ...
>      }

I can add more pain points of using FDT as the data format for
saving/restoring state. It is not easy to determine up front how much
memory the FDT serialization is going to use. We want to do all the
memory allocation in the KHO PREPARE phase, so that after that phase
there is no KHO failure caused by being unable to allocate memory.
The current KHO V6 also does not handle the case where the recursive
FDT grows beyond a 4K page. That is a feature gap: the PCI subsystem
will likely save state for a list of PCI devices, and the FDT can
easily exceed 4K. The retry dance below illustrates why sizing the
FDT up front is awkward.
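(Userspace-flavored illustration only, not KHO code; the buffer
handling is made up. The point is that you only discover the buffer
is too small when a write returns -FDT_ERR_NOSPACE, and then you have
to reallocate and migrate the tree with fdt_open_into(), i.e.
allocate after the fact.)

#include <libfdt.h>
#include <stdlib.h>

/* Assumes the caller malloc()ed the original buffer. */
static void *setprop_growing(void *fdt, size_t *bufsize, int node,
			     const char *name, const void *val, int len)
{
	for (;;) {
		int err = fdt_setprop(fdt, node, name, val, len);

		if (err != -FDT_ERR_NOSPACE)
			return err ? NULL : fdt;

		/* Too small: allocate a bigger buffer and copy the tree over. */
		size_t newsize = *bufsize * 2;
		void *bigger = malloc(newsize);

		if (!bigger)
			return NULL;
		if (fdt_open_into(fdt, bigger, newsize) < 0) {
			free(bigger);
			return NULL;
		}
		free(fdt);
		fdt = bigger;
		*bufsize = newsize;
	}
}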

FDT also does not save the type of the object buffer, only the size;
there is an implicit contract about what the object points to. The
KSTATE description table can be extended to be more expressive than
FDT, e.g. to cover optional min/max allowed values. Something like
the sketch below.
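(Names here are made up for illustration; this is not the actual
kstate_field layout, just one way the table could carry an explicit
type, a printable name and optional range checks.)

#include <stddef.h>
#include <stdbool.h>

enum kstate_field_type {
	KSTATE_T_BASE,		/* plain value copy */
	KSTATE_T_POINTER,	/* dereference before save/restore */
	KSTATE_T_STRUCT,	/* nested description table */
};

struct kstate_field_ext {
	const char *name;		/* printable name, for dumps */
	enum kstate_field_type type;	/* explicit type, not just a size */
	size_t offset;			/* offsetof() into the containing struct */
	size_t size;			/* sizeof() the field */
	bool has_range;			/* optional validation on restore */
	long min, max;
};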

>      a_kho_restore()
>      {
>              ...
>              a.i = fdt_getprop(fdt, offset, "i", &len);
>              if (!a.i || len != sizeof(a.i))
>                 goto err
>              *a.p_ulong = fdt_getprop....
>      }
>
>     Each driver/subsystem has to solve this problem in its own way.
>     Also, if we use fdt properties for individual fields, that might be wasteful
>     in terms of used memory, as these properties use strings as keys.

Right, I would need to write a lot of boilerplate code to do the
per-property save/restore. I am not worried too much about memory
usage: most string keys are not much longer than 8 bytes, so the
memory saved by converting them to a binary index is not huge. I
would actually suggest adding the string form of the field name to
the description table, so that we can dump KSTATE data just like the
YAML output of an FDT, for debugging purposes. Dumping the currently
saved state in a human-readable form is a very useful feature of FDT,
and KSTATE could add the same feature. A generic dumper could then
look roughly like the sketch below.
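(A minimal sketch only: it assumes the field description grows
->name, ->offset and ->size members and that KSTATE_END_OF_LIST()
leaves a NULL ->name terminator; none of that is the actual patch
API.)

#include <linux/printk.h>

static void kstate_dump(const struct kstate_description *desc, const void *obj)
{
	const struct kstate_field *f;

	pr_info("%s:\n", desc->name);
	for (f = desc->fields; f->name; f++) {
		const void *p = (const char *)obj + f->offset;

		/* Raw hex for now; a real dumper would switch on the field type. */
		pr_info("  %s: %*phN\n", f->name, (int)f->size, p);
	}
}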

>
>    KSTATE solves the same problem in a more elegant way, with this:
>         struct kstate_description a_state = {
>                 .name = "a_struct",
>                 .version_id = 1,
>                 .id = KSTATE_TEST_ID,
>                 .state_list = LIST_HEAD_INIT(a_state.state_list),
>                 .fields = (const struct kstate_field[]) {
>                         KSTATE_BASE_TYPE(i, struct a, int),
>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>                         KSTATE_POINTER(p_ulong, struct a),
>                         KSTATE_PAGE(page, struct a),
>                         KSTATE_END_OF_LIST()
>                 },
>         };
>
>
>         {
>                 static unsigned long ulong;
>                 static struct a a_data = { .p_ulong = &ulong };
>
>                 kstate_register(&a_state, &a_data);
>         }
>
>        The driver only needs to have a proper 'kstate_description' and call kstate_register()
>        to save/restore a_data.
>        Basically, 'struct kstate_description' provides instructions on how to save/restore 'struct a',
>        and kstate_register() does all this save/restore stuff under the hood.

It seems KSTATE uses one contiguous stream, and objects have to be
loaded in the order they were saved. For the PCI code, device
scanning and probing might load devices out of the order in which
they were saved (PCI probing is actually the reverse of the save
order), so kstate_register() might impose restrictions on the restore
order. PCI will need to look up the saved device state by PCI device
ID, and other subsystems will likely need to look up their own saved
state as well. I can see KSTATE being extended to support that, along
the lines of the sketch below.
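(Again just a sketch of the shape of the interface: the ->instance_id
and ->list members, the saved_entries list and kstate_restore_fields()
are all made up, not the actual patch API.)

#include <linux/list.h>
#include <linux/string.h>
#include <linux/errno.h>

static LIST_HEAD(saved_entries);	/* entries parsed from the incoming stream */

/* Assumed helper that decodes one entry according to its description. */
int kstate_restore_fields(const struct kstate_description *desc,
			  const struct kstate_entry *ke, void *obj);

int kstate_restore_by_key(const struct kstate_description *desc,
			  const char *instance_id, void *obj)
{
	struct kstate_entry *ke;

	list_for_each_entry(ke, &saved_entries, list) {
		if (ke->id != desc->id)
			continue;
		if (strcmp(ke->instance_id, instance_id))
			continue;
		return kstate_restore_fields(desc, ke, obj);
	}
	return -ENOENT;
}

/* e.g. PCI could restore by device name regardless of probe order: */
/*	kstate_restore_by_key(&pci_dev_state, pci_name(pdev), pdev); */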

>  - Another bonus point - kstate can preserve migratable memory, which is required
>     to preserve guest memory
>
>
> So now to the part how this works.
>
> State of kernel data (usually it's some struct) is described by the
> 'struct kstate_description' containing the array of individual
> field descriptions - 'struct kstate_field'. Each field
> has a set of bits in ->flags which instruct how to save/restore
> a certain field of the struct. E.g.:
>   - KS_BASE_TYPE flag tells that the field can be just copied by value,
>
>   - KS_POINTER means that the struct member is a pointer to the actual
>      data, so it needs to be dereferenced before saving/restoring data
>      to/from the kstate data stream.
>
>   - KS_STRUCT - contains another struct, field->ksd must point to
>       another 'struct kstate_description'

A field can't have both the KS_BASE_TYPE and KS_STRUCT bits set,
right? Some of these flag combinations do not make sense. This part
might need more careful planning to keep it simple; maybe some of the
flag bits should be an enum instead, as in the sketch below.
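(Made-up names again, just to show the shape: mutually exclusive
kinds in an enum, and a separate flags word for modifiers that can
really be combined.)

#include <linux/bits.h>

enum ks_kind {
	KS_KIND_BASE_TYPE,	/* copy by value */
	KS_KIND_POINTER,	/* dereference before save/restore */
	KS_KIND_STRUCT,		/* nested kstate_description */
	KS_KIND_CUSTOM,		/* ->save()/->restore() callbacks */
};

#define KS_F_ARRAY	BIT(0)	/* element count comes from ->count() */
#define KS_F_ADDRESS	BIT(1)	/* store as offset into vmemmap/linear map */

struct kstate_field_demo {
	enum ks_kind kind;	/* exactly one kind ... */
	unsigned int flags;	/* ... plus zero or more orthogonal modifiers */
};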

>
>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>                ->restore() callbacks to save/restore data.
>
>   - KS_ARRAY_OF_POINTER - array of pointers, the size of the array is determined
>                          by the field->count() callback
>   - KS_ADDRESS - field is a pointer to either the vmemmap area (struct page) or
>      a linear address. The offset is stored.

I think we want to describe different stream container types. For
example, the simplest stream container is just a contiguous buffer
with a start address and a size. A more complex one might be a size
plus an array of page pointers; together those pages make up a buffer
holding a saved KSTATE entry that is larger than 4K and spread across
pages that do not need to be contiguous. Such a page-array buffer
stores a KSTATE entry described by a separate description table.
Roughly like the sketch below.
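(Illustration only; these structs are not part of the patches.)

#include <linux/types.h>

/* Simplest container: one physically contiguous buffer. */
struct kstate_stream_linear {
	phys_addr_t start;
	size_t size;
};

/* Larger than 4K: total size plus an array of non-contiguous pages. */
struct kstate_stream_pages {
	size_t size;		/* total payload size */
	unsigned int nr_pages;
	phys_addr_t pages[];	/* physical addresses of the backing pages */
};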

>
>   - KS_END - special flag indicating the end of migration stream data.
>
> The kstate_register() call accepts a kstate_description along with an instance
> of an object and registers it in the global 'states' list.
>
> During the kexec reboot phase we go through the list of 'kstate_description's,
> and each instance of a kstate_description forms a 'struct kstate_entry'
> which is saved into the kstate data stream.
>
> The 'kstate_entry' contains information like the ID of the kstate_description,
> its version, the size of the migration data and the data itself. The ->data is
> formed in accordance with the kstate_fields of the corresponding kstate_description.

A version number for the kstate_description might not be enough.
Versions work if there is a linear history; here different vendors
are likely to add their own extensions to the device state saving. I
suggest instead that we save the old kernel's kstate_description
table (once per description table, as a recursive object) alongside
the object's physical address. The new kernel has its own, newer
version of the description table; it can compare the old and new
description tables, find out which fields need to be upgraded or
downgraded, and use the old kstate_description table to decode the
previous kernel's saved objects. I think that is more flexible for
adding one or more features, and it is not tied to which version has
which feature. It also makes sure the new kernel can always dump the
old KSTATE into YAML.

That way we might be able to simplify the subsection and deprecation
flags: the new kernel doesn't need to carry the history of changes
made to the old description table. A sketch of the name-based
comparison I have in mind follows below.
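(Sketch only: it assumes the field descriptions carry ->name,
->offset and ->size, and leaves the upgrade/downgrade policy as a
comment; none of this is the actual patch API.)

#include <linux/string.h>

static const struct kstate_field *
find_field(const struct kstate_field *table, const char *name)
{
	for (; table->name; table++)
		if (!strcmp(table->name, name))
			return table;
	return NULL;
}

static void restore_by_diff(const struct kstate_field *old_table, const void *old_obj,
			    const struct kstate_field *new_table, void *new_obj)
{
	const struct kstate_field *nf;

	for (nf = new_table; nf->name; nf++) {
		const struct kstate_field *of = find_field(old_table, nf->name);

		if (of && of->size == nf->size)
			memcpy((char *)new_obj + nf->offset,
			       (const char *)old_obj + of->offset, nf->size);
		/* else: field is new or resized; apply a default or an upgrade hook */
	}
}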

> After the reboot, when kstate_register() is called, it parses the migration
> stream, finds the appropriate 'kstate_entry' and restores the contents of
> the object in accordance with the kstate_description and ->fields.

Again, this restore can happen in a different order than the save
order because of the PCI device scanning and probing order, and the
restoration might not happen in one single call chain. Material for
V3?

I am happy to work with you to get KSTATE working with the existing KHO effort.

Chris
