Message-ID: <CA+CK2bA_Qb9csWvEQb-zpxgMg7vy+gw9eh0z88QBEdiFdtopMQ@mail.gmail.com>
Date: Fri, 24 Oct 2025 09:57:24 -0400
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: Pratyush Yadav <pratyush@...nel.org>, akpm@...ux-foundation.org, brauner@...nel.org, 
	corbet@....net, graf@...zon.com, linux-kernel@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, linux-mm@...ck.org, masahiroy@...nel.org, 
	ojeda@...nel.org, rdunlap@...radead.org, rppt@...nel.org, tj@...nel.org, 
	jasonmiu@...gle.com, dmatlack@...gle.com, skhawaja@...gle.com, 
	glider@...gle.com, elver@...gle.com
Subject: Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the
 buddy allocator

On Fri, Oct 24, 2025 at 9:25 AM Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
> > On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@...nel.org> wrote:
> > >
> > > +Cc Marco, Alexander
> > >
> > > On Wed, Oct 15 2025, Pasha Tatashin wrote:
> > >
> > > > KHO allocates metadata for its preserved memory map using the SLUB
> > > > allocator via kzalloc(). This metadata is temporary and is used by the
> > > > next kernel during early boot to find preserved memory.
> > > >
> > > > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > > > randomly intercepted by kfence_alloc(), which services the allocation
> > > > from a dedicated KFENCE memory pool. This pool is allocated early in
> > > > boot via memblock.
> > >
> > > At some point, we'd probably want to add support for preserving slab
> > > objects using KHO. That wouldn't work if the objects can land in scratch
> > > memory. Right now, the kfence pools are allocated right before KHO goes
> > > out of scratch-only and memblock frees pages to buddy.
> >
> > If we do that, we will most likely add a GFP flag to go with it, so
> > that the slab can use a special pool of preservable pages. Otherwise,
> > we would be leaking memory from the old kernel in the unpreserved
> > parts of the pages.
>
> That isn't an issue. If we make slab preservable then we'd have to
> preserve the page and then somehow record what order is stored in that
> page and a bitmap of which parts are allocated to restore the slab
> state on recovery.
>
> So long as the non-preserved memory comes back as freed on the
> successor kernel it doesn't matter what was in it in the preceding
> kernel. The new kernel will eventually zero it. So it isn't a 'leak'.

Hi Jason,

I agree, it's not a "leak" in the traditional sense, as we trust the
successor kernel to manage its own memory.

However, my concern is that without a dedicated GFP flag, this
partial-page preservation model becomes fragile and inefficient, and
creates a data exposure risk.

You're right that the new kernel will eventually zero memory, but KHO
preserves at page granularity. If we preserve a single slab object,
the entire page is handed off. When the new kernel maps that page
(e.g., to userspace) to access the preserved object, it also exposes
the unpreserved portions of that same page. Those portions contain
stale data from the old kernel and won't have been zeroed yet,
creating an easy-to-miss data leak vector. This makes the API very
error-prone.
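
To make the hazard concrete, here is a rough sketch
(kho_preserve_page() is an illustrative name, not the actual KHO API):

	/* A 256-byte object shares its 4 KiB slab page with ~15 others. */
	void *obj = kzalloc(256, GFP_KERNEL);
	struct page *page = virt_to_page(obj);

	/*
	 * Page granularity is all the preservation layer can express,
	 * so the stale contents of obj's neighbors travel to the
	 * successor kernel along with obj.
	 */
	kho_preserve_page(page);	/* illustrative name */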

There's also the inefficiency. The unpreserved parts of that page are
unusable by the new kernel until the preserved object is freed.
Depending on the use case, that object might live for the entire
kernel lifetime, effectively wasting that memory. This waste could
then accumulate with each subsequent live update.
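
For example, with 4 KiB pages and a kmalloc-256 slab, one long-lived
preserved object pins the remaining 4096 - 256 = 3840 bytes, roughly
94% of the page, for as long as that object lives.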

Trying to create a special KHO slab cache isn't a solution either,
since slab caches are often merged.

As I see it, the only robust solution is a special GFP flag. It would
force these allocations to come from a dedicated pool of pages that
are fully preserved, with no partially used or mixed-use pages, and
that can be restored as slab pages in the successor kernel.
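
Roughly what I have in mind (the flag name is invented here, purely a
sketch):

	/* Hypothetical: route this allocation to a preservable pool. */
	#define __GFP_LIVEUPDATE	((__force gfp_t)___GFP_LIVEUPDATE)

	void *meta = kzalloc(256, GFP_KERNEL | __GFP_LIVEUPDATE);

The slab would back such allocations only with pages from that pool,
so preserving an object never drags unrelated data along, and the
successor kernel could re-adopt the whole page as a slab page.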

That said, I'm not sure preserving individual slab objects is a high
priority right now. It might be simpler to avoid it altogether.

Pasha
