Message-ID: <b684d339-991d-be85-692c-75f21679ca69@intel.com>
Date: Thu, 28 Sep 2023 10:29:32 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
Cc: Baoquan He <bhe@...hat.com>, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, ebiederm@...ssion.com, akpm@...ux-foundation.org,
stanislav.kinsburskii@...il.com, corbet@....net,
linux-kernel@...r.kernel.org, kexec@...ts.infradead.org,
linux-mm@...ck.org, kys@...rosoft.com, jgowans@...zon.com,
wei.liu@...nel.org, arnd@...db.de, gregkh@...uxfoundation.org,
graf@...zon.de, pbonzini@...hat.com
Subject: Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in a "used" state across kexec, as the hypervisor is
>>> unaware of kexec.
>>
>> If Linux can't access them, they're not RAM any more. I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
>
> Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> and passing it over kexec looks like a natural extension.
> Also, adding more state to it doesn't look like a new ABI.
> Or does it?
FDT makes it easier to pass arbitrary data around, but you're still
creating a new "default_pmpool" device tree node on one end and
consuming it on the other. That's a new ABI in my book.
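(Purely for illustration, the consuming side of such a node could look
roughly like the sketch below. The "/default_pmpool" path and the "reg"
property layout are assumptions made up for this example, not what the
patch set actually encodes:)

	/*
	 * Hypothetical consumer of a "default_pmpool" FDT node handed over
	 * across kexec. Node name and property layout are illustrative only.
	 */
	#include <linux/libfdt.h>
	#include <linux/printk.h>

	static int __init pmpool_parse_fdt(const void *fdt)
	{
		const fdt64_t *reg;
		int node, len;
		u64 base, size;

		node = fdt_path_offset(fdt, "/default_pmpool");
		if (node < 0)
			return -ENODEV;	/* previous kernel left no pool behind */

		reg = fdt_getprop(fdt, node, "reg", &len);
		if (!reg || len < 2 * (int)sizeof(u64))
			return -EINVAL;

		base = fdt64_to_cpu(reg[0]);
		size = fdt64_to_cpu(reg[1]);
		pr_info("pmpool: inherited deposited range %#llx-%#llx\n",
			base, base + size - 1);
		return 0;
	}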
> Let me also comment on removing these regions from the memory map. The
> major peculiarity here is that the hypervisor distinguishes between the
> pages deposited for guests to run and the pages deposited for the Linux
> root partition to keep the guest-related portion of hypervisor state in
> the root partition. The latter is the matter in question here.
>
> We can indeed isolate and deposit an excessive amount of memory upfront
> in the hope that the hypervisor will never get into a situation where it
> needs more memory.
> However, that's not reliable, as the amount of memory will always be an
> estimation, depending on the number of expected guests, guest-attached
> devices, etc. And this becomes an even bigger problem when most of the
> memory has already been removed from the memory map to host guest
> partitions.
> It's also not efficient, as the amount of memory required by the
> hypervisor can grow or shrink depending on the use case or host
> configuration, and depositing an excessive amount of memory would be a
> waste.
>
> But, actually, the idea of removing the pages from memory map was
> reflected to some extent in the first version of this proposal,
> so let me elaborate on it a bit.
>
> Effectively, instead of reserving and depositing a lot of memory to the
> hypervisor upfront, the memory can be allocated from kernel memory when
> needed and then returned when no longer used.
> This would still require removing the pages from the memory map upon
> kexec, but that's another problem.
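(For the record, I read that grow-on-demand flow as something roughly like
the sketch below. The hv_deposit_range()/hv_withdraw_range() helpers are
placeholders for whatever hypercall wrappers actually exist; none of this
is meant as the series' real API:)

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Hypothetical wrappers around the deposit/withdraw hypercalls. */
	int hv_deposit_range(unsigned long pfn, unsigned int nr_pages);
	int hv_withdraw_range(unsigned long pfn, unsigned int nr_pages);

	static int pmpool_grow(unsigned int nr_pages)
	{
		struct page *page;
		unsigned int i;

		for (i = 0; i < nr_pages; i++) {
			/* Take an ordinary page from the buddy allocator... */
			page = alloc_page(GFP_KERNEL | __GFP_ZERO);
			if (!page)
				return -ENOMEM;

			/*
			 * ...and hand it to the hypervisor. From here on the
			 * page must not be touched by Linux and has to stay
			 * "deposited" across kexec.
			 */
			if (hv_deposit_range(page_to_pfn(page), 1)) {
				__free_page(page);
				return -EIO;
			}
		}
		return 0;
	}

	static void pmpool_shrink(unsigned long pfn, unsigned int nr_pages)
	{
		unsigned int i;

		/*
		 * Ask the hypervisor to give the pages back, then return them
		 * to the allocator so they become usable RAM again.
		 */
		if (hv_withdraw_range(pfn, nr_pages))
			return;

		for (i = 0; i < nr_pages; i++)
			__free_page(pfn_to_page(pfn + i));
	}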
Let's distill this down a bit.
I agree that it's a waste to reserve an obscene amount of memory up
front for all guests for rare cases. Having the amount of consumed
memory grow is a nice feature.
You can also quite easily *shrink* the amount of memory on a given
kernel without new code. Right?
The problem comes when you've grown the footprint of hypervisor-donated
memory, kexec, and *THEN* want to shrink it. That's what needs new
metadata to be communicated over to the new kernel.
1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Shrink the deposited memory
Right?
That's where you lose me.
Can't the deposited memory just be shrunk before kexec? Surely there
aren't a bunch of pathological things consuming that memory right before
kexec, which is basically a reboot.
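(A minimal sketch of that idea, assuming a reboot notifier is an acceptable
hook, since the ordinary non-crash kexec path runs the reboot notifier
chain. pmpool_shrink_all() is a hypothetical helper standing in for
whatever returns the deposited pages:)

	#include <linux/init.h>
	#include <linux/notifier.h>
	#include <linux/reboot.h>

	/* Hypothetical helper that withdraws everything the hypervisor holds. */
	void pmpool_shrink_all(void);

	static int pmpool_reboot_cb(struct notifier_block *nb,
				    unsigned long action, void *data)
	{
		/* Give all deposited pages back before the next kernel boots. */
		pmpool_shrink_all();
		return NOTIFY_OK;
	}

	static struct notifier_block pmpool_reboot_nb = {
		.notifier_call = pmpool_reboot_cb,
	};

	static int __init pmpool_init(void)
	{
		return register_reboot_notifier(&pmpool_reboot_nb);
	}
	late_initcall(pmpool_init);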