Message-ID: <b684d339-991d-be85-692c-75f21679ca69@intel.com>
Date: Thu, 28 Sep 2023 10:29:32 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
Cc: Baoquan He <bhe@...hat.com>, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, ebiederm@...ssion.com, akpm@...ux-foundation.org,
stanislav.kinsburskii@...il.com, corbet@....net,
linux-kernel@...r.kernel.org, kexec@...ts.infradead.org,
linux-mm@...ck.org, kys@...rosoft.com, jgowans@...zon.com,
wei.liu@...nel.org, arnd@...db.de, gregkh@...uxfoundation.org,
graf@...zon.de, pbonzini@...hat.com
Subject: Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in a "used" state across kexec, as the hypervisor is
>>> unaware of kexec.
>>
>> If Linux can't access them, they're not RAM any more. I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
>
> Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> and passing it over kexec looks like a natural extension.
> Also, adding more state to it doesn't look like a new ABI.
> Or does it?
FDT makes it easier to pass arbitrary data around, but you're still
creating a new "default_pmpool" device tree node on one end and
consuming it on the other. That's a new ABI in my book.
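(Purely for illustration, the consuming side of such a node could look
roughly like the sketch below. The "/default_pmpool" path and the "reg"
property layout are assumptions made up for this example, not what the
patch set actually encodes:)

	/*
	 * Hypothetical consumer of a "default_pmpool" FDT node handed over
	 * across kexec. Node name and property layout are illustrative only.
	 */
	#include <linux/libfdt.h>
	#include <linux/printk.h>

	static int __init pmpool_parse_fdt(const void *fdt)
	{
		const fdt64_t *reg;
		int node, len;
		u64 base, size;

		node = fdt_path_offset(fdt, "/default_pmpool");
		if (node < 0)
			return -ENODEV;	/* previous kernel left no pool behind */

		reg = fdt_getprop(fdt, node, "reg", &len);
		if (!reg || len < 2 * (int)sizeof(u64))
			return -EINVAL;

		base = fdt64_to_cpu(reg[0]);
		size = fdt64_to_cpu(reg[1]);
		pr_info("pmpool: inherited deposited range %#llx-%#llx\n",
			base, base + size - 1);
		return 0;
	}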
> Let me also comment on removing these regions from the memory map. The
> major peculiarity here is that the hypervisor distinguishes between the
> pages deposited for guests to run and the pages deposited for the Linux
> root partition to keep the guest-related portion of hypervisor state in
> the root partition. The latter is the matter in question here.
>
> We can indeed isolate and deposit an excessive amount of memory upfront
> in the hope that the hypervisor will never get into a situation where it
> needs more memory.
> However, that's not reliable, as the amount of memory will always be an
> estimation, depending on the number of expected guests, guest-attached
> devices, etc. And this becomes an even bigger problem when most of the
> memory has already been removed from the memory map to host guest
> partitions.
> It's also not efficient, as the amount of memory required by the
> hypervisor can grow or shrink depending on the use case or host
> configuration, and depositing an excessive amount of memory would be a
> waste.
>
> But, actually, the idea of removing the pages from memory map was
> reflected to some extent in the first version of this proposal,
> so let me elaborate on it a bit.
>
> Effectively, instead of reserving and depositing a lot of memory to the
> hypervisor upfront, the memory can be allocated from kernel memory when
> needed and then returned when no longer used.
> This would still require removing the pages from the memory map upon
> kexec, but that's another problem.
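(For the record, I read that grow-on-demand flow as something roughly like
the sketch below. The hv_deposit_range()/hv_withdraw_range() helpers are
placeholders for whatever hypercall wrappers actually exist; none of this
is meant as the series' real API:)

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Hypothetical wrappers around the deposit/withdraw hypercalls. */
	int hv_deposit_range(unsigned long pfn, unsigned int nr_pages);
	int hv_withdraw_range(unsigned long pfn, unsigned int nr_pages);

	static int pmpool_grow(unsigned int nr_pages)
	{
		struct page *page;
		unsigned int i;

		for (i = 0; i < nr_pages; i++) {
			/* Take an ordinary page from the buddy allocator... */
			page = alloc_page(GFP_KERNEL | __GFP_ZERO);
			if (!page)
				return -ENOMEM;

			/*
			 * ...and hand it to the hypervisor. From here on the
			 * page must not be touched by Linux and has to stay
			 * "deposited" across kexec.
			 */
			if (hv_deposit_range(page_to_pfn(page), 1)) {
				__free_page(page);
				return -EIO;
			}
		}
		return 0;
	}

	static void pmpool_shrink(unsigned long pfn, unsigned int nr_pages)
	{
		unsigned int i;

		/*
		 * Ask the hypervisor to give the pages back, then return them
		 * to the allocator so they become usable RAM again.
		 */
		if (hv_withdraw_range(pfn, nr_pages))
			return;

		for (i = 0; i < nr_pages; i++)
			__free_page(pfn_to_page(pfn + i));
	}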
Let's distill this down a bit.
I agree that it's a waste to reserve an obscene amount of memory up
front for all guests for rare cases. Having the amount of consumed
memory grow is a nice feature.
You can also quite easily *shrink* the amount of memory on a given
kernel without new code. Right?
The problem comes when you've grown the footprint of hypervisor-donated
memory, kexec, and *THEN* want to shrink it. That's what needs new
metadata to be communicated over to the new kernel.
1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Shrink the deposited memory
Right?
That's where you lose me.
Can't the deposited memory just be shrunk before kexec? Surely there
aren't a bunch of pathological things consuming that memory right before
kexec, which is basically a reboot.
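(A minimal sketch of that idea, assuming a reboot notifier is an acceptable
hook, since the ordinary non-crash kexec path runs the reboot notifier
chain. pmpool_shrink_all() is a hypothetical helper standing in for
whatever returns the deposited pages:)

	#include <linux/init.h>
	#include <linux/notifier.h>
	#include <linux/reboot.h>

	/* Hypothetical helper that withdraws everything the hypervisor holds. */
	void pmpool_shrink_all(void);

	static int pmpool_reboot_cb(struct notifier_block *nb,
				    unsigned long action, void *data)
	{
		/* Give all deposited pages back before the next kernel boots. */
		pmpool_shrink_all();
		return NOTIFY_OK;
	}

	static struct notifier_block pmpool_reboot_nb = {
		.notifier_call = pmpool_reboot_cb,
	};

	static int __init pmpool_init(void)
	{
		return register_reboot_notifier(&pmpool_reboot_nb);
	}
	late_initcall(pmpool_init);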