Message-ID: <CAM_iQpW4--H6wqcx-=O5_PhEOkdrZN52qUhRRZO9xwpMxxLPaw@mail.gmail.com>
Date: Sat, 8 Feb 2025 17:00:15 -0800
From: Cong Wang <xiyou.wangcong@...il.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: Mike Rapoport <rppt@...nel.org>, linux-kernel@...r.kernel.org,
Alexander Graf <graf@...zon.com>, Andrew Morton <akpm@...ux-foundation.org>,
Andy Lutomirski <luto@...nel.org>, Anthony Yznaga <anthony.yznaga@...cle.com>,
Arnd Bergmann <arnd@...db.de>, Ashish Kalra <ashish.kalra@....com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>, Borislav Petkov <bp@...en8.de>,
Catalin Marinas <catalin.marinas@....com>, Dave Hansen <dave.hansen@...ux.intel.com>,
David Woodhouse <dwmw2@...radead.org>, Eric Biederman <ebiederm@...ssion.com>,
Ingo Molnar <mingo@...hat.com>, James Gowans <jgowans@...zon.com>, Jonathan Corbet <corbet@....net>,
Krzysztof Kozlowski <krzk@...nel.org>, Mark Rutland <mark.rutland@....com>,
Paolo Bonzini <pbonzini@...hat.com>, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>, Pratyush Yadav <ptyadav@...zon.de>,
Rob Herring <robh+dt@...nel.org>, Rob Herring <robh@...nel.org>,
Saravana Kannan <saravanak@...gle.com>,
Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>, Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>, Tom Lendacky <thomas.lendacky@....com>,
Usama Arif <usama.arif@...edance.com>, Will Deacon <will@...nel.org>, devicetree@...r.kernel.org,
kexec@...ts.infradead.org, linux-arm-kernel@...ts.infradead.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)

On Sat, Feb 8, 2025 at 4:14 PM Pasha Tatashin <pasha.tatashin@...een.com> wrote:
>
> On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> >
> > Hi Mike,
> >
> > On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@...nel.org> wrote:
> > >
> > > From: "Mike Rapoport (Microsoft)" <rppt@...nel.org>
> > >
> > > Hi,
> > >
> > > > This is the next version of Alex's "kexec: Allow preservation of ftrace
> > > > buffers" series
> > > > (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com).
> > > > To keep things simpler, we decided to preserve "reserve_mem" regions
> > > > instead of ftrace buffers.
> > >
> > > The patches are also available in git:
> > > https://git.kernel.org/rppt/h/kho/v4
> > >
> > >
> > > Kexec today considers itself purely a boot loader: When we enter the new
> > > kernel, any state the previous kernel left behind is irrelevant and the
> > > new kernel reinitializes the system.
> > >
> > > However, there are use cases where this mode of operation is not what we
> > > actually want. In virtualization hosts for example, we want to use kexec
> > > to update the host kernel while virtual machine memory stays untouched.
> > > When we add device assignment to the mix, we also need to ensure that
> > > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > > need to do the same for the PCI subsystem. If we want to kexec while an
> > > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > > context pages and physical memory. See "pkernfs: Persisting guest memory
> > > and kernel/device state safely across kexec" Linux Plumbers
> > > Conference 2023 presentation for details:
> > >
> > > https://lpc.events/event/17/contributions/1485/
> > >
> > > To start us on the journey to support all the use cases above, this patch
> > > implements basic infrastructure to allow hand over of kernel state across
> > > kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> > > memblock's reserve_mem.
> > > With this patch set applied, memory that was reserved using "reserve_mem"
> > > command line options remains intact after kexec and it is guaranteed to
> > > reside at the same physical address.
> >
> > Nice work!
> >
> > One concern is that reserving memory with memblock, the way crashkernel=
> > does, is not flexible. I worked on kdump years ago, and one of its biggest
> > pains was deciding how much memory to reserve with crashkernel=. It is
> > still a pain today.
> >
> > If we reserve more, that would mean more waste for the 1st kernel. If we
> > reserve less, that would induce more OOM for the 2nd kernel.
> >
> > I'd suggest considering CMA, where the "reserved" memory remains usable
> > for other purposes and pages are migrated out of the reserved region on
> > demand, that is, when loading a kexec kernel. Of course, we need to make
> > sure the region is not reused by what you want to preserve here, e.g.,
> > IOMMU state. So it might need additional work, but I still believe this
> > is the right direction.
>
> This is exactly what scratch memory is used for. Unlike crashkernel=,
> the entire scratch area is available to user applications as CMA, as
> we know that no kernel-reserved memory will come from that area. This
> doesn't work for crashkernel=, because in some cases, the user pages
> might also need to be preserved in the crash dump. However, if user
> pages are going to be discarded from the crash dump (as is done 99% of
> the time), then it is better to make crashkernel= use CMA or ZONE_MOVABLE
> as well, so that only the memory occupied by the crash kernel is set
> aside and none is wasted. We have an internal patch at Google that does this,
> and I think it would be a good improvement for the upstream kernel to
> carry as well.

Good to know CMA is already used; I could not tell from the cover letter.

User-space pages need to be preserved in scenarios like RDMA, which pins
user-space pages for DMA transfers. Since the goal here is also to preserve
hardware state such as RDMA's, I guess the same concern remains.

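To make the concern concrete, here is a toy sketch in Python. It is not
kernel code and does not describe how CMA or KHO are implemented; the
names ScratchRegion, Page, allocate and evacuate are all hypothetical.
It only models the assumption that movable pages can be migrated out of
a CMA-like scratch region on demand, while pinned pages cannot.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    pfn: int
    # Assumption of this toy model: a pinned page (e.g. pinned for DMA)
    # cannot be migrated out of the region.
    pinned: bool = False


@dataclass
class ScratchRegion:
    """Toy stand-in for a CMA-like scratch area (hypothetical, not a kernel API)."""
    pages: list = field(default_factory=list)

    def allocate(self, pfn, pinned=False):
        # Place a user page inside the region.
        page = Page(pfn, pinned)
        self.pages.append(page)
        return page

    def evacuate(self):
        # Migrate every movable page out of the region and return the
        # pinned pages that had to stay behind.
        stuck = [p for p in self.pages if p.pinned]
        self.pages = stuck
        return stuck


def demo():
    region = ScratchRegion()
    region.allocate(0x1000)
    region.allocate(0x1001, pinned=True)  # e.g. a buffer pinned for DMA
    region.allocate(0x1002)
    return [hex(p.pfn) for p in region.evacuate()]
```

In this model, any page returned by evacuate() is one that would still
sit in the scratch region when the next kernel wants to use it, which is
the situation I am worried about.
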
Thanks!