linux-kernel - Re: [PATCH] kexec_core: Accept unaccepted kexec destination addresses

Message-ID: <Z0lWkrsXSpDVfW72@yzhao56-desk.sh.intel.com>
Date: Fri, 29 Nov 2024 13:52:18 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Baoquan He <bhe@...hat.com>
CC: "Eric W. Biederman" <ebiederm@...ssion.com>, "Kirill A. Shutemov"
	<kirill@...temov.name>, <kexec@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>, <linux-coco@...ts.linux.dev>,
	<x86@...nel.org>, <rick.p.edgecombe@...el.com>,
	<kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH] kexec_core: Accept unaccepted kexec destination addresses

On Thu, Nov 28, 2024 at 11:19:20PM +0800, Baoquan He wrote:
> On 11/27/24 at 06:01pm, Yan Zhao wrote:
> > On Tue, Nov 26, 2024 at 07:38:05PM +0800, Baoquan He wrote:
> > > On 10/24/24 at 08:15am, Yan Zhao wrote:
> > > > On Wed, Oct 23, 2024 at 10:44:11AM -0500, Eric W. Biederman wrote:
> > > > > "Kirill A. Shutemov" <kirill@...temov.name> writes:
> > > > > 
> > > > > > Waiting minutes to get VM booted to shell is not feasible for most
> > > > > > deployments. Lazy is sane default to me.
> > > > > 
> > > > > Huh?
> > > > > 
> > > > > Unless my guesses about what is happening are wrong lazy is hiding
> > > > > a serious implementation deficiency.  From all hardware I have seen
> > > > > taking minutes is absolutely ridiculous.
> > > > > 
> > > > > Does writing to all of memory at full speed take minutes?  How can such
> > > > > a system be functional?
> > > > > 
> > > > > If you don't actually have to write to the pages and it is just some
> > > > > accounting function it is even more ridiculous.
> > > > > 
> > > > > 
> > > > > I had previously thought that accept_memory was the firmware call.
> > > > > Now that I see that it is just a wrapper for some hardware specific
> > > > > calls I am even more perplexed.
> > > > > 
> > > > > 
> > > > > Quite honestly what this looks like to me is that someone failed to
> > > > > enable write-combining or write-back caching when writing to memory
> > > > > when initializing the protected memory.  With the result that everything
> > > > > > is moving dog slow, and people are introducing complexity left and right
> > > > > to avoid that bad implementation.
> > > > > 
> > > > > 
> > > > > Can someone please explain to me why this accept_memory stuff has to be
> > > > > > slow, why it has to take minutes to do its job.
> > > > > This kexec patch is a fix for a guest (TD)'s kexec failure.
> > > > 
> > > > For a linux guest, the accept_memory() happens before the guest accesses a page.
> > > > It will (if the guest is a TD)
> > > > (1) trigger the host to allocate the physical page on host to map the accessed
> > > >     guest page, which might be slow with wait and sleep involved, depending on
> > > >     the memory pressure on host.
> > > > > (2) initialize the protected page.
> > > > 
> > > > > Actually, most guest memory is never accessed by the guest during its life
> > > > > cycle. accept_memory() may cause the host to commit a never-to-be-used page,
> > > > > with the host physical page not even being able to get swapped out.
> > > 
> > > > So this sounds to me more like a business requirement on cloud platforms,
> > > > e.g. if a customer books a guest instance with 60G of memory but only ever
> > > > uses 20G at most, then the other 40G can be saved to reduce pressure on
> > > > the host.
> > Yes.
> 
> That's very interesting, thanks for confirming.
> 
> > 
> > > I could be shallow, just a wild guess.
> > > If my guess is right, at least those cloud service providers must like this
> > > accept_memory feature very much.
> > > 
> > > > 
> > > > That's why we need a lazy accept, which does not accept_memory() until after a
> > > > page is allocated by the kernel (in alloc_page(s)).
> > > 
> > > By the way, I have two questions, maybe very shallow.
> > > 
> > > 1) why can't we only find those already accepted memory to put kexec
> > > kernel/initrd/bootparam/purgatory?
> > 
> > Currently, the first kernel only accepts memory during the memory allocation in
> > a lazy accept mode. Besides reducing boot time, it's also good for memory
> > over-commitment as you mentioned above.
> > 
> > My understanding of why the memory for the kernel/initrd/bootparam/purgatory is
> > not allocated from the first kernel is that this memory usually needs to be
> > physically contiguous. Since this memory will not be used by the first kernel,
> > looking up from free RAM has a lower chance of failure compared to allocating it
> 
> Well, there could be a misunderstanding here. The final loaded position of
> kernel/initrd/bootparam/purgatory is not searched from free RAM, it's
Oh, by free RAM, I mean system RAM that is marked as
IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY, but not marked as
IORESOURCE_SYSRAM_DRIVER_MANAGED.


> just from RAM on x86. That means it may already have been allocated and be in
> use by another component of the 1st kernel. Unlike kdump, the 2nd kernel of
Yes, it's entirely possible that the destination address being searched out has
already been allocated and is in use by the 1st kernel. e.g. for
KEXEC_TYPE_DEFAULT, the source page for each segment is allocated from the 1st
kernel, and it is allowed to have the same address as its corresponding
destination address.

However, it's not guaranteed that the destination address must have been
allocated by the 1st kernel.

> kexec reboot doesn't care about 1st kernel's memory usage. We will copy
> them from the intermediate position to the designated location when jumping.
Right. If it's not guaranteed that the destination address has been accepted
before this copying, the copying could trigger an error due to accessing an
unaccepted page, which could be fatal for a Linux TDX guest.
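
The failure mode can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sim_guest`, `sim_accept_page()` and `sim_copy_page()` are hypothetical names invented for illustration, not kernel APIs, and the -1 return merely stands in for the fault a real TD guest would take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 16   /* tiny pages keep the model readable */
#define SIM_NR_PAGES  8

/* Hypothetical model of guest memory; not a kernel structure. */
struct sim_guest {
	unsigned char mem[SIM_NR_PAGES][SIM_PAGE_SIZE];
	bool accepted[SIM_NR_PAGES];
};

/* Accepting a not-yet-accepted page zeroes it; accepted pages are untouched. */
static void sim_accept_page(struct sim_guest *g, size_t pfn)
{
	if (g->accepted[pfn])
		return;
	memset(g->mem[pfn], 0, SIM_PAGE_SIZE);
	g->accepted[pfn] = true;
}

/*
 * Copy one page. Returns 0 on success, or -1 where the model assumes a
 * real TD guest would fault on touching an unaccepted page.
 */
static int sim_copy_page(struct sim_guest *g, size_t dst, size_t src)
{
	if (!g->accepted[dst] || !g->accepted[src])
		return -1;
	memcpy(g->mem[dst], g->mem[src], SIM_PAGE_SIZE);
	return 0;
}
```

In this model, a copy to a destination that was never accepted fails, while accepting the destination first lets the same copy succeed.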

> If we take this way, we need to search unaccepted->bitmap top down or
> bottom up, according to the setting. Then another suite of functions needs to
> be provided. That looks a little complicated.
Do you mean searching only accepted pages as destination addresses?
That might increase the chance of failure compared to accepting the addresses
found by the search.

> kexec_add_buffer()
> -->arch_kexec_locate_mem_hole()
>    -->kexec_locate_mem_hole()
>       -->kexec_walk_memblock(kbuf, locate_mem_hole_callback) -- on arm64
>       -->kexec_walk_resources(kbuf, locate_mem_hole_callback) -- on x86
>          -->walk_system_ram_res_rev()

Yes.
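
For readers unfamiliar with the search, a simplified user-space sketch of a top-down hole lookup over RAM ranges follows. It is an assumption-laden illustration, not the kernel's locate_mem_hole_callback(): `struct sim_range` and `sim_locate_hole_top_down()` are hypothetical names, and the sketch ignores buf_min/buf_max limits and overlap with segments that were already placed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RAM range with an inclusive end address. */
struct sim_range {
	uint64_t start;
	uint64_t end;
};

/* Round x down to a multiple of align; align must be a power of two. */
static uint64_t sim_align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/*
 * Walk the ranges from last to first and place the buffer as high as
 * possible in the first range that fits. Returns 0 and sets *out on
 * success, -1 if no range can hold the buffer.
 */
static int sim_locate_hole_top_down(const struct sim_range *ram, size_t n,
				    uint64_t size, uint64_t align,
				    uint64_t *out)
{
	if (size == 0)
		return -1;

	for (size_t i = n; i-- > 0; ) {
		uint64_t len = ram[i].end - ram[i].start + 1;
		uint64_t start;

		if (len < size)
			continue;
		start = sim_align_down(ram[i].end - size + 1, align);
		if (start < ram[i].start)
			continue;
		*out = start;
		return 0;
	}
	return -1;
}
```

Nothing in such a search consults whether the chosen range has been accepted, which is why the result may land on unaccepted memory.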


> Besides, the change in your patch has one issue. Usually we do a kexec load to
> read in the kernel/initrd/bootparam/purgatory, while they are not copied to
> their destinations until the kexec jump. We could do a kexec load and
> never trigger the jump, yet your change would already have done the accept_memory().
> But this doesn't matter much because the search always finds the
> same location on a given system.
Right.
Do you think it's good to move the accept to machine_kexec()?
The machine_kexec() is platform specific though.

> > from the first kernel, especially when memory pressure is high in the first
> > kernel.
> > 
> >  
> > > 2) why can't we accept memory for (kernel, boot params/cmdline/initrd)
> > > in 2nd kernel? Surely this purgatory still need be accepted in 1st kernel.
> > > Sorry, I just read accept_memory() code, haven't gone through x86 boot
> > > code flow.
> > If a page is not already accepted, invoking accept_memory() will trigger a
> > memory accept to zero-out the page content. So, for the pages passed to the
> > second kernel, they must have been accepted before page content is copied in.
> > 
> > For boot params/cmdline/initrd, perhaps we could make those pages in shared
> > memory initially and have the second kernel to accept private memory for copy.
> > However, that would be very complex and IMHO not ideal.
> 
> I asked this because I saw your reply to Eric in another thread, quoted
> below. I am wondering why the kernel can accept memory for itself while
> other parts can't do so similarly.
> =====
> Yes, the kernel actually will accept initial memory used by itself in
> extract_kernel(), as in arch/x86/boot/compressed/misc.c.
> 
> But the target kernel may not be able to accept memory for purgatory.
> And it currently does not accept memory for boot params/cmdline
> and initrd.
> ====
Thanks for pointing this out.
I also found that my previous reply was confusing and misleading.

The 2nd kernel will accept the addresses before it decompresses itself there.
Since these addresses are somewhere "random", the 2nd kernel (like the 1st
kernel for itself) needs to call accept_memory() in case they have not yet
been accepted.

So, previously, I thought a workable approach might be for kexec to map the
destination addresses in shared memory, perform the copy/jump, and then have the
2nd kernel accept the addresses for decompressing and other parts.
However, aside from the complications and security concerns, this approach is
problematic because the 2nd kernel may clear the pages by accepting them if the
addresses for decompressing overlap with the ones before decompressing.
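
The ordering problem can be shown with a toy model. This is a sketch only: `struct sim_page`, `sim_accept()` and `sim_payload_survives()` are hypothetical names, and the model assumes just what is stated in this thread, namely that accepting a not-yet-accepted page zeroes its content.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SIM_PAGE_SIZE 16

/* Hypothetical page states for the model. */
enum sim_state { SIM_SHARED, SIM_PRIVATE_ACCEPTED };

struct sim_page {
	enum sim_state state;
	unsigned char data[SIM_PAGE_SIZE];
};

/* Accepting a page that is not yet accepted clears its content. */
static void sim_accept(struct sim_page *p)
{
	if (p->state == SIM_PRIVATE_ACCEPTED)
		return;
	memset(p->data, 0, SIM_PAGE_SIZE);
	p->state = SIM_PRIVATE_ACCEPTED;
}

/*
 * Copy a payload into a destination page, then let the "2nd kernel"
 * accept that page. Returns true if the payload is still intact.
 */
static bool sim_payload_survives(bool accept_before_copy)
{
	struct sim_page dst = { .state = SIM_SHARED };
	const unsigned char payload[SIM_PAGE_SIZE] = "kernel image";

	if (accept_before_copy)
		sim_accept(&dst);          /* 1st kernel accepts up front */
	memcpy(dst.data, payload, SIM_PAGE_SIZE);
	sim_accept(&dst);                  /* late accept by the 2nd kernel */
	return memcmp(dst.data, payload, SIM_PAGE_SIZE) == 0;
}
```

With the accept done before the copy, the later accept is a no-op and the payload survives; with the copy done into a shared page first, the later accept wipes it.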

That said, would it be acceptable if I update the patch log and maybe also move
the accept call to machine_kexec()?

New patch log:
The kexec segments' destination addresses are searched from the memblock
or RAM resources. They are not allocated by the first kernel, though they
may overlap with memory in use by the first kernel. So, it is not
guaranteed that they are accepted before kexec relocates to the second
kernel.

Accept the destination addresses before kexec relocates to the second
kernel, since kexec would access them by swapping content of source and
destination pages.
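
A rough user-space sketch of what the patch log describes, accepting every page covered by each segment's destination range before relocation, might look like the following. All names here (`struct sim_segment`, `sim_unaccepted`, `sim_accept_range()`, `sim_accept_segments()`) are hypothetical; the `mem`/`memsz` fields only mirror the idea of a segment's destination address and size, and the real call site and accept interface are whatever the final patch settles on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SHIFT 12
#define SIM_NR_PAGES   64

/* Hypothetical stand-in for a kexec segment's destination range. */
struct sim_segment {
	uint64_t mem;    /* destination physical address */
	uint64_t memsz;  /* destination size in bytes */
};

/* Hypothetical per-page "still unaccepted" map. */
static bool sim_unaccepted[SIM_NR_PAGES];

/* Accept all pages touching [start, start + size); return how many were new. */
static size_t sim_accept_range(uint64_t start, uint64_t size)
{
	size_t newly = 0;
	uint64_t first, last;

	if (size == 0)
		return 0;
	first = start >> SIM_PAGE_SHIFT;
	last = (start + size - 1) >> SIM_PAGE_SHIFT;
	for (uint64_t pfn = first; pfn <= last && pfn < SIM_NR_PAGES; pfn++) {
		if (sim_unaccepted[pfn]) {
			sim_unaccepted[pfn] = false;
			newly++;
		}
	}
	return newly;
}

/* Accept the destination range of every segment before "relocating". */
static size_t sim_accept_segments(const struct sim_segment *segs, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += sim_accept_range(segs[i].mem, segs[i].memsz);
	return total;
}
```

Because already-accepted pages are skipped, running the pass again, or running it over ranges the first kernel has already touched, does no extra work in this model.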