linux-kernel - Re: [PATCH v2 0/1] Accept unaccepted kexec segments' destination addresses

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z4YMdKz93PI7wsEl@yzhao56-desk.sh.intel.com>
Date: Tue, 14 Jan 2025 15:04:20 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
CC: Baoquan He <bhe@...hat.com>, "Kirill A. Shutemov" <kirill@...temov.name>,
	<akpm@...ux-foundation.org>, <kexec@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>, <linux-coco@...ts.linux.dev>,
	<x86@...nel.org>, <rick.p.edgecombe@...el.com>,
	<kirill.shutemov@...ux.intel.com>, <security@...nel.org>
Subject: Re: [PATCH v2 0/1] Accept unaccepted kexec segments' destination
 addresses

On Mon, Jan 13, 2025 at 08:59:29AM -0600, Eric W. Biederman wrote:
> Baoquan He <bhe@...hat.com> writes:
> 
> > On 01/13/25 at 12:01pm, Kirill A. Shutemov wrote:
> >> On Fri, Dec 13, 2024 at 05:49:30PM +0800, Yan Zhao wrote:
> >> > Hi Eric,
> >> > 
> >> > This is a repost of the patch "kexec_core: Accept unaccepted kexec
> >> > destination addresses" [1], rebased to v6.13-rc2.
> >> 
> >> Can we get this patch applied?
> >
> > This looks good to me. In v1, we have analyzed all other possible
> > solutions, however change in this patch seems the simplest and most
> > accepatable one.
> 
> Truly?  I will go back and look and see what I missed but I haven't seen
> anything that I addressed my original objections.
> 
> To repeat my objection.  The problem I saw was that the performance of
> the accepted memory paradigm was so terrible that they had to resort to
> lazily ``accepting'' memory, which leads to hacks in kexec.  I would not
> like to included hacks in kexec just so that other people can avoid
> fixing their bugs.
Hi Eric,
Your previous concerns in v1 [1] include:

1. "an unfiltered accept_memory may result in memory that has already been
   ``accepted'' being accepted again.
2. "target kernel won't know about about accepting memory, or might not perform
   the work early enough and try to use memory without accepting it first."
3. "this is has the potential to conflict with the accounting in
   try_to_accept_memory"

For 1/2, as we explained in [2], accept_memory() is not unfiltered. A bitmap in
       the virtual firmware maintains the accepted/unaccepted status and the
       bitmap is passed across the kernels.

For 3, sorry that I didn't explain clearly enough in v1, so I explained it in
       detail in the v2's cover letter (please check bullet 6 in [3]).
       The accounting in try_to_accept_memory_one() includes
       zone->unaccepted_pages,
       zone_page_state(zone, NR_FREE_PAGES),
       zone_page_state(zone, NR_UNACCEPTED),
       which are updated in try_to_accept_memory_one()-->__accept_page().

       However, the accounting will not be affected by invoking accept_memory()
       in kexec, since accept_memory() does not modify them, and it's correct to
       do so because of the way how the "accept_memory=lazy" works:
       
       (1) when to release free pages to the buddy allocator in
        memblock_free_all() during kernel boot, "accept_memory=lazy" withholds
        some pages out of the buddy allocator by recording them in the
        zone->unaccepted_pages list. The NR_FREE_PAGES, NR_UNACCEPTED are
        increased accordingly. By NR_UNACCEPTED, it just means the count of
        pages that are potentially available but currently not available to
        buddy's freelists. It does not mean alls the pages must be in
        unaccepted status.
        (see __free_pages_core() and __free_unaccepted()).
       (2) When the kernel runs out of memory, indicated by no enough
        NR_UNACCEPTED, it invokes
        cond_accept_memory()-->try_to_accept_memory_one()-->__accept_page() to
        put the pages from zone->unaccepted_pages to buddy's freelists and
        further call accept_memory() to accept those pages.
       Before (2), though accept_memory() can also accept a page, the page is
       not available to the buddy and hence not available to other kernel
       components. When accept_memory() is invoked in (2), the page will not be
       re-accepted.

The reason for this series to have kexec to accept_memory() to kexec segments'
destination addresses is that those addresses are not necessarily allocated by
the first kernel's buddy allocator. So, before kexec accessing those pages
(which could be earlier than the second kernel), we invoke the accept_memory()
to trigger the physical page allocation in host, GFN->PFN mapping, physical page
initialization and encryption. After that, kexec can copy source pages into the
destination pages and start the transition to the second kernel.

With that, do you still think this patch is a hack ?

[1] https://lore.kernel.org/all/87frop8r0y.fsf@email.froward.int.ebiederm.org/
[2] https://lore.kernel.org/all/tpbcun3d4wrnbtsvx3b3hjpdl47f2zuxvx6zqsjoelazdt3eyv@kgqnedtcejta/
[3] https://lore.kernel.org/all/20241213094930.748-1-yan.y.zhao@intel.com


> 
> I did see a coherent explanation of the bad performance that pointed the
> finger squarely at the fact that everything is happening a page at a
> time.  AKA that the design of the ACPI interface has a flaw that needs
> to be fixed.
By flaw, do you mean accepting page by page?

The accept_memory() only takes effect in a guest, demanding physical page
allocation in host OS, which is slow by nature. It's also the truth that
once a page has been accepted, it cannot be swapped out in host.

> I really don't think we should be making complicated work-arounds for
> someone else's bad software decision just because someone immortalized
> their bad decision in a standard.  Just accepting all of memory and
> letting the folks who made the bad decision deal with the consequences
> seems much more reasonable to me.
> 
> > If Eric has no objection, maybe Andrew can help pick this into his
> > tree.
> 
> I have a new objection.  I believe ``unaccepted memory'' and especially
> lazily initialized ``unaccepted memory'' is an information leak that
> could defeat the purpose of encrypted memory.  For that reason I have
Do you mean before the lazy acceptance, the host can access the page?
Or are you referring to that the host can know the GFN of a page when it
responds to the page allocation request?

For the former, the page will be regarded as private by the guest only after
it's accepted in the guest. So no data will be leaked before guest completes
accept_memory() that initializes the memory data to 0 and encrypts the memory.

For the latter, the lazy memory acceptance still happens in a bulk way, i.e.
not in response to the guest accessing of a specific memory.
So, I can't see which information is leaked.

> Cc'd the security list.  I don't know who to CC to get expertise on this
> issue, and the security list folks should.
> 
> Unless I am misunderstanding things the big idea with encrypted
> memory is that the hypervisor won't be able to figure out what you
> are doing, because it can't read your memory.
There might be some misunderstanding.
A page will only be regarded as private by the guest after the guest's explicit
acceptance of the memory.

> 
> My concern is that by making the ``acceptance'' of memory lazy, that
> there is a fairly strong indication of the function of different parts
> of memory.  I expect that signal is strong enough to defeat whatever
> elements of memory address randomization that we implement in the
> kernel.
memory randomization also invokes accept_memory() in extract_kernel().

> So not only does it appear to me that implementation of ``accepting''
> memory has a stupidly slow implementation, somewhat enshrined by a bad
> page at a time ACPI standard, but it appears to me that lazily
> ``accepting'' that memory probably defeats the purpose of having
> encrypted memory.
Not only for performance, but also for memory over-commitment.

Hope the above explanations have addressed your concerns.
Please let me know if anything still doesn't sound correct to you.

Thanks
Yan

> I think the actual solution is to remove all code except for the
> "accept_memory=eager" code paths.  AKA delete the "accept_memory=lazy"
> code.  At that point there are no more changes that need to be made to
> kexec.
> 
> Eric