Message-ID: <20241213094930.748-1-yan.y.zhao@intel.com>
Date: Fri, 13 Dec 2024 17:49:30 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: ebiederm@...ssion.com
Cc: kexec@...ts.infradead.org,
	linux-kernel@...r.kernel.org,
	linux-coco@...ts.linux.dev,
	x86@...nel.org,
	rick.p.edgecombe@...el.com,
	kirill.shutemov@...ux.intel.com,
	bhe@...hat.com,
	Yan Zhao <yan.y.zhao@...el.com>
Subject: [PATCH v2 0/1] Accept unaccepted kexec segments' destination addresses

Hi Eric,

This is a repost of the patch "kexec_core: Accept unaccepted kexec
destination addresses" [1], rebased to v6.13-rc2.

The code implementation remains unchanged, but the patch message now
includes more background and explanations to address previous concerns from
you and Baoquan.

Additionally, below is a more detailed explanation of unaccepted memory in
TDX. Please let me know if it is still not clear enough.


== Unaccepted memory in TDX ==

Intel TDX (Trust Domain Extensions) provides a hardware-based trusted
execution environment for TDs (hardware-isolated VMs). The host OS is not
trusted. Although it allocates physical pages for TDs, it does not and
cannot know the contents of a TD's pages.

TD's memory is added via two methods by invoking different instructions in
the host:
1. For TD's initial private memory, such as for firmware HOBs:
   - This type of memory is added without requiring the TD's acceptance.
   - The TD will perform attestation of the page GPA and content later.

2. For TD's runtime private memory:
   - After the host adds the memory, it remains pending until the TD
     accepts it.

Memory added by method 1 is not relevant to the unaccepted memory we will
discuss.

For memory added by method 2, the TD's acceptance can occur before or after
the TD's memory access:
(a) Access first:
    - TD accesses a private GPA,
    - Host OS allocates physical memory,
    - Host OS requests hardware to map the physical page to the GPA,
    - TD accepts the GPA.

(b) Accept first:
    - TD accepts a private GPA,
    - Host OS allocates physical memory,
    - Host OS requests hardware to map the physical page to the GPA,
    - TD accesses the GPA.

"(a) Access first" is regarded as unsafe for a Linux guest and is therefore
not used.
For "(b) Accept first", the TD's "accept" operation includes the following
steps:
- Trigger a VM-exit.
- The host OS allocates a physical page and requests hardware to map the
  physical page to the GPA.
- Initialize the physical page with content set to 0.
- Encrypt the memory.
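
As a rough illustration of the "Accept first" ordering, here is a small
userspace model. It is only a sketch: gpa_model, the GPA_* states and the
model_* functions are hypothetical names invented for this example, not TDX
or kernel interfaces, and encryption is not modeled. A read before accept
is simply refused by the model to reflect the guest's policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical per-GPA state used only by this model. */
enum gpa_state { GPA_UNMAPPED, GPA_PENDING, GPA_ACCEPTED };

struct gpa_model {
	enum gpa_state state;
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Host side: allocate and map a page; it is now pending acceptance. */
void model_host_map(struct gpa_model *p)
{
	assert(p && p->state == GPA_UNMAPPED);
	memset(p->data, 0xAA, sizeof(p->data)); /* arbitrary stale content */
	p->state = GPA_PENDING;
}

/* TD side: accept the GPA. Maps on demand, then zeroes the page. */
void model_td_accept(struct gpa_model *p)
{
	assert(p);
	if (p->state == GPA_ACCEPTED)
		return;                 /* accepting twice is a no-op here */
	if (p->state == GPA_UNMAPPED)
		model_host_map(p);      /* stands in for the VM-exit to the host */
	memset(p->data, 0, sizeof(p->data));
	p->state = GPA_ACCEPTED;
}

/* TD side: the model only allows access after the GPA was accepted. */
bool model_td_read(const struct gpa_model *p, size_t off, unsigned char *out)
{
	assert(p && out);
	if (p->state != GPA_ACCEPTED || off >= MODEL_PAGE_SIZE)
		return false;
	*out = p->data[off];
	return true;
}
```

In this model a read succeeds only after model_td_accept() has run, and the
value read back is 0.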


To enable the "Accept first" approach, an "unaccepted memory" mechanism is
used, which requires cooperation from the virtual firmware and the Linux
guest.

1. The host OS adds initial private memory that does not require TD's
   acceptance. The host OS composes EFI_HOB_RESOURCE_DESCRIPTORs and loads
   the virtual firmware first. Guest RAM, excluding that for initial
   memory, is reported as UNACCEPTED in the descriptor.

2. The virtual firmware parses the descriptors and accepts the UNACCEPTED
   memory below 4G. It then excludes the below-4G range from the UNACCEPTED
   range.

3. The virtual firmware loads the Linux guest image (the address to load is
   below 4G).

4. The Linux guest requests the UNACCEPTED bitmap from the virtual
   firmware:
   - Locate EFI_UNACCEPTED_MEMORY entries from the memory map returned by
     the efi_get_memory_map boot service.
   - Request via EFI boot service to allocate an unaccepted_table in memory
     of type EFI_ACPI_RECLAIM_MEMORY (E820_TYPE_ACPI) to hold the
     unaccepted bitmap.
   - Install the unaccepted_table as an EFI configuration table via the
     boot service.
   - Initialize the unaccepted bitmap according to the
     EFI_UNACCEPTED_MEMORY entries.

5. The Linux guest decompresses the kernel image. It accepts the target GPA
   for decompression first in case it is not accepted by the virtual
   firmware.

6. The Linux guest calls memblock_free_all() to put all memory into the
   freelists for the buddy allocator. memblock_free_all() further calls
   down to __free_pages_core() to handle memory in 4M (order 10) units.

   - In eager mode, the Linux guest accepts all memory and appends it to
     the freelists.
   - In lazy mode, the Linux guest checks if the entire 4M memory has been
     accepted by querying the unaccepted bitmap.
     a) If all memory is accepted, it adds the 4M memory to the freelists.
     b) If any memory is unaccepted (even if the range contains accepted
        pages), the Linux guest does not add the 4M memory to the
        freelists. Instead, it queues the first page in the 4M range onto
        the list zone->unaccepted_pages and sets the first page with the
        Unaccepted flag.

7. When there is not enough free memory, cond_accept_memory() in the Linux
   guest calls try_to_accept_memory_one() to dequeue a page from the list
   zone->unaccepted_pages, clear its Unaccepted flag, accept the entire 4M
   memory range represented by the page, and add the 4M memory to the
   freelists.
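
Steps 4, 6 (lazy mode) and 7 above can be sketched as a small userspace
model. This is not kernel code: guest_model, mem_desc and the model_*
functions are hypothetical names, the 2M bitmap granularity is an
assumption of the model, and the zone->unaccepted_pages list is reduced to
a per-block flag. Eager mode is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model parameters (assumptions for illustration only). */
#define UNIT_SIZE  (2ULL << 20)   /* one bitmap entry covers 2M */
#define BLOCK_SIZE (4ULL << 20)   /* one order-10 block with 4K pages */
#define NR_UNITS   64
#define NR_BLOCKS  (NR_UNITS * UNIT_SIZE / BLOCK_SIZE)

enum mem_type { MEM_ACCEPTED, MEM_UNACCEPTED };   /* hypothetical */
struct mem_desc { uint64_t start, size; enum mem_type type; };

struct guest_model {
	bool unaccepted[NR_UNITS];    /* stands in for the unaccepted bitmap */
	bool on_freelist[NR_BLOCKS];  /* block was given to the allocator */
	bool queued[NR_BLOCKS];       /* block is parked as unaccepted */
	unsigned accept_calls;        /* units actually accepted */
};

/* Step 4: mark every unit covered by an UNACCEPTED descriptor. */
void model_init_bitmap(struct guest_model *g, const struct mem_desc *map,
		       size_t n)
{
	memset(g, 0, sizeof(*g));
	for (size_t i = 0; i < n; i++) {
		if (map[i].type != MEM_UNACCEPTED)
			continue;
		/* The model requires unit-aligned, in-range descriptors. */
		assert(map[i].start % UNIT_SIZE == 0);
		assert(map[i].size % UNIT_SIZE == 0);
		assert(map[i].start + map[i].size <= NR_UNITS * UNIT_SIZE);
		for (uint64_t a = map[i].start;
		     a < map[i].start + map[i].size; a += UNIT_SIZE)
			g->unaccepted[a / UNIT_SIZE] = true;
	}
}

/* Accept each still-unaccepted unit in [start, start + size). */
void model_accept_memory(struct guest_model *g, uint64_t start, uint64_t size)
{
	assert(start + size <= NR_UNITS * UNIT_SIZE);
	for (uint64_t u = start / UNIT_SIZE; u * UNIT_SIZE < start + size; u++) {
		if (!g->unaccepted[u])
			continue;             /* already accepted: nothing to do */
		g->unaccepted[u] = false;
		g->accept_calls++;
	}
}

/* Step 6, lazy mode: free the block only if it is fully accepted. */
void model_free_block_lazy(struct guest_model *g, size_t block)
{
	uint64_t start = block * BLOCK_SIZE;
	bool any_unaccepted = false;

	for (uint64_t u = start / UNIT_SIZE;
	     u < (start + BLOCK_SIZE) / UNIT_SIZE; u++)
		any_unaccepted = any_unaccepted || g->unaccepted[u];

	if (any_unaccepted)
		g->queued[block] = true;
	else
		g->on_freelist[block] = true;
}

/* Step 7: under memory pressure, accept one queued block and free it. */
bool model_accept_one(struct guest_model *g)
{
	for (size_t b = 0; b < NR_BLOCKS; b++) {
		if (!g->queued[b])
			continue;
		g->queued[b] = false;
		model_accept_memory(g, b * BLOCK_SIZE, BLOCK_SIZE);
		g->on_freelist[b] = true;
		return true;
	}
	return false;
}
```

Note that a block whose first 2M is accepted and whose second 2M is not is
still queued as a whole, which is why a queued range may contain accepted
pages.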


== Conclusion ==
- zone->unaccepted_pages is a mechanism to conditionally accept private
  memory and make it available to the page allocator.
- The unaccepted bitmap resides in the firmware's reserved memory and
  persists across guest OSs. It records exactly which pages have not been
  accepted.
- Memory ranges represented by zone->unaccepted_pages may contain accepted
  pages.


For kexec in TDs,
- If the segments' destination addresses are within the range managed by
  the buddy allocator, the pages must already be in an accepted state.
  Calling accept_memory() will check the unaccepted bitmap and do nothing.
- If the segments' destination addresses are not yet managed by the buddy
  allocator, the pages may or may not have been accepted.
  Calling accept_memory() will perform the "accept" operation if they are
  not accepted.

The second guest kernel booted via kexec obtains the unaccepted bitmap by
locating the unaccepted_table in the EFI configuration tables. So, pages
whose bits are not set in the unaccepted bitmap are not accepted again.


The unaccepted table/bitmap is only useful for TDs. A Linux host will
detect that the physical firmware does not support the memory acceptance
protocol, and accept_memory() will simply bail out.
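
The kexec cases above can be illustrated with the following userspace
sketch. It is not the patch itself and does not reproduce the kernel's
accept_memory() signature: unaccepted_model, segment_model and the model_*
functions are hypothetical names, and the 2M bitmap granularity is an
assumption. A NULL table stands in for a host whose firmware does not
provide the unaccepted table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SIZE (2ULL << 20)   /* assumed: one bitmap entry per 2M */
#define NR_UNITS  32

/* Hypothetical stand-in for the firmware-provided unaccepted table. */
struct unaccepted_model {
	bool unaccepted[NR_UNITS];
	unsigned accepted_now;       /* "accept" operations performed */
};

/* Hypothetical stand-in for one kexec segment's destination range. */
struct segment_model {
	uint64_t mem;
	uint64_t memsz;
};

/* Model of accept_memory(): bail out without a table, skip clear bits. */
void model_accept_memory(struct unaccepted_model *t, uint64_t start,
			 uint64_t size)
{
	if (!t || size == 0)
		return;                  /* host case, or nothing to accept */
	assert(start + size <= NR_UNITS * UNIT_SIZE);
	for (uint64_t u = start / UNIT_SIZE; u * UNIT_SIZE < start + size; u++) {
		if (!t->unaccepted[u])
			continue;            /* already accepted earlier */
		t->unaccepted[u] = false;
		t->accepted_now++;
	}
}

/* Accept the destination range of every segment before it is written. */
void model_accept_segments(struct unaccepted_model *t,
			   const struct segment_model *seg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_accept_memory(t, seg[i].mem, seg[i].memsz);
}
```

Because the table outlives the first kernel in this model, running
model_accept_segments() a second time on the same table performs no further
"accept" operations.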

Thanks
Yan

[1] https://lore.kernel.org/all/20241021034553.18824-1-yan.y.zhao@intel.com

Yan Zhao (1):
  kexec_core: Accept unaccepted kexec segments' destination addresses

 kernel/kexec_core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

-- 
2.43.2

