Message-ID: <CAGtprH9FA3-RetE=6i7ezxfV0qEV-8z3HLgPEPY=pzuxSiOD+w@mail.gmail.com>
Date: Tue, 30 Jan 2024 22:12:54 +0530
From: Vishal Annapurve <vannapurve@...gle.com>
To: x86@...nel.org, linux-kernel@...r.kernel.org, hch@....de,
petrtesarik@...weicloud.com, Dave Hansen <dave.hansen@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
m.szyprowski@...sung.com, robin.murphy@....com
Cc: pbonzini@...hat.com, rientjes@...gle.com, seanjc@...gle.com,
erdemaktas@...gle.com, ackerleytng@...gle.com, jxgao@...gle.com,
sagis@...gle.com, oupton@...gle.com, peterx@...hat.com, vkuznets@...hat.com,
dmatlack@...gle.com, pgonda@...gle.com, michael.roth@....com,
kirill@...temov.name, thomas.lendacky@....com, linux-coco@...ts.linux.dev,
chao.p.peng@...ux.intel.com, isaku.yamahata@...il.com, andrew.jones@...ux.dev,
corbet@....net, rostedt@...dmis.org, iommu@...ts.linux.dev
Subject: Re: [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity
On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <vannapurve@...gle.com> wrote:
>
> The goal of this series is to align memory conversion requests from CVMs
> to huge page sizes, allowing better host-side management of guest memory
> and optimized page table walks.
>
> This patch series is only partially tested and needs more work; I am
> seeking feedback from the wider community before making further progress.
>
> Background
> =====================
> Confidential VMs (CVMs) support two types of guest memory ranges:
> 1) Private memory: intended to be consumed/modified only by the CVM.
> 2) Shared memory: visible to both guest and host components, used for
>    untrusted IO.
>
> Guest memfd [1] support is set to be merged upstream to isolate guest
> private memory from host userspace. The guest memfd approach allows the
> following setup:
> * Private memory backed by the guest memfd file, which is not accessible
>   from host userspace.
> * Shared memory backed by tmpfs/hugetlbfs files that are accessible from
>   host userspace.
>
> The userspace VMM needs to register two backing stores for all of the
> guest memory ranges:
> * HVAs for shared memory
> * Guest memfd ranges for private memory
>
> KVM keeps track of shared/private guest memory ranges that can be updated at
> runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
> (shared) or guest memfd file offsets (private) based on the attributes of the
> guest memory ranges.
>
> In this setup, there is a possibility of "double allocation", i.e.
> scenarios where both the shared and private backing stores mapped to the
> same guest memory range have memory allocated.
>
> The guest issues a hypercall to convert memory between the two types,
> which KVM forwards to host userspace.
> The userspace VMM is expected to handle the conversion as follows:
> 1) Private to shared conversion:
> * Update guest memory attributes for the range to be shared using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the guest memfd range.
> 2) Shared to private conversion:
> * Update guest memory attributes for the range to be private using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the shared memory file.
>
> Note that unbacking needs to be done for both kinds of conversions in order to
> avoid double allocation.
>
> Problem
> =====================
> CVMs can convert memory between these two types at 4K granularity.
> Conversions done at 4K granularity cause issues when guest memfd support
> is used with hugetlb/hugepage-backed guest private memory:
> 1) Hugetlbfs doesn't allow freeing sub-page ranges when punching holes,
>    causing all private-to-shared memory conversions to result in double
>    allocation.
> 2) Even if a new fs is implemented for guest memfd that allows splitting
>    hugepages, punching holes at 4K granularity will cause:
>    - loss of the vmemmap optimization [2]
>    - more memory for EPT/NPT entries and extra page table walks for
>      guest-side accesses.
>    - shared memory mappings to consume more host page table entries and
>      extra page table walks for host-side accesses.
>    - a higher number of conversions, with the additional overhead of VM
>      exits serviced by host userspace.
>
> Memory conversion scenarios in the guest that are of major concern:
> - SWIOTLB area conversion early during boot.
>   * dma_map_* API invocations in CVMs result in using bounce buffers
>     from the SWIOTLB region, which is already marked as shared.
> - Device drivers allocating memory using dma_alloc_* APIs at runtime,
>   which bypasses SWIOTLB.
>
> Proposal
> =====================
> To counter the above issues, this series proposes the following:
> 1) Use boot-time-allocated SWIOTLB pools for all DMA memory allocated
>    using dma_alloc_* APIs.
> 2) Increase the memory allocated at boot for SWIOTLB from 6% to 8% for
>    CVMs.
> 3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can
>    be scaled up as needed.
> 4) Ensure the SWIOTLB pool is 2M aligned so that all conversions happen
>    at 2M granularity once during boot.
> 5) Add a check to ensure all conversions happen at 2M granularity.
>
> ** This series leaves out some of the conversion sites which might not
> be 2M aligned but should be easy to fix once the approach is finalized. **
>
> 1G alignment for conversion:
> * Using 1G alignment may cause over-allocation of SWIOTLB buffers, which
>   might be acceptable for CVMs depending on further considerations.
> * It might be challenging to use 1G-aligned conversions in OVMF; 2M
>   alignment should be achievable with OVMF changes [3].
>
> Alternatives could be:
> 1) Separate hugepage-aligned DMA pools set up by individual device
>    drivers in the case of CVMs.
>
> [1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@redhat.com/
> [2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> [3] https://github.com/tianocore/edk2/pull/3784
> [4] https://lore.kernel.org/lkml/20230908080031.GA7848@lst.de/T/
>
> Vishal Annapurve (5):
> swiotlb: Support allocating DMA memory from SWIOTLB
> swiotlb: Allow setting up default alignment of SWIOTLB region
> x86: CVMs: Enable dynamic swiotlb by default for CVMs
> x86: CVMs: Allow allocating all DMA memory from SWIOTLB
> x86: CVMs: Ensure that memory conversions happen at 2M alignment
>
> arch/x86/Kconfig | 2 ++
> arch/x86/kernel/pci-dma.c | 2 +-
> arch/x86/mm/mem_encrypt.c | 8 ++++++--
> arch/x86/mm/pat/set_memory.c | 6 ++++--
> include/linux/swiotlb.h | 22 ++++++----------------
> kernel/dma/direct.c | 4 ++--
> kernel/dma/swiotlb.c | 17 ++++++++++++-----
> 7 files changed, 33 insertions(+), 28 deletions(-)
>
> --
> 2.43.0.275.g3460e3d667-goog
>
Ping for review of this series.
Thanks,
Vishal