Message-ID: <CAGtprH9FA3-RetE=6i7ezxfV0qEV-8z3HLgPEPY=pzuxSiOD+w@mail.gmail.com>
Date: Tue, 30 Jan 2024 22:12:54 +0530
From: Vishal Annapurve <vannapurve@...gle.com>
To: x86@...nel.org, linux-kernel@...r.kernel.org, hch@....de,
petrtesarik@...weicloud.com, Dave Hansen <dave.hansen@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
m.szyprowski@...sung.com, robin.murphy@....com
Cc: pbonzini@...hat.com, rientjes@...gle.com, seanjc@...gle.com,
erdemaktas@...gle.com, ackerleytng@...gle.com, jxgao@...gle.com,
sagis@...gle.com, oupton@...gle.com, peterx@...hat.com, vkuznets@...hat.com,
dmatlack@...gle.com, pgonda@...gle.com, michael.roth@....com,
kirill@...temov.name, thomas.lendacky@....com, linux-coco@...ts.linux.dev,
chao.p.peng@...ux.intel.com, isaku.yamahata@...il.com, andrew.jones@...ux.dev,
corbet@....net, rostedt@...dmis.org, iommu@...ts.linux.dev
Subject: Re: [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity
On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <vannapurve@...gle.com> wrote:
>
> The goal of this series is to align memory conversion requests from CVMs
> to huge page sizes, allowing better host-side management of guest memory
> and optimized page table walks.
>
> This patch series is only partially tested and needs more work; I am
> seeking feedback from the wider community before making further progress.
>
> Background
> =====================
> Confidential VMs (CVMs) support two types of guest memory ranges:
> 1) Private memory: intended to be consumed/modified only by the CVM.
> 2) Shared memory: visible to both guest and host components, used for
>    untrusted IO.
>
> Guest memfd [1] support is set to be merged upstream to isolate guest
> private memory from host userspace. The guest memfd approach allows the
> following setup:
> * Private memory backed by the guest memfd file, which is not accessible
>   from host userspace.
> * Shared memory backed by tmpfs/hugetlbfs files that are accessible from
>   host userspace.
>
> The userspace VMM needs to register two backing stores for all of the
> guest memory ranges:
> * HVAs for shared memory
> * Guest memfd ranges for private memory
>
> KVM keeps track of shared/private guest memory ranges that can be updated at
> runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
> (shared) or guest memfd file offsets (private) based on the attributes of the
> guest memory ranges.
>
> In this setup, there is a possibility of "double allocation", i.e.
> scenarios where both the shared and private backing stores mapped to the
> same guest memory range have memory allocated.
>
> The guest issues a hypercall to convert memory between the two types,
> which KVM forwards to host userspace.
> The userspace VMM is expected to handle the conversion as follows:
> 1) Private to shared conversion:
> * Update guest memory attributes for the range to be shared using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the guest memfd range.
> 2) Shared to private conversion:
> * Update guest memory attributes for the range to be private using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the shared memory file.
>
> Note that unbacking needs to be done for both kinds of conversions in order to
> avoid double allocation.
>
> Problem
> =====================
> CVMs can convert memory between these two types at 4K granularity.
> Conversions done at 4K granularity cause issues when guest memfd support
> is used with hugetlb/hugepage-backed guest private memory:
> 1) Hugetlbfs doesn't allow freeing sub-page ranges when punching holes,
>    causing all private-to-shared memory conversions to result in double
>    allocation.
> 2) Even if a new fs is implemented for guest memfd that allows splitting
>    hugepages, punching holes at 4K granularity will cause:
>    - loss of the vmemmap optimization [2]
>    - more memory for EPT/NPT entries and extra page table walks for
>      guest-side accesses.
>    - shared memory mappings to consume more host page table entries and
>      extra page table walks for host-side accesses.
>    - a higher number of conversions, with the additional overhead of VM
>      exits serviced by host userspace.
>
> Memory conversion scenarios in the guest that are of major concern:
> - SWIOTLB area conversion early during boot.
>   * dma_map_* API invocations in CVMs result in using bounce buffers
>     from the SWIOTLB region, which is already marked as shared.
> - Device drivers allocating memory using dma_alloc_* APIs at runtime,
>   which bypasses SWIOTLB.
>
> Proposal
> =====================
> To counter the above issues, this series proposes the following:
> 1) Use boot-time-allocated SWIOTLB pools for all DMA memory allocated
>    using dma_alloc_* APIs.
> 2) Increase the memory allocated at boot for SWIOTLB from 6% to 8% for
>    CVMs.
> 3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can
>    be scaled up as needed.
> 4) Ensure the SWIOTLB pool is 2M aligned so that all conversions happen
>    at 2M granularity once during boot.
> 5) Add a check to ensure all conversions happen at 2M granularity.
>
> ** This series leaves out some of the conversion sites which might not
> be 2M aligned but should be easy to fix once the approach is finalized. **
>
> 1G alignment for conversion:
> * Using 1G alignment may cause over-allocation of SWIOTLB buffers, which
>   might be acceptable for CVMs depending on further considerations.
> * It might be challenging to use 1G-aligned conversions in OVMF; 2M
>   alignment should be achievable with OVMF changes [3].
>
> Alternatives could be:
> 1) Separate hugepage-aligned DMA pools set up by individual device
>    drivers in the case of CVMs.
>
> [1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@redhat.com/
> [2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> [3] https://github.com/tianocore/edk2/pull/3784
> [4] https://lore.kernel.org/lkml/20230908080031.GA7848@lst.de/T/
>
> Vishal Annapurve (5):
> swiotlb: Support allocating DMA memory from SWIOTLB
> swiotlb: Allow setting up default alignment of SWIOTLB region
> x86: CVMs: Enable dynamic swiotlb by default for CVMs
> x86: CVMs: Allow allocating all DMA memory from SWIOTLB
> x86: CVMs: Ensure that memory conversions happen at 2M alignment
>
> arch/x86/Kconfig | 2 ++
> arch/x86/kernel/pci-dma.c | 2 +-
> arch/x86/mm/mem_encrypt.c | 8 ++++++--
> arch/x86/mm/pat/set_memory.c | 6 ++++--
> include/linux/swiotlb.h | 22 ++++++----------------
> kernel/dma/direct.c | 4 ++--
> kernel/dma/swiotlb.c | 17 ++++++++++++-----
> 7 files changed, 33 insertions(+), 28 deletions(-)
>
> --
> 2.43.0.275.g3460e3d667-goog
>
Ping for review of this series.
Thanks,
Vishal