Message-ID: <ad2f9bcb-1aa7-0a35-942f-6a5f674823fe@gmail.com>
Date: Mon, 16 May 2022 21:08:58 +0800
From: Tianyu Lan <ltykernel@...il.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: m.szyprowski@...sung.com, robin.murphy@....com,
michael.h.kelley@...rosoft.com, kys@...rosoft.com,
Tianyu Lan <Tianyu.Lan@...rosoft.com>,
iommu@...ts.linux-foundation.org, linux-kernel@...r.kernel.org,
vkuznets@...hat.com, brijesh.singh@....com, konrad.wilk@...cle.com,
hch@....de, wei.liu@...nel.org, parri.andrea@...il.com,
thomas.lendacky@....com, linux-hyperv@...r.kernel.org,
andi.kleen@...el.com, kirill.shutemov@...el.com
Subject: Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support
On 5/16/2022 3:34 PM, Christoph Hellwig wrote:
> I don't really understand how 'childs' fit in here. The code also
> doesn't seem to be usable without patch 2 and a caller of the
> new functions added in patch 2, so it is rather impossible to review.
Hi Christoph:
OK. I will merge the two patches and add a caller patch. The motivation
is to avoid the global spin lock that is taken when devices use the
swiotlb bounce buffer; it introduces overhead in high-throughput cases.
In my test environment, the current code achieves about 24Gb/s network
throughput with SWIOTLB force enabled, versus about 40Gb/s without
SWIOTLB force. Storage has the same issue.
A per-device IO TLB mem can resolve the global spin lock contention
among devices, but a device may still have multiple queues, and those
queues would still have to share one spin lock. That is why the previous
patches introduce child IO TLB mems (or IO TLB areas): each device queue
gets a separate child IO TLB mem with its own spin lock to manage its IO
TLB buffers (see the rough sketch below).

Besides the throughput regression, the global spin lock also costs CPU
time under high throughput, because each device queue has to spin on a
different CPU to acquire the global lock. Child IO TLB mems also resolve
that CPU usage issue.
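
To make the layout concrete, here is a rough sketch of the idea (purely
illustrative; the struct and function names below are made up for this
mail and are not the code in this series):

/* Illustrative only -- not the code in this patch set. */
struct io_tlb_child {
	spinlock_t	lock;		/* protects only this queue's slots */
	unsigned long	nslabs;		/* slots owned by this child area   */
	unsigned long	used;
	unsigned int	index;		/* next slot to try                 */
};

struct io_tlb_mem_sketch {
	phys_addr_t		start;
	phys_addr_t		end;
	unsigned int		nchildren;	/* e.g. one per device queue */
	struct io_tlb_child	*children;
};

/*
 * Each queue allocates bounce slots from its own child area, so only
 * that child's lock is taken instead of one device-wide/global lock.
 */
static int sketch_alloc_slot(struct io_tlb_mem_sketch *mem, unsigned int queue)
{
	struct io_tlb_child *child = &mem->children[queue % mem->nchildren];
	unsigned long flags;
	int slot = -1;

	spin_lock_irqsave(&child->lock, flags);
	if (child->used < child->nslabs) {
		slot = child->index++ % child->nslabs;
		child->used++;
	}
	spin_unlock_irqrestore(&child->lock, flags);

	return slot;
}

The point is only that the lock scope shrinks from device-wide (or
global) to per-queue; the actual slot management stays as it is today.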
>
> Also:
>
> 1) why is SEV/TDX so different from other cases that need bounce
> buffering to treat it different and we can't work on a general
> scalability improvement
Other cases also have the global spin lock issue, but it depends on
whether they hit this bottleneck; if not, the CPU usage issue may be
ignored.
> 2) per previous discussions at how swiotlb itself works, it is
> clear that another option is to just make pages we DMA to
> shared with the hypervisor. Why don't we try that at least
> for larger I/O?
For confidential VMs (both TDX and SEV), we need the bounce buffer to
copy between private memory, which the hypervisor can't access directly,
and shared memory. For security reasons, a confidential VM should not
share the IO stack's DMA pages with the hypervisor directly, in order to
avoid attacks from the hypervisor while the IO stack handles the DMA
data.
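
As a simplified illustration of the copy the bounce buffer has to do
here (not the actual swiotlb_bounce() code, only the direction of data
movement):

/*
 * Simplified sketch: the shared (decrypted) swiotlb slot is the only
 * memory the hypervisor/device can touch; the private buffer stays
 * under guest control and is only reached via this copy.
 */
static void sketch_bounce(void *private_buf, void *shared_slot,
			  size_t size, enum dma_data_direction dir)
{
	if (dir == DMA_TO_DEVICE)
		memcpy(shared_slot, private_buf, size);	/* map   */
	else
		memcpy(private_buf, shared_slot, size);	/* unmap */
}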