linux-kernel - RE: [RFC 0/7] Introduce swiotlb throttling

Message-ID:
 <SN6PR02MB4157F23504BEB951BB19CE9BD4952@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Wed, 28 Aug 2024 16:30:04 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Petr Tesařík <petr@...arici.cz>, Robin Murphy
	<robin.murphy@....com>
CC: "kbusch@...nel.org" <kbusch@...nel.org>, "axboe@...nel.dk"
	<axboe@...nel.dk>, "sagi@...mberg.me" <sagi@...mberg.me>,
	"James.Bottomley@...senPartnership.com"
	<James.Bottomley@...senPartnership.com>, "martin.petersen@...cle.com"
	<martin.petersen@...cle.com>, "kys@...rosoft.com" <kys@...rosoft.com>,
	"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>, "wei.liu@...nel.org"
	<wei.liu@...nel.org>, "decui@...rosoft.com" <decui@...rosoft.com>,
	"hch@....de" <hch@....de>, "m.szyprowski@...sung.com"
	<m.szyprowski@...sung.com>, "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
	"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>
Subject: RE: [RFC 0/7] Introduce swiotlb throttling

From: Petr Tesařík <petr@...arici.cz> Sent: Wednesday, August 28, 2024 6:04 AM
> 
> On Wed, 28 Aug 2024 13:02:31 +0100
> Robin Murphy <robin.murphy@....com> wrote:
> 
> > On 2024-08-22 7:37 pm, mhkelley58@...il.com wrote:
> > > From: Michael Kelley <mhklinux@...look.com>
> > >
> > > Background
> > > ==========
> > > Linux device drivers may make DMA map/unmap calls in contexts that
> > > cannot block, such as in an interrupt handler. Consequently, when a
> > > DMA map call must use a bounce buffer, the allocation of swiotlb
> > > memory must always succeed immediately. If swiotlb memory is
> > > exhausted, the DMA map call cannot wait for memory to be released. The
> > > call fails, which usually results in an I/O error.
> > >
> > > Bounce buffers are usually used infrequently for a few corner cases,
> > > so the default swiotlb memory allocation of 64 MiB is more than
> > > sufficient to avoid running out and causing errors. However, recently
> > > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > > swiotlb memory. This large allocation reduces the likelihood of a
> > > spike in usage causing DMA map failures. Unfortunately for most
> > > workloads, this insurance against spikes comes at the cost of
> > > > potentially "wasting" hundreds of MiB of the VM's memory, as swiotlb
> > > memory can't be used for other purposes.
> > >
> > > Approach
> > > ========
> > > The goal is to significantly reduce the amount of memory reserved as
> > > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > > DMA map failures due to memory exhaustion.
> >
> > Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> > meant to address? Of course the implementation of that is still young
> > and has plenty of scope to be made more effective, and some of the ideas
> > here could very much help with that, but I'm struggling a little to see
> > what's really beneficial about having a completely disjoint mechanism
> > for sitting around doing nothing in the precise circumstances where it
> > would seem most possible to allocate a transient buffer and get on with it.
> 
> This question can probably be best answered by Michael, but let me give
> my understanding of the differences. First the similarity: Yes, one
> of the key new concepts is that swiotlb allocation may block, and I
> introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> was later dropped, but dynamic SWIOTLB would still benefit from it.
> 
> More importantly, dynamic SWIOTLB may deplete memory following an I/O
> spike. I do have some ideas for how memory could be returned to the
> allocator, but the code is not ready (unlike this patch series).
> Moreover, it may still be a better idea to throttle the devices
> instead, because returning DMA'able memory is not always cheap. In a
> CoCo VM, this memory must be re-encrypted, and that requires a
> hypercall that I'm told is expensive.
> 
> In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> than to grow the swiotlb.
> 
> Michael, please add your insights.
> 
> Petr T
> 

The other limitation of SWIOTLB_DYNAMIC is that growing swiotlb
memory requires large chunks of physically contiguous memory,
which may be impossible to get after a system has been running a
while. With a major rework of swiotlb memory allocation code, it might
be possible to get by with a piecewise assembly of smaller contiguous
memory chunks, but getting many smaller chunks could also be
challenging.

Growing swiotlb memory must also be done as a background async
operation if the DMA map operation can't block. So transient buffers
are needed, which must be encrypted and decrypted on every round
trip in a CoCo VM. The transient buffer memory comes from the
atomic pool, which typically isn't that large and could itself become
exhausted. So we're somewhat playing whack-a-mole on the memory
allocation problem.

We discussed the limitations of SWIOTLB_DYNAMIC in large CoCo VMs
at the time SWIOTLB_DYNAMIC was being developed, and I think there
was general agreement that throttling would be better for the CoCo
VM scenario.
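
For a sense of scale in large CoCo VMs, here is a rough sketch of the
sizing heuristic described in the background text quoted above (~6% of
VM memory, up to 1 GiB). The function name and the 64 MiB floor (taken
from the default swiotlb size mentioned there) are assumptions for
illustration only, not the kernel's actual code.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/*
 * Hypothetical helper: estimate the swiotlb reservation for a CoCo VM.
 * Assumes ~6% of VM memory, capped at 1 GiB, and never less than the
 * 64 MiB default. The floor is an assumption for this sketch.
 */
static uint64_t coco_swiotlb_size(uint64_t vm_mem_bytes)
{
	uint64_t size = vm_mem_bytes / 100 * 6;	/* roughly 6% */

	if (size < 64 * MIB)
		size = 64 * MIB;
	if (size > GIB)
		size = GIB;
	return size;
}
```

Under these assumptions an 8 GiB VM reserves a bit under 500 MiB, and
any VM of about 17 GiB or more hits the 1 GiB cap, which is the memory
that mostly sits idle outside of I/O spikes.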

Broadly, throttling DMA map requests seems like a fundamentally more
robust approach than growing swiotlb memory. And starting down
the path of allowing designated DMA map requests to block might have
broader benefits as well, perhaps on the IOMMU path.
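
To make the idea concrete, here is a minimal userspace model of
throttling, not the code from the patch series. All names (bounce_pool,
pool_map, throttle_mark) are hypothetical. It assumes requests flagged
as able to block are held back once usage would cross a threshold, so
the remaining slots stay available to requests that cannot wait. The
model returns MAP_THROTTLED where a real implementation would sleep
until slots are released.

```c
#include <stdbool.h>
#include <stddef.h>

enum map_result { MAP_OK, MAP_THROTTLED, MAP_FAILED };

/* Hypothetical model of a fixed-size bounce buffer pool. */
struct bounce_pool {
	size_t total_slots;	/* fixed pool size, never grows */
	size_t used_slots;	/* slots currently mapped */
	size_t throttle_mark;	/* blocking-capable callers wait above this */
};

static enum map_result pool_map(struct bounce_pool *p, size_t nslots,
				bool may_block)
{
	/* Callers that may block are delayed instead of eating the reserve. */
	if (may_block && p->used_slots + nslots > p->throttle_mark)
		return MAP_THROTTLED;

	/* Non-blocking callers fail only when the pool is truly exhausted. */
	if (p->used_slots + nslots > p->total_slots)
		return MAP_FAILED;

	p->used_slots += nslots;
	return MAP_OK;
}

static void pool_unmap(struct bounce_pool *p, size_t nslots)
{
	/* Clamp so a bad caller cannot underflow the counter. */
	p->used_slots = nslots > p->used_slots ? 0 : p->used_slots - nslots;
}
```

With a 100-slot pool and a mark of 70, a blocking-capable request that
would push usage past 70 is delayed, while an interrupt-context request
can still draw on the last 30 slots. The pool itself stays fixed in
size, so no extra memory has to be found or converted between
encrypted and decrypted during a spike.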

These points are all arguable, and your point about having two somewhat
overlapping mechanisms is valid. Between the two, my personal viewpoint
is that throttling is the better approach, but I'm probably biased by my
background in the CoCo VM world. Petr and others may see the tradeoffs
differently.

Michael