Message-ID: <bb7dd4ca-948f-4af7-b2ac-3ca02e82699f@arm.com>
Date: Wed, 28 Aug 2024 20:50:21 +0100
From: Robin Murphy <robin.murphy@....com>
To: Petr Tesařík <petr@...arici.cz>
Cc: mhklinux@...look.com, kbusch@...nel.org, axboe@...nel.dk,
sagi@...mberg.me, James.Bottomley@...senPartnership.com,
martin.petersen@...cle.com, kys@...rosoft.com, haiyangz@...rosoft.com,
wei.liu@...nel.org, decui@...rosoft.com, hch@....de,
m.szyprowski@...sung.com, iommu@...ts.linux.dev,
linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
linux-scsi@...r.kernel.org, linux-hyperv@...r.kernel.org,
linux-coco@...ts.linux.dev
Subject: Re: [RFC 0/7] Introduce swiotlb throttling

On 2024-08-28 2:03 pm, Petr Tesařík wrote:
> On Wed, 28 Aug 2024 13:02:31 +0100
> Robin Murphy <robin.murphy@....com> wrote:
>
>> On 2024-08-22 7:37 pm, mhkelley58@...il.com wrote:
>>> From: Michael Kelley <mhklinux@...look.com>
>>>
>>> Background
>>> ==========
>>> Linux device drivers may make DMA map/unmap calls in contexts that
>>> cannot block, such as in an interrupt handler. Consequently, when a
>>> DMA map call must use a bounce buffer, the allocation of swiotlb
>>> memory must always succeed immediately. If swiotlb memory is
>>> exhausted, the DMA map call cannot wait for memory to be released. The
>>> call fails, which usually results in an I/O error.
>>>
>>> Bounce buffers are usually used infrequently for a few corner cases,
>>> so the default swiotlb memory allocation of 64 MiB is more than
>>> sufficient to avoid running out and causing errors. However, recently
>>> introduced Confidential Computing (CoCo) VMs must use bounce buffers
>>> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
>>> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
>>> swiotlb memory. This large allocation reduces the likelihood of a
>>> spike in usage causing DMA map failures. Unfortunately for most
>>> workloads, this insurance against spikes comes at the cost of
>>> potentially "wasting" hundreds of MiB of the VM's memory, as swiotlb
>>> memory can't be used for other purposes.
>>>
>>> Approach
>>> ========
>>> The goal is to significantly reduce the amount of memory reserved as
>>> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
>>> DMA map failures due to memory exhaustion.
>>
>> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
>> meant to address? Of course the implementation of that is still young
>> and has plenty of scope to be made more effective, and some of the ideas
>> here could very much help with that, but I'm struggling a little to see
>> what's really beneficial about having a completely disjoint mechanism
>> for sitting around doing nothing in the precise circumstances where it
>> would seem most possible to allocate a transient buffer and get on with it.
>
> This question can probably best be answered by Michael, but let me give
> my understanding of the differences. First the similarity: Yes, one
> of the key new concepts is that swiotlb allocation may block, and I
> introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> was later dropped, but dynamic SWIOTLB would still benefit from it.
>
> More importantly, dynamic SWIOTLB may deplete memory following an I/O
> spike. I do have some ideas about how memory could be returned to the
> allocator, but the code is not ready (unlike this patch series).
> Moreover, it may still be a better idea to throttle the devices
> instead, because returning DMA'able memory is not always cheap. In a
> CoCo VM, this memory must be re-encrypted, and that requires a
> hypercall that I'm told is expensive.

Sure, making a hypercall in order to progress is expensive relative to
being able to progress without one, but waiting on a lock for an
unbounded time in the hope that other drivers might soon release their
DMA mappings represents a potentially unbounded expense, since it
carries no promise of progress at all - oops, userspace just filled up
SWIOTLB with a misguided dma-buf import and now the OS has livelocked
on stalled I/O threads fighting to retry :(

As soon as we start tracking thresholds etc., that should equally put
us in a position to manage the lifecycle of both dynamic and transient
pools more effectively - larger allocations which can be reused by
multiple mappings until the I/O load drops again could amortise that
initial cost quite a bit.
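
To make that concrete, here's a minimal userspace model in C of that
amortisation idea - none of it is real swiotlb code, and all the names
are made up for illustration: a pool gets set up once, is reused by a
whole burst of mappings, and is only torn down (the expensive
re-encryption step in a CoCo guest) once the load has dropped again:

/*
 * Standalone model only: nothing here is real swiotlb code and every
 * name is invented. It just shows one expensive setup/teardown being
 * shared across a burst of mappings rather than paid per mapping.
 */
#include <stdbool.h>
#include <stdio.h>

struct model_pool {
        size_t capacity;        /* bytes available for bouncing */
        size_t used;            /* bytes currently mapped */
        bool alive;             /* allocated (and decrypted)? */
};

static void pool_setup(struct model_pool *p)
{
        /* Expensive: allocation plus the decrypt hypercall in a CoCo VM. */
        printf("setting up %zu-byte pool\n", p->capacity);
        p->alive = true;
        p->used = 0;
}

static void pool_teardown(struct model_pool *p)
{
        /* Expensive: re-encrypt before handing the memory back. */
        printf("tearing down %zu-byte pool\n", p->capacity);
        p->alive = false;
}

static bool pool_map(struct model_pool *p, size_t len)
{
        if (!p->alive)
                pool_setup(p);
        if (p->used + len > p->capacity)
                return false;   /* caller would block or fall back */
        p->used += len;
        return true;
}

static void pool_unmap(struct model_pool *p, size_t len)
{
        p->used -= len;
        /*
         * Keep the pool while it is still busy; a real implementation
         * could also defer this behind a timer or usage threshold so a
         * follow-on burst can reuse it too.
         */
        if (p->used == 0)
                pool_teardown(p);
}

int main(void)
{
        struct model_pool p = { .capacity = 1 << 20 };
        int i;

        for (i = 0; i < 8; i++)         /* burst of mappings... */
                pool_map(&p, 64 << 10);
        for (i = 0; i < 8; i++)         /* ...shares one setup/teardown */
                pool_unmap(&p, 64 << 10);
        return 0;
}
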
Furthermore, I'm not entirely convinced that the rationale for throttling
being beneficial is even all that sound. Serialising requests doesn't
make them somehow use less memory, it just makes them use it...
serially. If a single CPU is capable of queueing enough requests at once
to fill the SWIOTLB, this is going to do absolutely nothing; if two CPUs
are capable of queueing enough requests together to fill the SWIOTLB,
making them take slightly longer to do so doesn't inherently mean
anything more than reaching the same outcome more slowly. At worst, if a
thread is blocked from polling for completion and releasing a bunch of
mappings of already-finished descriptors because it's stuck on an unfair
lock trying to get one last one submitted, then throttling has actively
harmed the situation.
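
To put rough numbers on that (the figures below are purely
illustrative, not taken from the series), here's the back-of-envelope
arithmetic as a trivial C program: peak occupancy depends only on how
much is outstanding at once, not on how the submissions were paced:

#include <stdio.h>

int main(void)
{
        /* Illustrative numbers only - not taken from the series. */
        const unsigned long pool = 64UL << 20;        /* 64 MiB default pool */
        const unsigned long per_req = 256UL << 10;    /* hypothetical bounce size */
        const unsigned long outstanding = 300;        /* mapped, not yet unmapped */
        unsigned long need = outstanding * per_req;

        printf("need %lu MiB of a %lu MiB pool -> %s\n",
               need >> 20, pool >> 20,
               need > pool ? "exhausted, however slowly it filled" : "fits");
        return 0;
}
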
AFAICS this is dependent on rather particular assumptions of driver
behaviour in terms of DMA mapping patterns and interrupts, plus the
overall I/O workload shape, and it's not clear to me how well that
really generalises.

> In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> than to grow the swiotlb.

I'm not necessarily disputing that for the cases where the assumptions
do hold; it's still more a question of why those two things should be
separate and largely incompatible (I've only skimmed the patches here,
but my impression is that they don't look like they'd play all that
nicely together if both were enabled). To me it would make far more
sense for this to be a tuneable policy of a more holistic
SWIOTLB_DYNAMIC itself, i.e. blockable calls could opportunistically
wait for free space up to a well-defined timeout, but then also fall
back to synchronously allocating a new pool in order to assure a
definite outcome of either success or system-is-dying-level failure.
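
For the sort of shape I mean, here's a standalone userspace sketch
rather than actual swiotlb code (the names and numbers are all
invented): wait a bounded time for space to be released, then grow
synchronously so the caller always gets a definite answer:

/*
 * Userspace sketch only - not the real swiotlb code; all names and
 * numbers here are invented. A blockable request waits a bounded time
 * for space to be released, then synchronously grows the pool, so the
 * outcome is always either success or a definite, reportable failure.
 */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t space_freed = PTHREAD_COND_INITIALIZER;
static size_t pool_free;        /* bytes free across current pools */

/* Stand-in for allocating (and decrypting) another pool. */
static bool grow_pool(size_t min_bytes)
{
        pool_free += min_bytes > (1 << 20) ? min_bytes : (1 << 20);
        return true;            /* false would mean a fatal mapping error */
}

static bool map_blockable(size_t len, unsigned int timeout_ms)
{
        struct timespec deadline;
        bool ok = true;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_ms / 1000;
        deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec++;
                deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&lock);
        /* Opportunistically wait for other mappings to be released... */
        while (pool_free < len) {
                if (pthread_cond_timedwait(&space_freed, &lock,
                                           &deadline) == ETIMEDOUT)
                        break;
        }
        /* ...but never stall for good: fall back to growing the pool. */
        if (pool_free < len && !grow_pool(len))
                ok = false;     /* definite failure, report it upwards */
        if (ok)
                pool_free -= len;
        pthread_mutex_unlock(&lock);
        return ok;
}

static void unmap(size_t len)
{
        pthread_mutex_lock(&lock);
        pool_free += len;
        pthread_cond_broadcast(&space_freed);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        printf("mapped: %d\n", map_blockable(64 << 10, 10));
        unmap(64 << 10);
        return 0;
}

The timed wait there only ever delays the fallback; it never replaces
it.
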
Thanks,
Robin.