Message-ID: <630334b5-05d0-0152-7c2c-79174703f0ed@huaweicloud.com>
Date: Mon, 27 Mar 2023 13:06:34 +0200
From: Petr Tesarik <petrtesarik@...weicloud.com>
To: Jonathan Corbet <corbet@....net>, Christoph Hellwig <hch@....de>,
Marek Szyprowski <m.szyprowski@...sung.com>,
Robin Murphy <robin.murphy@....com>,
Borislav Petkov <bp@...e.de>,
"Paul E. McKenney" <paulmck@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Randy Dunlap <rdunlap@...radead.org>,
Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
Kim Phillips <kim.phillips@....com>,
"Steven Rostedt (Google)" <rostedt@...dmis.org>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
open list <linux-kernel@...r.kernel.org>,
"open list:DMA MAPPING HELPERS" <iommu@...ts.linux.dev>
Cc: Roberto Sassu <roberto.sassu@...wei.com>, petr@...arici.cz
Subject: Re: [RFC v1 0/4] Allow dynamic allocation of software IO TLB bounce
buffers
Hello after a week,
I'm top-posting, because I'm not replying to anything particular in my
original RFC. It seems there are no specific comments from other people
either. My understanding is that the general idea of making SWIOTLB grow
and shrink as needed is welcome, but nobody can suggest a good method to
measure the expected performance impact.
FWIW I have seen zero impact when the patches are applied but the
feature is not enabled through the "swiotlb=dynamic" boot parameter. I
believe this is sufficient to start developing dynamic SWIOTLB as an
experimental feature.
@Christoph: I was kind of expecting you to jump up and tell me that the
non-coherent DMA API cannot be used to allocate bounce buffers. ;-) But
since that has not happened, I will assume that there is no weird
architecture which treats SWIOTLB differently from other DMA-capable
regions.
Anyway, I have some comments of my own on the v1 RFC series.
First, the maple tree cannot be used to track buffers, because it is not
safe to use from interrupt context. The maple tree implementation
currently allocates new nodes with kmem_cache_alloc_bulk(), which must
be called with interrupts enabled. This can be changed in the maple tree
code, but I could also pre-allocate nodes from non-irq context, or use
another data structure instead.
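
A minimal userspace sketch of the pre-allocation idea, under stated
assumptions: every name here (node_pool, pool_refill, pool_take) is
hypothetical and is neither maple tree nor SWIOTLB code, malloc() stands
in for a sleeping kernel allocation, and the locking a real kernel
version would need (for example a spinlock taken with interrupts
disabled) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size reserve of tracking nodes. */
#define POOL_CAPACITY 8

struct node_pool {
	void *nodes[POOL_CAPACITY];
	size_t count;		/* number of reserved nodes */
};

/*
 * Non-irq side: allowed to allocate, so it tops the reserve up.
 * Returns the number of nodes added.
 */
static size_t pool_refill(struct node_pool *pool, size_t node_size)
{
	size_t added = 0;

	assert(pool->count <= POOL_CAPACITY);
	while (pool->count < POOL_CAPACITY) {
		void *node = malloc(node_size);

		if (!node)
			break;
		pool->nodes[pool->count++] = node;
		added++;
	}
	return added;
}

/*
 * Atomic side: never allocates, only hands out a reserved node.
 * Returns NULL if the reserve is empty; the caller must handle that.
 */
static void *pool_take(struct node_pool *pool)
{
	if (pool->count == 0)
		return NULL;
	return pool->nodes[--pool->count];
}

/* Release every node still held in the reserve. */
static void pool_destroy(struct node_pool *pool)
{
	while (pool->count)
		free(pool->nodes[--pool->count]);
}
```

The trade-off in such a scheme is that the atomic path can still fail
when the reserve runs dry, so the reserve has to be sized for the worst
burst between two refills.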
Second, based on discussions with my dear colleague Roberto, I have
considered alternative approaches:
A. Keep track of SWIOTLB utilization and allocate additional io_tlb_mem
structures from a separate kernel thread when necessary. The advantage
is that allocations are moved away from the hot path (map/unmap). The
disadvantage is that these additional tables would necessarily be small
(limited by MAX_ORDER and physical memory fragmentation). Another
disadvantage is that such a multi-table SWIOTLB is not very likely to
shrink, because a table can be freed only if ALL slots are free.
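
To illustrate why approach A grows easily but shrinks poorly, here is a
rough userspace sketch. The structure and function names (tlb_table,
should_grow, can_free) and the percentage threshold are assumptions made
for this example only; they do not come from the RFC patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one bounce-buffer table. */
struct tlb_table {
	size_t nslabs;	/* total slots in this table */
	size_t used;	/* slots currently mapped */
};

/*
 * Decide whether a worker thread should allocate another table:
 * true if overall utilization has reached high_pct percent.
 */
static bool should_grow(const struct tlb_table *tables, size_t n,
			unsigned int high_pct)
{
	size_t total = 0, used = 0;

	for (size_t i = 0; i < n; i++) {
		total += tables[i].nslabs;
		used += tables[i].used;
	}
	/* Integer form of used / total >= high_pct / 100. */
	return total == 0 || used * 100 >= total * high_pct;
}

/* A table can be released only when every one of its slots is free. */
static bool can_free(const struct tlb_table *table)
{
	assert(table->used <= table->nslabs);
	return table->used == 0;
}
```

With two tables of 128 slots each, a single long-lived mapping in the
second table is enough to make can_free() return false for it, even
though overall utilization is low.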
B. Allocate a very big SWIOTLB, but allow it to be used for normal
allocations (similar to the CMA approach). The advantage is that there
is only one table, pushing performance impact down to almost zero. The
main challenge is migrating pages to/from the SWIOTLB. Existing CMA code
cannot be reused, because CMA cannot be used from atomic contexts,
unlike SWIOTLB.
For the time being, I propose to start with my RFC series, accepting
some performance drop if the feature is explicitly enabled. With more
feedback from the field (especially from you, SEV SNP guys), I can work
on improving performance.
Petr T
On 3/20/2023 1:28 PM, Petr Tesarik wrote:
> From: Petr Tesarik <petr.tesarik.ext@...wei.com>
>
> The goal of my work is to provide more flexibility in the sizing of
> SWIOTLB. This patch series is a request for comments from the wider
> community. The code is more of a crude hack than a final solution.
>
> I would appreciate suggestions for measuring the performance impact
> of changes in SWIOTLB. More info at the end of this cover letter.
>
> The software IO TLB was designed with these assumptions:
>
> 1. It would not be used much, especially on 64-bit systems.
> 2. A small fixed memory area (64 MiB by default) is sufficient to
> handle the few cases which require a bounce buffer.
> 3. 64 MiB is little enough that it has no impact on the rest of the
> system.
>
> These assumptions no longer hold in several cases.
>
> First, if SEV is active, all DMA must be done through shared
> unencrypted pages, and SWIOTLB is used to make this happen without
> changing device drivers. The software IO TLB size is increased to
> 6% of total memory in sev_setup_arch(), but that is more of an
> approximation. The actual requirements may vary depending on the
> amount of I/O and which drivers are used. These factors may not be
> known at boot time, i.e. when SWIOTLB is allocated.
>
> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
> moved from the rendering GPU (v3d driver), which can access all
> memory, to the display output (vc4 driver), which is connected to a
> bus with an address limit of 1 GiB and no IOMMU. These buffers can
> be large (8 MiB with a FullHD monitor, 34 MiB with a 4K monitor)
> and cannot even be handled by the current SWIOTLB, because they exceed
> the maximum segment size of 256 KiB. Mapping failures can be
> easily reproduced with GNOME remote desktop on a Raspberry Pi 4.
>
> Third, other colleagues have noticed that they can reliably get rid
> of occasional OOM kills on an Arm embedded device by reducing the
> SWIOTLB size. This can be achieved with a kernel parameter, but
> determining the right value puts additional burden on pre-release
> testing, which could be avoided if SWIOTLB is allocated small and
> grows only when necessary.
>
> I have tried to measure the expected performance degradation so
> that I could reduce it and/or compare it to alternative approaches.
> I have performed all tests on an otherwise idle Raspberry Pi 4 with
> swiotlb=force (which, admittedly, is a bit artificial). I quickly
> ran into trouble.
>
> I ran fio against an ext3 filesystem mounted from a UAS drive. To
> my surprise, forcing swiotlb (without my patches) *improved* IOPS
> and bandwidth for 4K and 64K blocks by 3 to 7 percent, and made no
> visible difference for 1M blocks. I also observed smaller minimum
> and average completion latencies, and even smaller maximum
> latencies for 4K blocks. However, when I ran the tests again later
> to verify some oddities, there was a performance drop. It appears
> that IOPS, bandwidth and latencies reported by two consecutive fio
> runs may differ by as much as 10%, so the results are invalid.
>
> I tried to make a micro-benchmark on dma_map_page_attrs() using the
> bcc tool funclatency, but just loading the eBPF program was enough
> to change the behaviour of the system wildly.
>
> I wonder if anyone can give me advice on measuring SWIOTLB
> performance. I can see that AMD, IBM and Microsoft people have
> mentioned performance in their patches, but AFAICS without
> explaining how it was measured. Knowing a bit more would be much
> appreciated.
>
> Petr Tesarik (4):
>   dma-mapping: introduce the DMA_ATTR_MAY_SLEEP attribute
>   swiotlb: Move code around in preparation for dynamic bounce buffers
>   swiotlb: Allow dynamic allocation of bounce buffers
>   swiotlb: Add an option to allow dynamic bounce buffers
>
>  .../admin-guide/kernel-parameters.txt     |   6 +-
>  Documentation/core-api/dma-attributes.rst |  10 +
>  include/linux/dma-mapping.h               |   6 +
>  include/linux/swiotlb.h                   |  17 +-
>  kernel/dma/swiotlb.c                      | 233 +++++++++++++++---
>  5 files changed, 241 insertions(+), 31 deletions(-)
>