Message-ID: <PH0PR21MB3025C32C80BFBE8651CF196AD78C9@PH0PR21MB3025.namprd21.prod.outlook.com>
Date:   Mon, 18 Jul 2022 17:41:58 +0000
From:   "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
To:     Tianyu Lan <ltykernel@...il.com>,
        "corbet@....net" <corbet@....net>,
        "hch@...radead.org" <hch@...radead.org>,
        "m.szyprowski@...sung.com" <m.szyprowski@...sung.com>,
        "robin.murphy@....com" <robin.murphy@....com>,
        "paulmck@...nel.org" <paulmck@...nel.org>,
        "bp@...e.de" <bp@...e.de>,
        "keescook@...omium.org" <keescook@...omium.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "pmladek@...e.com" <pmladek@...e.com>,
        "rdunlap@...radead.org" <rdunlap@...radead.org>,
        "damien.lemoal@...nsource.wdc.com" <damien.lemoal@...nsource.wdc.com>,
        KY Srinivasan <kys@...rosoft.com>
CC:     Tianyu Lan <Tianyu.Lan@...rosoft.com>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        vkuznets <vkuznets@...hat.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        "parri.andrea@...il.com" <parri.andrea@...il.com>,
        "thomas.lendacky@....com" <thomas.lendacky@....com>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "kirill.shutemov@...el.com" <kirill.shutemov@...el.com>,
        "andi.kleen@...el.com" <andi.kleen@...el.com>,
        Andi Kleen <ak@...ux.intel.com>
Subject: RE: [PATCH V4] swiotlb: Split up single swiotlb lock

From: Tianyu Lan <ltykernel@...il.com> Sent: Friday, July 8, 2022 9:16 AM
> 
> From: Tianyu Lan <Tianyu.Lan@...rosoft.com>
> 
> Traditionally swiotlb was not performance critical because it was only
> used for slow devices. But in some setups, like TDX/SEV confidential
> guests, all IO has to go through swiotlb. Currently swiotlb only has a
> single lock. Under high IO load with multiple CPUs this can lead to
> significant lock contention on the swiotlb lock.
> 
> This patch splits the swiotlb bounce buffer pool into individual areas
> which have their own lock. Each CPU tries to allocate in its own area
> first. Only if that fails does it search other areas. On freeing the
> allocation is freed into the area where the memory was originally
> allocated from.
> 
> The number of areas can be set via the swiotlb kernel parameter and
> defaults to the number of possible CPUs. If the possible CPU count is
> not a power of 2, the area number is rounded up to the next power of 2.
> 
> This idea is from Andi Kleen's patch
> (https://github.com/intel/tdx/commit/master 
> 4529b5784c141782c72ec9bd9a92df2b68cb7d45).
> 
> Based-on-idea-by: Andi Kleen <ak@...ux.intel.com>
> Signed-off-by: Tianyu Lan <Tianyu.Lan@...rosoft.com>
> ---
> Change since v3:
>        * Make the area number zero by default
> 
> Change since v2:
>        * Use possible cpu number to adjust iotlb area number
> 
> Change since v1:
>        * Move struct io_tlb_area to swiotlb.c
>        * Fix some coding style issue.

I apologize for being late to the party, but I've spotted a couple of
minor bugs and have a few nit-picky comments.

> ---
>  .../admin-guide/kernel-parameters.txt         |   4 +-
>  include/linux/swiotlb.h                       |   5 +
>  kernel/dma/swiotlb.c                          | 230 +++++++++++++++---
>  3 files changed, 198 insertions(+), 41 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index 2522b11e593f..4a6ad177d4b8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5904,8 +5904,10 @@
>  			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
> 
>  	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
> -			Format: { <int> | force | noforce }
> +			Format: { <int> [,<int>] | force | noforce }
>  			<int> -- Number of I/O TLB slabs
> +			<int> -- Second integer after comma. Number of swiotlb
> +				 areas with their own lock. Must be power of 2.

Rather than "Must be power of 2" it would be more accurate to
say "Will be rounded up to a power of 2".

>  			force -- force using of bounce buffers even if they
>  			         wouldn't be automatically used by the kernel
>  			noforce -- Never use bounce buffers (for debugging)
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 7ed35dd3de6e..5f898c5e9f19 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -89,6 +89,8 @@ extern enum swiotlb_force swiotlb_force;
>   * @late_alloc:	%true if allocated using the page allocator
>   * @force_bounce: %true if swiotlb bouncing is forced
>   * @for_alloc:  %true if the pool is used for memory allocation
> + * @nareas:  Number of areas in the pool.
> + * @area_nslabs: Number of slots in each area.
>   */
>  struct io_tlb_mem {
>  	phys_addr_t start;
> @@ -102,6 +104,9 @@ struct io_tlb_mem {
>  	bool late_alloc;
>  	bool force_bounce;
>  	bool for_alloc;
> +	unsigned int nareas;
> +	unsigned int area_nslabs;
> +	struct io_tlb_area *areas;
>  	struct io_tlb_slot {
>  		phys_addr_t orig_addr;
>  		size_t alloc_size;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index cb50f8d38360..9f547d8ab550 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -70,6 +70,43 @@ struct io_tlb_mem io_tlb_default_mem;
>  phys_addr_t swiotlb_unencrypted_base;
> 
>  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> +static unsigned long default_nareas;
> +
> +/**
> + * struct io_tlb_area - IO TLB memory area descriptor
> + *
> + * This is a single area with a single lock.
> + *
> + * @used:	The number of used IO TLB blocks.
> + * @index:	The slot index to start searching in this area for next round.
> + * @lock:	The lock to protect the above data structures in the map and
> + *		unmap calls.
> + */
> +struct io_tlb_area {
> +	unsigned long used;

I didn't see this field getting initialized anywhere.  'index' and
'lock' are initialized in swiotlb_init_io_tlb_mem(), but not 'used'.
From what I can tell, memblock_alloc() doesn't zero the memory.
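
Something like this (untested) in the init loop in swiotlb_init_io_tlb_mem()
would make the initialization explicit:

	for (i = 0; i < mem->nareas; i++) {
		spin_lock_init(&mem->areas[i].lock);
		mem->areas[i].index = 0;
		mem->areas[i].used = 0;	/* don't rely on the allocator zeroing this */
	}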

> +	unsigned int index;
> +	spinlock_t lock;
> +};
> +
> +static void swiotlb_adjust_nareas(unsigned int nareas)
> +{
> +	if (!is_power_of_2(nareas))
> +		nareas = roundup_pow_of_two(nareas);

If the swiotlb= boot line option specifies "0" as the number
of areas, is_power_of_2() will return false, and
roundup_pow_of_two() will be called with an input
argument of 0, the results of which are undefined.  
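
One option (untested) is to bail out early at the top of
swiotlb_adjust_nareas() so a "0" simply falls back to the
num_possible_cpus() default later:

	/* Treat a "0" from the command line as "not specified" */
	if (!nareas)
		return;

	if (!is_power_of_2(nareas))
		nareas = roundup_pow_of_two(nareas);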

> +
> +	default_nareas = nareas;
> +
> +	pr_info("area num %d.\n", nareas);
> +	/*
> +	 * Round up number of slabs to the next power of 2.
> +	 * The last area is going to be smaller than the rest if
> +	 * default_nslabs is not a power of two.
> +	 */
> +	if (nareas && !is_power_of_2(default_nslabs)) {
> +		default_nslabs = roundup_pow_of_two(default_nslabs);
> +		pr_info("SWIOTLB bounce buffer size roundup to %luMB",
> +			(default_nslabs << IO_TLB_SHIFT) >> 20);
> +	}
> +}
> 
>  static int __init
>  setup_io_tlb_npages(char *str)
> @@ -79,6 +116,10 @@ setup_io_tlb_npages(char *str)
>  		default_nslabs =
>  			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
>  	}
> +	if (*str == ',')
> +		++str;
> +	if (isdigit(*str))
> +		swiotlb_adjust_nareas(simple_strtoul(str, &str, 0));
>  	if (*str == ',')
>  		++str;
>  	if (!strcmp(str, "force"))
> @@ -112,8 +153,19 @@ void __init swiotlb_adjust_size(unsigned long size)
>  	 */
>  	if (default_nslabs != IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT)
>  		return;
> +
> +	/*
> +	 * Round up number of slabs to the next power of 2.
> +	 * The last area is going to be smaller than the rest if
> +	 * default_nslabs is not a power of two.
> +	 */
>  	size = ALIGN(size, IO_TLB_SIZE);
>  	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
> +	if (default_nareas) {
> +		default_nslabs = roundup_pow_of_two(default_nslabs);
> +		size = default_nslabs << IO_TLB_SHIFT;
> +	}
> +
>  	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
>  }
> 
> @@ -192,7 +244,8 @@ void __init swiotlb_update_mem_attributes(void)
>  }
> 
>  static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
> -		unsigned long nslabs, unsigned int flags, bool late_alloc)
> +		unsigned long nslabs, unsigned int flags,
> +		bool late_alloc, unsigned int nareas)
>  {
>  	void *vaddr = phys_to_virt(start);
>  	unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
> @@ -202,10 +255,17 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>  	mem->end = mem->start + bytes;
>  	mem->index = 0;
>  	mem->late_alloc = late_alloc;
> +	mem->nareas = nareas;
> +	mem->area_nslabs = nslabs / mem->nareas;
> 
>  	mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
> 
>  	spin_lock_init(&mem->lock);
> +	for (i = 0; i < mem->nareas; i++) {
> +		spin_lock_init(&mem->areas[i].lock);
> +		mem->areas[i].index = 0;
> +	}
> +
>  	for (i = 0; i < mem->nslabs; i++) {
>  		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>  		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
> @@ -232,7 +292,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>  		int (*remap)(void *tlb, unsigned long nslabs))
>  {
>  	struct io_tlb_mem *mem = &io_tlb_default_mem;
> -	unsigned long nslabs = default_nslabs;
> +	unsigned long nslabs;
>  	size_t alloc_size;
>  	size_t bytes;
>  	void *tlb;
> @@ -242,6 +302,15 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>  	if (swiotlb_force_disable)
>  		return;
> 
> +	/*
> +	 * default_nslabs may be changed when adjusting the area number,
> +	 * so allocate the bounce buffer after adjusting the area number.
> +	 */
> +	if (!default_nareas)
> +		swiotlb_adjust_nareas(num_possible_cpus());
> +
> +	nslabs = default_nslabs;
> +
>  	/*
>  	 * By default allocate the bounce buffer memory from low memory, but
>  	 * allow to pick a location everywhere for hypervisors with guest
> @@ -274,7 +343,13 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>  		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
>  		      __func__, alloc_size, PAGE_SIZE);
> 
> -	swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, flags, false);
> +	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) *
> +		default_nareas, SMP_CACHE_BYTES);

The size to allocate is open coded here, whereas swiotlb_init_late()
uses array_size().  Any reason for the difference?
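
If there's no particular reason, using array_size() here as well would be
more consistent, e.g.:

	mem->areas = memblock_alloc(array_size(sizeof(struct io_tlb_area),
				default_nareas), SMP_CACHE_BYTES);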

> +	if (!mem->areas)
> +		panic("%s: Failed to allocate mem->areas.\n", __func__);
> +
> +	swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, flags, false,
> +				default_nareas);
> 
>  	if (flags & SWIOTLB_VERBOSE)
>  		swiotlb_print_info();
> @@ -296,7 +371,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>  	struct io_tlb_mem *mem = &io_tlb_default_mem;
>  	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
>  	unsigned char *vstart = NULL;
> -	unsigned int order;
> +	unsigned int order, area_order;
>  	bool retried = false;
>  	int rc = 0;
> 
> @@ -337,19 +412,34 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>  			(PAGE_SIZE << order) >> 20);
>  	}
> 
> +	if (!default_nareas)
> +		swiotlb_adjust_nareas(num_possible_cpus());
> +
> +	area_order = get_order(array_size(sizeof(*mem->areas),
> +		default_nareas));

See above comment about array_size() vs. open coding.

> +	mem->areas = (struct io_tlb_area *)
> +		__get_free_pages(GFP_KERNEL | __GFP_ZERO, area_order);
> +	if (!mem->areas)
> +		goto error_area;
> +
>  	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>  		get_order(array_size(sizeof(*mem->slots), nslabs)));
> -	if (!mem->slots) {
> -		free_pages((unsigned long)vstart, order);
> -		return -ENOMEM;
> -	}
> +	if (!mem->slots)
> +		goto error_slots;
> 
>  	set_memory_decrypted((unsigned long)vstart,
>  			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> -	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, 0, true);
> +	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, 0, true,
> +				default_nareas);
> 
>  	swiotlb_print_info();
>  	return 0;
> +
> +error_slots:
> +	free_pages((unsigned long)mem->areas, area_order);
> +error_area:
> +	free_pages((unsigned long)vstart, order);
> +	return -ENOMEM;
>  }
> 
>  void __init swiotlb_exit(void)
> @@ -357,6 +447,7 @@ void __init swiotlb_exit(void)
>  	struct io_tlb_mem *mem = &io_tlb_default_mem;
>  	unsigned long tbl_vaddr;
>  	size_t tbl_size, slots_size;
> +	unsigned int area_order;
> 
>  	if (swiotlb_force_bounce)
>  		return;
> @@ -371,9 +462,14 @@ void __init swiotlb_exit(void)
> 
>  	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>  	if (mem->late_alloc) {
> +		area_order = get_order(array_size(sizeof(*mem->areas),
> +			mem->nareas));
> +		free_pages((unsigned long)mem->areas, area_order);
>  		free_pages(tbl_vaddr, get_order(tbl_size));
>  		free_pages((unsigned long)mem->slots, get_order(slots_size));
>  	} else {
> +		memblock_free_late(__pa(mem->areas),
> +				   mem->nareas * sizeof(struct io_tlb_area));

Open coding vs. array_size() difference is here as well.

>  		memblock_free_late(mem->start, tbl_size);
>  		memblock_free_late(__pa(mem->slots), slots_size);
>  	}
> @@ -476,9 +572,9 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
>  	return nr_slots(boundary_mask + 1);
>  }
> 
> -static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
> +static unsigned int wrap_area_index(struct io_tlb_mem *mem, unsigned int index)
>  {
> -	if (index >= mem->nslabs)
> +	if (index >= mem->area_nslabs)
>  		return 0;
>  	return index;
>  }
> @@ -487,10 +583,11 @@ static unsigned int wrap_index(struct io_tlb_mem *mem,
> unsigned int index)
>   * Find a suitable number of IO TLB entries size that will fit this request and
>   * allocate a buffer from that IO TLB pool.
>   */
> -static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> -			      size_t alloc_size, unsigned int alloc_align_mask)
> +static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
> +		struct io_tlb_area *area, int area_index,
> +		struct device *dev, phys_addr_t orig_addr,
> +		size_t alloc_size, unsigned int alloc_align_mask)

A nit, but why pass the area and area_index as arguments?  The latter
is sufficient to determine the former.  Arguably, don't need the "mem"
argument either since it can be re-read from the dev.  There are a lot
of arguments to this function, so I was trying to see if they are all
really needed.
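
For example, something along these lines (untested) would trim the argument
list down:

static int swiotlb_do_find_slots(struct device *dev, int area_index,
		phys_addr_t orig_addr, size_t alloc_size,
		unsigned int alloc_align_mask)
{
	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
	struct io_tlb_area *area = mem->areas + area_index;
	...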

>  {
> -	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>  	unsigned long boundary_mask = dma_get_seg_boundary(dev);
>  	dma_addr_t tbl_dma_addr =
>  		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
> @@ -501,8 +598,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  	unsigned int index, wrap, count = 0, i;
>  	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
>  	unsigned long flags;
> +	unsigned int slot_base;
> +	unsigned int slot_index;
> 
>  	BUG_ON(!nslots);
> +	BUG_ON(area_index >= mem->nareas);
> 
>  	/*
>  	 * For mappings with an alignment requirement don't bother looping to
> @@ -514,16 +614,20 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
>  	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
> 
> -	spin_lock_irqsave(&mem->lock, flags);
> -	if (unlikely(nslots > mem->nslabs - mem->used))
> +	spin_lock_irqsave(&area->lock, flags);
> +	if (unlikely(nslots > mem->area_nslabs - area->used))
>  		goto not_found;
> 
> -	index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
> +	slot_base = area_index * mem->area_nslabs;
> +	index = wrap = wrap_area_index(mem, ALIGN(area->index, stride));
> +
>  	do {
> +		slot_index = slot_base + index;
> +
>  		if (orig_addr &&
> -		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
> -			    (orig_addr & iotlb_align_mask)) {
> -			index = wrap_index(mem, index + 1);
> +		    (slot_addr(tbl_dma_addr, slot_index) &
> +		     iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> +			index = wrap_area_index(mem, index + 1);
>  			continue;
>  		}
> 
> @@ -532,26 +636,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t
> orig_addr,
>  		 * contiguous buffers, we allocate the buffers from that slot
>  		 * and mark the entries as '0' indicating unavailable.
>  		 */
> -		if (!iommu_is_span_boundary(index, nslots,
> +		if (!iommu_is_span_boundary(slot_index, nslots,
>  					    nr_slots(tbl_dma_addr),
>  					    max_slots)) {
> -			if (mem->slots[index].list >= nslots)
> +			if (mem->slots[slot_index].list >= nslots)
>  				goto found;
>  		}
> -		index = wrap_index(mem, index + stride);
> +		index = wrap_area_index(mem, index + stride);
>  	} while (index != wrap);
> 
>  not_found:
> -	spin_unlock_irqrestore(&mem->lock, flags);
> +	spin_unlock_irqrestore(&area->lock, flags);
>  	return -1;
> 
>  found:
> -	for (i = index; i < index + nslots; i++) {
> +	for (i = slot_index; i < slot_index + nslots; i++) {
>  		mem->slots[i].list = 0;
> -		mem->slots[i].alloc_size =
> -			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
> +		mem->slots[i].alloc_size = alloc_size - (offset +
> +				((i - slot_index) << IO_TLB_SHIFT));
>  	}
> -	for (i = index - 1;
> +	for (i = slot_index - 1;
>  	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
>  	     mem->slots[i].list; i--)
>  		mem->slots[i].list = ++count;
> @@ -559,14 +663,43 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t
> orig_addr,
>  	/*
>  	 * Update the indices to avoid searching in the next round.
>  	 */
> -	if (index + nslots < mem->nslabs)
> -		mem->index = index + nslots;
> +	if (index + nslots < mem->area_nslabs)
> +		area->index = index + nslots;
>  	else
> -		mem->index = 0;
> -	mem->used += nslots;
> +		area->index = 0;
> +	area->used += nslots;
> +	spin_unlock_irqrestore(&area->lock, flags);
> +	return slot_index;
> +}
> 
> -	spin_unlock_irqrestore(&mem->lock, flags);
> -	return index;
> +static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> +		size_t alloc_size, unsigned int alloc_align_mask)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	int start = raw_smp_processor_id() & ((1U << __fls(mem->nareas)) - 1);

Is the __fls needed?  mem->nareas must be a power of 2, so it seems like
raw_smp_processor_id() & (mem->nareas - 1) would be sufficient.
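
i.e., just:

	int start = raw_smp_processor_id() & (mem->nareas - 1);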

> +	int i = start, index;
> +
> +	do {
> +		index = swiotlb_do_find_slots(mem, mem->areas + i, i,
> +					      dev, orig_addr, alloc_size,
> +					      alloc_align_mask);
> +		if (index >= 0)
> +			return index;
> +		if (++i >= mem->nareas)
> +			i = 0;
> +	} while (i != start);
> +
> +	return -1;
> +}
> +
> +static unsigned long mem_used(struct io_tlb_mem *mem)
> +{
> +	int i;
> +	unsigned long used = 0;
> +
> +	for (i = 0; i < mem->nareas; i++)
> +		used += mem->areas[i].used;
> +	return used;
>  }
> 
>  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> @@ -598,7 +731,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>  		if (!(attrs & DMA_ATTR_NO_WARN))
>  			dev_warn_ratelimited(dev,
>  	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
> -				 alloc_size, mem->nslabs, mem->used);
> +				 alloc_size, mem->nslabs, mem_used(mem));
>  		return (phys_addr_t)DMA_MAPPING_ERROR;
>  	}
> 
> @@ -628,6 +761,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>  	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
>  	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
>  	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
> +	int aindex = index / mem->area_nslabs;
> +	struct io_tlb_area *area = &mem->areas[aindex];
>  	int count, i;
> 
>  	/*
> @@ -636,7 +771,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>  	 * While returning the entries to the free list, we merge the entries
>  	 * with slots below and above the pool being returned.
>  	 */
> -	spin_lock_irqsave(&mem->lock, flags);
> +	BUG_ON(aindex >= mem->nareas);
> +
> +	spin_lock_irqsave(&area->lock, flags);
>  	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
>  		count = mem->slots[index + nslots].list;
>  	else
> @@ -660,8 +797,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>  	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list;
>  	     i--)
>  		mem->slots[i].list = ++count;
> -	mem->used -= nslots;
> -	spin_unlock_irqrestore(&mem->lock, flags);
> +	area->used -= nslots;
> +	spin_unlock_irqrestore(&area->lock, flags);
>  }
> 
>  /*
> @@ -759,12 +896,14 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
>  static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>  					 const char *dirname)
>  {
> +	unsigned long used = mem_used(mem);
> +
>  	mem->debugfs = debugfs_create_dir(dirname, io_tlb_default_mem.debugfs);
>  	if (!mem->nslabs)
>  		return;
> 
>  	debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
> -	debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
> +	debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &used);

If I understand debugfs_create_ulong() correctly, it doesn't work to specify a local
variable address as the last argument.  That address is saved and accessed after this
function exits.
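
Not sure what the best fix is, but one approach (untested sketch, names are
just illustrative) is to compute the value at read time instead of capturing
it once, along the lines of:

static int io_tlb_used_get(void *data, u64 *val)
{
	struct io_tlb_mem *mem = data;

	*val = mem_used(mem);
	return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(fops_io_tlb_used, io_tlb_used_get, NULL, "%llu\n");

and then in swiotlb_create_debugfs_files():

	debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem,
			    &fops_io_tlb_used);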

>  }
> 
>  static int __init __maybe_unused swiotlb_create_default_debugfs(void)
> @@ -815,6 +954,9 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>  	struct io_tlb_mem *mem = rmem->priv;
>  	unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
> 
> +	/* Set Per-device io tlb area to one */
> +	unsigned int nareas = 1;
> +
>  	/*
>  	 * Since multiple devices can share the same pool, the private data,
>  	 * io_tlb_mem struct, will be initialized by the first device attached
> @@ -831,10 +973,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
>  			return -ENOMEM;
>  		}
> 
> +		mem->areas = kcalloc(nareas, sizeof(*mem->areas),
> +				GFP_KERNEL);
> +		if (!mem->areas) {
> +			kfree(mem);
> +			kfree(mem->slots);

These two kfree() calls are in the wrong order.
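
i.e., swapped so that mem->slots isn't referenced after mem has been freed:

			kfree(mem->slots);
			kfree(mem);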

> +			return -ENOMEM;
> +		}
> +
>  		set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
>  				     rmem->size >> PAGE_SHIFT);
>  		swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, SWIOTLB_FORCE,
> -				false);
> +					false, nareas);
>  		mem->for_alloc = true;
> 
>  		rmem->priv = mem;
> --
> 2.25.1
