Message-ID: <20200821142804.GR969206@linux.ibm.com>
Date:   Fri, 21 Aug 2020 17:28:04 +0300
From:   Mike Rapoport <rppt@...ux.ibm.com>
To:     Barry Song <song.bao.hua@...ilicon.com>
Cc:     hch@....de, m.szyprowski@...sung.com, robin.murphy@....com,
        will@...nel.org, ganapatrao.kulkarni@...ium.com,
        catalin.marinas@....com, akpm@...ux-foundation.org,
        iommu@...ts.linux-foundation.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        prime.zeng@...ilicon.com, huangdaode@...wei.com,
        linuxarm@...wei.com,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        Nicolas Saenz Julienne <nsaenzjulienne@...e.de>,
        Steve Capper <steve.capper@....com>
Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
 per-numa CMA

On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> Right now, drivers like the ARM SMMU are using dma_alloc_coherent() to get
> coherent DMA buffers to hold their command queues and page tables. As
> there is only one default CMA in the whole system, SMMUs on nodes other
> than node0 will get remote memory. This leads to significant latency.
> 
> This patch provides per-numa CMA so that drivers like the SMMU can get
> local memory. Tests show that localizing CMA decreases dma_unmap latency
> considerably. For instance, before this patch, the SMMU on node2 has to
> wait more than 560ns for the completion of CMD_SYNC in an empty command
> queue; with this patch, it needs only 240ns.
> 
> A positive side effect of this patch is improved performance for users
> who care more about performance than DMA security and use
> iommu.passthrough=1 to bypass the IOMMU. With local CMA, all drivers can
> get local coherent DMA buffers.
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@...wei.com>
> Cc: Christoph Hellwig <hch@....de>
> Cc: Marek Szyprowski <m.szyprowski@...sung.com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Robin Murphy <robin.murphy@....com>
> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@...ium.com>
> Cc: Catalin Marinas <catalin.marinas@....com>
> Cc: Nicolas Saenz Julienne <nsaenzjulienne@...e.de>
> Cc: Steve Capper <steve.capper@....com>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Mike Rapoport <rppt@...ux.ibm.com>
> Signed-off-by: Barry Song <song.bao.hua@...ilicon.com>
> ---
>  -v7: addressing Will's comments:
>  * move to for_each_online_node()
>  * document the default behaviour when users don't specify pernuma_cma
>  * provide a default value for CONFIG_DMA_PERNUMA_CMA
> 
>  .../admin-guide/kernel-parameters.txt         |  11 ++
>  include/linux/dma-contiguous.h                |   6 ++
>  kernel/dma/Kconfig                            |  11 ++
>  kernel/dma/contiguous.c                       | 100 ++++++++++++++++--
>  4 files changed, 118 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bdc1f33fd3d1..c609527fc35a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -599,6 +599,17 @@
>  			altogether. For more information, see
>  			include/linux/dma-contiguous.h
>  
> +	pernuma_cma=nn[MG]

Maybe cma_pernuma or cma_pernode?

> +			[ARM64,KNL]
> +			Sets the size of the kernel per-numa memory area for
> +			contiguous memory allocations. A value of 0 disables
> +			per-numa CMA altogether. If this option is not
> +			specified, the default value is 0.
> +			With per-numa CMA enabled, DMA users on node nid will
> +			first try to allocate buffers from the per-numa area
> +			located on node nid; if that allocation fails, they
> +			will fall back to the global default memory area.
> +
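
With the current spelling, reserving e.g. a 16 MiB area on each online
node would be done with:

	pernuma_cma=16M

on the kernel command line (memparse() accepts the usual K/M/G suffixes).
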
>  	cmo_free_hint=	[PPC] Format: { yes | no }
>  			Specify whether pages are marked as being inactive
>  			when they are freed.  This is used in CMO environments
> diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> index 03f8e98e3bcc..fe55e004f1f4 100644
> --- a/include/linux/dma-contiguous.h
> +++ b/include/linux/dma-contiguous.h
> @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
>  
>  #endif
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +void dma_pernuma_cma_reserve(void);
> +#else
> +static inline void dma_pernuma_cma_reserve(void) { }
> +#endif
> +
>  #endif
>  
>  #endif
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 847a9d1fa634..c38979d45b13 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -118,6 +118,17 @@ config DMA_CMA
>  	  If unsure, say "n".
>  
>  if  DMA_CMA
> +
> +config DMA_PERNUMA_CMA
> +	bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> +	default NUMA && ARM64
> +	help
> +	  Enable this option to get per-numa CMA areas so that devices like
> +	  the ARM64 SMMU can get local memory via the DMA coherent APIs.
> +
> +	  You can set the size of the per-numa CMA by specifying
> +	  "pernuma_cma=size" on the kernel's command line.
> +
>  comment "Default contiguous memory area size:"
>  
>  config CMA_SIZE_MBYTES
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index cff7e60968b9..0383c9b86715 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
>  }
>  early_param("cma", early_cma);
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +
> +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> +static phys_addr_t pernuma_size_bytes __initdata;
> +
> +static int __init early_pernuma_cma(char *p)
> +{
> +	pernuma_size_bytes = memparse(p, &p);
> +	return 0;
> +}
> +early_param("pernuma_cma", early_pernuma_cma);
> +#endif
> +
>  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
>  
>  static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
> @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
>  
>  #endif
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +void __init dma_pernuma_cma_reserve(void)
> +{
> +	int nid;
> +
> +	if (!pernuma_size_bytes)
> +		return;
> +
> +	for_each_online_node(nid) {
> +		int ret;
> +		char name[20];
> +		struct cma **cma = &dma_contiguous_pernuma_area[nid];
> +
> +		snprintf(name, sizeof(name), "pernuma%d", nid);
> +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> +						 0, false, name, cma, nid);
> +		if (ret) {
> +			pr_warn("%s: reservation failed: err %d, node %d\n",
> +				__func__, ret, nid);
> +			continue;
> +		}
> +
> +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> +			(unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> +	}
> +}
> +#endif
> +
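
I assume a later patch in this series calls dma_pernuma_cma_reserve()
from the arch early init path. Conceptually something like the below
(illustrative only, with a hypothetical hook name, not the actual call
site):

void __init arch_early_mm_init(void)	/* hypothetical */
{
	/* must run while memblock is still live, before the buddy takes over */
	dma_pernuma_cma_reserve();
}

since cma_declare_contiguous_nid() reserves the areas via memblock.
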
>  /**
>   * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
>   * @limit: End address of the reserved memory (optional, 0 for any).
> @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
>   * @size:  Requested allocation size.
>   * @gfp:   Allocation flags.
>   *
> - * This function allocates contiguous memory buffer for specified device. It
> - * tries to use device specific contiguous memory area if available, or the
> - * default global one.
> + * This function allocates a contiguous memory buffer for the specified
> + * device, trying the device-specific area first, then the per-numa CMA;
> + * if that allocation fails, it falls back to the default global area.
>   *
> - * Note that it byapss one-page size of allocations from the global area as
> - * the addresses within one page are always contiguous, so there is no need
> - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> + * Note that it bypasses single-page allocations from the per-numa and
> + * global areas, as the addresses within one page are always contiguous;
> + * there is no need to waste CMA pages for those, and it also helps
> + * reduce fragmentation.
>   */
>  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
>  {
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +	int nid = dev_to_node(dev);
> +#endif
> +
>  	/* CMA can be used only in the context which permits sleeping */
>  	if (!gfpflags_allow_blocking(gfp))
>  		return NULL;
>  	if (dev->cma_area)
>  		return cma_alloc_aligned(dev->cma_area, size, gfp);
> -	if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> +	if (size <= PAGE_SIZE)
> +		return NULL;
> +
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
> +		struct cma *cma = dma_contiguous_pernuma_area[nid];

It could be that the reservation failed for some node, and then
dma_contiguous_pernuma_area[nid] would be NULL.
I'd add a fallback to another node here.

> +		struct page *page;
> +
> +		if (cma) {
> +			page = cma_alloc_aligned(cma, size, gfp);
> +			if (page)
> +				return page;
> +		}
> +	}
> +#endif

I think the selection of the area can be put in a helper function and
then here we just try to allocate from the selected area. E.g.

static struct cma *dma_get_cma_area(struct device *dev)
{
#ifdef CONFIG_DMA_PERNUMA_CMA
	int nid = dev_to_node(dev);
	struct cma *cma = NULL;

	if (nid != NUMA_NO_NODE)
		cma = dma_contiguous_pernuma_area[nid];

	if (!cma)
		/* select cma from another node */ ;

	return cma;
#else
	return dma_contiguous_default_area;
#endif
}

struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
{
	struct cma *cma;
	...

	cma = dma_get_cma_area(dev);
	if (!cma)
		return NULL;

	return cma_alloc_aligned(cma, size, gfp);
}
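
For the "select cma from another node" part, a minimal sketch (untested,
and the scan order is just an example; ideally it would prefer
topologically close nodes) could be:

if (!cma) {
	int n;

	/* fall back to any node whose reservation succeeded */
	for_each_online_node(n) {
		if (dma_contiguous_pernuma_area[n]) {
			cma = dma_contiguous_pernuma_area[n];
			break;
		}
	}
}

/* last resort: the global default area */
if (!cma)
	cma = dma_contiguous_default_area;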
 
> +	if (!dma_contiguous_default_area)
>  		return NULL;
> +
>  	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
>  }
>  
> @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
>   */
>  void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
>  {
> -	if (!cma_release(dev_get_cma_area(dev), page,
> -			 PAGE_ALIGN(size) >> PAGE_SHIFT))
> -		__free_pages(page, get_order(size));

Here as well, dev_get_cma_area() can be replaced with, say
dma_get_dev_cma_area(dev, page) that will hide the below logic.
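
Or, since cma_release() already rejects pages that fall outside a given
area (and copes with a NULL area), the release attempts below could be
folded into one boolean helper; a minimal sketch, untested:

static bool dma_cma_release(struct device *dev, struct page *page,
			    unsigned int count)
{
	/* if dev has its own cma, only that area may own the page */
	if (dev->cma_area)
		return cma_release(dev->cma_area, page, count);

#ifdef CONFIG_DMA_PERNUMA_CMA
	/* otherwise try the per-numa area the page's node belongs to */
	if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)],
			page, count))
		return true;
#endif

	return cma_release(dma_contiguous_default_area, page, count);
}

so that the body of dma_free_contiguous() reduces to:

	if (!dma_cma_release(dev, page, PAGE_ALIGN(size) >> PAGE_SHIFT))
		__free_pages(page, get_order(size));
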

> +	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +
> +	/* if dev has its own cma, free page from there */
> +	if (dev->cma_area) {
> +		if (cma_release(dev->cma_area, page, count))
> +			return;
> +	} else {
> +		/*
> +		 * otherwise, page is from either per-numa cma or default cma
> +		 */
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +		if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)],
> +					page, count))
> +			return;
> +#endif
> +		if (cma_release(dma_contiguous_default_area, page, count))
> +			return;
> +	}
> +
> +	/* not in any cma, free from buddy */
> +	__free_pages(page, get_order(size));
>  }
>  
>  /*
> -- 
> 2.27.0
> 
> 

-- 
Sincerely yours,
Mike.
