lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y30gZm0mO4YNO85d@localhost.localdomain>
Date:   Tue, 22 Nov 2022 14:17:58 -0500
From:   Michal Kubiak <michal.kubiak@...el.com>
To:     Eric Dumazet <edumazet@...gle.com>
CC:     Joerg Roedel <joro@...tes.org>,
        Robin Murphy <robin.murphy@....com>,
        "Will Deacon" <will@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        <netdev@...r.kernel.org>, Eric Dumazet <eric.dumazet@...il.com>,
        <iommu@...ts.linux.dev>, <maciej.fijalkowski@...el.com>,
        <magnus.karlsson@...el.com>
Subject: Re: [PATCH -next] iommu/dma: avoid expensive indirect calls for sync
 operations

On Sat, Nov 12, 2022 at 04:04:52AM +0000, Eric Dumazet wrote:
> Quite often, NIC devices do not need dma_sync operations
> on x86_64 at least.
> 
> Indeed, when dev_is_dma_coherent(dev) is true and
> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> and friends do nothing.
> 
> However, indirectly calling them when CONFIG_RETPOLINE=y
> consumes about 10% of cycles on a cpu receiving packets
> from softirq at ~100Gbit rate, as shown in [1]
> 
> Even if/when CONFIG_RETPOLINE is not set, there
> is a cost of about 3%.
> 
> This patch adds a copy of iommu_dma_ops structure,
> where sync_single_for_cpu, sync_single_for_device,
> sync_sg_for_cpu and sync_sg_for_device are unset.


Larysa from our team has found out this patch introduces also a
functional improvement for batch allocation in AF_XDP while iommmu is
turned on.
In 'xp_alloc_batch()' function there is a check if DMA needs a
synchronization. If so, batch allocation is not supported and we can
allocate only one buffer at a time.

The flag 'dma_need_sync' is being set according to the value returned by
the function 'dma_need_sync()' (from '/kernel/dma/mapping.c').
That function only checks if at least one of two DMA ops is defined:
'ops->sync_single_for_cpu' or 'ops->sync_single_for_device'.

> +static const struct dma_map_ops iommu_nosync_dma_ops = {
> +	iommu_dma_ops_common_fields
> +
> +	.sync_single_for_cpu	= NULL,
> +	.sync_single_for_device	= NULL,
> +	.sync_sg_for_cpu	= NULL,
> +	.sync_sg_for_device	= NULL,
> +};
> +#undef iommu_dma_ops_common_fields
> +
>  /*
>   * The IOMMU core code allocates the default DMA domain, which the underlying
>   * IOMMU driver needs to support via the dma-iommu layer.
> @@ -1586,7 +1612,8 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
>  	if (iommu_is_dma_domain(domain)) {
>  		if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev))
>  			goto out_err;
> -		dev->dma_ops = &iommu_dma_ops;
> +		dev->dma_ops = dev_is_dma_sync_needed(dev) ?
> +				&iommu_dma_ops : &iommu_nosync_dma_ops;
>  	}
>  
>  	return;

 This code removes defining 'sync_*' DMA ops if they are not actually
 used. Thanks to that improvement the function 'dma_need_sync()' will
 always return more meaningful information if any DMA synchronization is
 actually needed for iommu.

 Together with Larysa we have applied that patch and we can confirm it
 helps for batch buffer allocation in AF_XDP ('xsk_buff_alloc_batch()'
 call) when iommu is enabled.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ