Message-ID: <TY2PR0101MB313667D70F284FC28BE4039B84C3A@TY2PR0101MB3136.apcprd01.prod.exchangelabs.com>
Date:   Tue, 26 Sep 2023 04:33:12 +0000
From:   Kelly Devilliv <kelly.devilliv@...look.com>
To:     Christian König <christian.koenig@....com>,
        Robin Murphy <robin.murphy@....com>,
        "joro@...tes.org" <joro@...tes.org>,
        "will@...nel.org" <will@...nel.org>
CC:     "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: dma_map_resource() has a bad performance in pcie peer to peer transactions when iommu enabled in Linux

On 2023-09-26 01:58, Christian König wrote:
> On 25.09.23 16:17, Kelly Devilliv wrote:
>> On 2023-09-25 19:16, Robin Murphy wrote:
>>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>>> Dear all,
>>>>
>>>> I am working on an ARMv8 server with two GPU cards on it. Recently,
>>>> I needed to test PCIe peer-to-peer communication between the two GPU
>>>> cards, but the throughput is only 4GB/s.
>>>> After exploring the GPU's kernel-mode driver, I found it was using
>>>> the dma_map_resource() API to map the peer device's MMIO space. The Arm
>>>> IOMMU driver then hardcodes an 'IOMMU_MMIO' prot in the subsequent DMA map:
>>>>           static dma_addr_t iommu_dma_map_resource(struct device *dev,
>>>>                           phys_addr_t phys, size_t size,
>>>>                           enum dma_data_direction dir, unsigned long attrs)
>>>>           {
>>>>                   return __iommu_dma_map(dev, phys, size,
>>>>                                   dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
>>>>                                   dma_get_mask(dev));
>>>>           }
>>>>
>>>> And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute in
>>>> the PTE, which may have a negative impact on the performance of PCIe
>>>> peer-to-peer transactions.
>>>>           /*
>>>>            * Note that this logic is structured to accommodate Mali LPAE
>>>>            * having stage-1-like attributes but stage-2-like permissions.
>>>>            */
>>>>           if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>>>               data->iop.fmt == ARM_32_LPAE_S2) {
>>>>                   if (prot & IOMMU_MMIO)
>>>>                           pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>>>                   else if (prot & IOMMU_CACHE)
>>>>                           pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>>>                   else
>>>>                           pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>>>           } else {
>>>>                   if (prot & IOMMU_MMIO)
>>>>                           pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>>>                                   << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>                   else if (prot & IOMMU_CACHE)
>>>>                           pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>>>                                   << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>           }
>>>>
>>>> I tried removing the 'IOMMU_MMIO' prot in the dma_map_resource() API
>>>> and re-compiling the Linux kernel; the throughput can then reach up to 28GB/s.
>>>> Is there an elegant way to solve this issue without modifying the Linux kernel,
>>>> e.g., a substitute for the dma_map_resource() API?
>>>
>>> Not really. Other use-cases for dma_map_resource() include DMA
>>> offload engines accessing FIFO registers, where allowing reordering,
>>> write-gathering, etc. would be a terrible idea. Thus it needs to
>>> assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
>>>
>>> However, the "proper" PCI peer-to-peer support under
>>> CONFIG_PCI_P2PDMA ended up moving away from the dma_map_resource()
>>> approach anyway, and allows this kind of device memory to be treated
>>> more like regular memory (via ZONE_DEVICE) rather than arbitrary MMIO
>>> resources, so your best bet would be to get the GPU driver converted
>>> over to using that.
>>
>> Thanks Robin.
>> So your suggestion is that we should work out a new implementation,
>> like what is done under CONFIG_PCI_P2PDMA, instead of just using the
>> dma_map_resource() API?
>>
>> I have looked at the GPU drivers from AMD, Nvidia and habanalabs, for
>> example, and found that they all use the dma_map_resource() API to map
>> the peer device's BAR address.
>> If so, could this be a common performance issue in PCI peer-to-peer
>> scenarios?
>
> That's not an issue, but expected behavior.
>
> When you enable the IOMMU, every transaction needs to go through the
> root complex for address translation, and you completely lose the
> performance benefit of PCIe P2P.

Thanks Christian. That's true.

>
> This is a hardware limitation and not really related to
> dma_map_resource() in any way.
>

But when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the performance improved significantly (from 4GB/s to 28GB/s), which is almost the same as with the IOMMU disabled. So I guess that, in my fairly common PCI topology, what really matters may not be whether the IOMMU is enabled, but rather the attributes used in the DMA mapping, i.e. the memory attributes in the Arm PTE.
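
For reference, the change I tested was roughly the following (only a sketch against the 5.10 iommu_dma_map_resource() quoted above, to show the experiment, not a proposed fix):

          static dma_addr_t iommu_dma_map_resource(struct device *dev,
                          phys_addr_t phys, size_t size,
                          enum dma_data_direction dir, unsigned long attrs)
          {
                  /*
                   * Experiment only: drop IOMMU_MMIO so the mapping is no
                   * longer forced to the Device memory type. This gives up
                   * the "safe" MMIO semantics Robin described.
                   */
                  return __iommu_dma_map(dev, phys, size,
                                  dma_info_to_prot(dir, false, attrs),
                                  dma_get_mask(dev));
          }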

I don't know if there is a way to make the memory attributes more configurable, so that this kind of peer BAR mapping can be distinguished from the "safe" MMIO memory type, which on Arm means Device-nGnRE, as Robin said.
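
If we instead follow Robin's suggestion and move to CONFIG_PCI_P2PDMA, my rough understanding of the provider/client flow is something like the sketch below (using helpers from include/linux/pci-p2pdma.h; the BAR index, sizes and error handling here are made up for illustration):

          #include <linux/pci.h>
          #include <linux/pci-p2pdma.h>
          #include <linux/sizes.h>

          static int p2p_example(struct pci_dev *provider, struct pci_dev *client)
          {
                  struct device *clients[] = { &client->dev };
                  pci_bus_addr_t bus_addr;
                  void *buf;
                  int ret;

                  /*
                   * Export (part of) the provider GPU's BAR as ZONE_DEVICE
                   * p2p memory instead of mapping it with dma_map_resource().
                   */
                  ret = pci_p2pdma_add_resource(provider, 0 /* BAR */, SZ_1M, 0);
                  if (ret)
                          return ret;

                  /* Check that the PCI topology actually allows p2p between them. */
                  if (pci_p2pdma_distance_many(provider, clients, 1, true) < 0)
                          return -ENXIO;

                  /* Allocate from the exported BAR and get a bus address for the peer. */
                  buf = pci_alloc_p2pmem(provider, SZ_64K);
                  if (!buf)
                          return -ENOMEM;
                  bus_addr = pci_p2pmem_virt_to_bus(provider, buf);

                  /* ... program the client GPU to DMA to/from bus_addr ... */

                  pci_free_p2pmem(provider, buf, SZ_64K);
                  return 0;
          }

Whether the SMMU then maps those pages with Normal memory attributes on our platform, and what throughput that actually gives, is something we would still need to measure.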

Sincerely,
Kelly

> Regards,
> Christian.
>
>>
>>> Thanks,
>>> Robin.
>>>
>>>> Thank you!
>>>>
>>>> Platform info:
>>>> Linux kernel version: 5.10
>>>> PCIE GEN4 x16
>>>>
>>>> Sincerely,
>>>> Kelly
>>>>
