Message-ID: <TY2PR0101MB3136A7480B842E009BCBDA8384C3A@TY2PR0101MB3136.apcprd01.prod.exchangelabs.com>
Date:   Tue, 26 Sep 2023 15:30:57 +0000
From:   Kelly Devilliv <kelly.devilliv@...look.com>
To:     Christian König <christian.koenig@....com>,
        Robin Murphy <robin.murphy@....com>,
        "joro@...tes.org" <joro@...tes.org>,
        "will@...nel.org" <will@...nel.org>
CC:     "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: dma_map_resource() has a bad performance in pcie peer to peer
 transactions when iommu enabled in Linux

On 2023-09-26 13:32, Christian König wrote:
> Am 26.09.23 um 06:33 schrieb Kelly Devilliv:
>> On 2023-09-26 01:58, Christian König wrote:
>>> Am 25.09.23 um 16:17 schrieb Kelly Devilliv:
>>>> On 2023-09-25 19:16, Robin Murphy wrote:
>>>>> On 2023-09-25 04:59, Kelly Devilliv wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I am working on an ARMv8 server with two GPU cards in it.
>>>>>> Recently I needed to test PCIe peer-to-peer communication between
>>>>>> the two GPU cards, but the throughput is only 4GB/s.
>>>>>> After exploring the GPU's kernel-mode driver, I found it was
>>>>>> using the dma_map_resource() API to map the peer device's MMIO
>>>>>> space. The Arm IOMMU driver then hardcodes an 'IOMMU_MMIO' prot in
>>>>>> the subsequent DMA mapping:
>>>>>>           static dma_addr_t iommu_dma_map_resource(struct device *dev,
>>>>>>                           phys_addr_t phys, size_t size,
>>>>>>                           enum dma_data_direction dir, unsigned long attrs)
>>>>>>           {
>>>>>>                   return __iommu_dma_map(dev, phys, size,
>>>>>>                                   dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
>>>>>>                                   dma_get_mask(dev));
>>>>>>           }
>>>>>>
>>>>>> That in turn sets the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute (or the
>>>>>> Device MAIR index for stage 1) in the PTE, which may have a negative
>>>>>> impact on the performance of PCIe peer-to-peer transactions.
>>>>>>            /*
>>>>>>             * Note that this logic is structured to accommodate Mali LPAE
>>>>>>             * having stage-1-like attributes but stage-2-like permissions.
>>>>>>             */
>>>>>>            if (data->iop.fmt == ARM_64_LPAE_S2 ||
>>>>>>                data->iop.fmt == ARM_32_LPAE_S2) {
>>>>>>                    if (prot & IOMMU_MMIO)
>>>>>>                            pte |= ARM_LPAE_PTE_MEMATTR_DEV;
>>>>>>                    else if (prot & IOMMU_CACHE)
>>>>>>                            pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
>>>>>>                    else
>>>>>>                            pte |= ARM_LPAE_PTE_MEMATTR_NC;
>>>>>>            } else {
>>>>>>                    if (prot & IOMMU_MMIO)
>>>>>>                            pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
>>>>>>                                    << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>>>                    else if (prot & IOMMU_CACHE)
>>>>>>                            pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>>>>>                                    << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>>>>>            }
>>>>>>
>>>>>> I tried removing the 'IOMMU_MMIO' prot from the dma_map_resource()
>>>>>> path and recompiling the Linux kernel; the throughput then reaches
>>>>>> up to 28GB/s.
>>>>>> Is there an elegant way to solve this issue without modifying the
>>>>>> Linux kernel, e.g., a substitute for the dma_map_resource() API?
>>>>> Not really. Other use-cases for dma_map_resource() include DMA
>>>>> offload engines accessing FIFO registers, where allowing
>>>>> reordering, write-gathering, etc. would be a terrible idea. Thus it
>>>>> needs to assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
>>>>>
>>>>> However, the "proper" PCI peer-to-peer support under
>>>>> CONFIG_PCI_P2PDMA ended up moving away from the dma_map_resource()
>>>>> approach anyway, and allows this kind of device memory to be
>>>>> treated more like regular memory (via
>>>>> ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet
>>>>> would be to get the GPU driver converted over to using that.
>>>> Thanks Robin.
>>>> So your suggestion is that we should work out a new implementation
>>>> along the lines of what CONFIG_PCI_P2PDMA does, instead of just using
>>>> the dma_map_resource() API?
>>>>
>>>> I have looked at the GPU drivers from AMD, Nvidia and habanalabs, for
>>>> example, and found that they all use the dma_map_resource() API to map
>>>> the peer device's BAR address.
>>>> If so, could this be a common performance issue in PCI peer-to-peer
>>>> scenarios?
>>> That's not an issue, but expected behavior.
>>>
>>> When you enable IOMMU every transaction needs to go through the root
>>> complex for address translation and you completely lose the
>>> performance benefit of PCIe P2P.
>> Thanks Christian. That's true.
>>
>>> This is a hardware limitation and not really related to
>>> dma_map_resource() in any way.
>>>
>> But when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the
>> performance improved significantly (from 4GB/s to 28GB/s), which is
>> almost the same as with the IOMMU disabled. So I guess that in my common
>> PCI topology, what really matters may not be whether the IOMMU is enabled,
>> but rather the memory attributes set in the DMA mapping, i.e. in the Arm PTE.
>
> The key point is that nobody really supports that configuration, so you probably
> will find nobody looking into it.
>
> BTW: ARM isn't really supported as a platform for amdgpu either. E.g. we have
> seen tons of boards which implement the PCIe standard incorrectly, if you run
> into any trouble with that you are pretty much on your own.
>

Thanks Christian. I am going to disable the IOMMU or apply some workaround for the PCI peer-to-peer scenario.

>> I don't know if there is a way to make the memory attributes more configurable in order to be distinguished
>> from the "safe" MMIO memory type, which on Arm means Device-nGnRE as Robin said.
>
> Well we would need to extend dma_map_resource() to include some use case
> so that the mapping attributes don't need to be guessed.
>

Hi Robin, 

Is there any chance to extend the dma_map_resource() API as discussed above?
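
To make the question concrete, below is a minimal userspace sketch in C (not kernel code) of the stage-1 branch quoted earlier, plus one hypothetical way a caller-supplied hint could select the attribute. The flag and index values are written from memory of the v5.10 sources and should be treated as assumptions. The names model_s1_attr_index(), model_map_resource_prot() and the "relaxed" hint are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed values, mirrored from memory of include/linux/iommu.h and
 * drivers/iommu/io-pgtable-arm.c in v5.10. Check them against your tree.
 */
#define IOMMU_READ   (1u << 0)
#define IOMMU_WRITE  (1u << 1)
#define IOMMU_CACHE  (1u << 2)
#define IOMMU_MMIO   (1u << 4)

#define ARM_LPAE_PTE_ATTRINDX_SHIFT   2
#define ARM_LPAE_MAIR_ATTR_IDX_NC     0
#define ARM_LPAE_MAIR_ATTR_IDX_CACHE  1
#define ARM_LPAE_MAIR_ATTR_IDX_DEV    2

/* Hypothetical helper: models only the stage-1 (non-S2) branch. */
uint64_t model_s1_attr_bits(unsigned int prot)
{
	uint64_t pte = 0;

	if (prot & IOMMU_MMIO)
		pte |= (uint64_t)ARM_LPAE_MAIR_ATTR_IDX_DEV
			<< ARM_LPAE_PTE_ATTRINDX_SHIFT;
	else if (prot & IOMMU_CACHE)
		pte |= (uint64_t)ARM_LPAE_MAIR_ATTR_IDX_CACHE
			<< ARM_LPAE_PTE_ATTRINDX_SHIFT;
	/* Otherwise nothing is OR'ed in, which leaves index 0. */

	return pte;
}

/* Hypothetical helper: extracts the 3-bit AttrIndx field. */
unsigned int model_s1_attr_index(unsigned int prot)
{
	return (unsigned int)(model_s1_attr_bits(prot)
			>> ARM_LPAE_PTE_ATTRINDX_SHIFT) & 0x7u;
}

/*
 * Hypothetical extension: today IOMMU_MMIO is always added for
 * dma_map_resource(); here a caller hint ("relaxed") would drop it.
 * dir_prot stands in for the result of dma_info_to_prot().
 */
unsigned int model_map_resource_prot(unsigned int dir_prot, int relaxed)
{
	return relaxed ? dir_prot : (dir_prot | IOMMU_MMIO);
}
```

Under these assumptions, a mapping that keeps IOMMU_MMIO selects the Device index, while the same mapping with the hint set falls through to index 0, which is the non-cacheable Normal memory slot in the driver's MAIR layout as far as I can tell. That would be consistent with the throughput difference I measured, but it is only a model of the attribute selection, not a proposal for the actual interface.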

Sincerely,
Kelly

> Regards,
> Christian.
>
>>
>> Sincerely,
>> Kelly
>>
>>> Regards,
>>> Christian.
>>>
>>>>> Thanks,
>>>>> Robin.
>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Platform info:
>>>>>> Linux kernel version: 5.10
>>>>>> PCIE GEN4 x16
>>>>>>
>>>>>> Sincerely,
>>>>>> Kelly
>>>>>>
