Date:	Mon, 2 Nov 2015 14:21:37 -0500
From:	Sinan Kaya <okaya@...eaurora.org>
To:	Arnd Bergmann <arnd@...db.de>
Cc:	dmaengine@...r.kernel.org, timur@...eaurora.org,
	cov@...eaurora.org, jcm@...hat.com,
	Rob Herring <robh+dt@...nel.org>,
	Pawel Moll <pawel.moll@....com>,
	Mark Rutland <mark.rutland@....com>,
	Ian Campbell <ijc+devicetree@...lion.org.uk>,
	Kumar Gala <galak@...eaurora.org>,
	Vinod Koul <vinod.koul@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	devicetree@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] dma: add Qualcomm Technologies HIDMA channel driver

On 11/2/2015 11:33 AM, Arnd Bergmann wrote:
> On Sunday 01 November 2015 13:50:53 Sinan Kaya wrote:
>>
>>>> The issue is not writel_relaxed vs. writel. After I issue reset, I need
>>>> to wait for some time to confirm the reset was done. I can use
>>>> readl_polling instead of mdelay if we don't like mdelay.
>>>
>>> I meant that both _relaxed() and mdelay() are probably wrong here.
>>
>> You are right about the redundant writel_relaxed + wmb; together they
>> are effectively equal to writel.
>
> Actually, writel() is wmb()+writel_relaxed(), not the other way round:

I agree. Semantics... but an important one, since the order matters this time.
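
For reference, if I am reading arch/arm64/include/asm/io.h right, the
definition is roughly this (exact form may vary by kernel version):

  #define writel(v,c)   ({ __iowmb(); writel_relaxed((v),(c)); })

i.e. the barrier comes before the store, as you say.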

>
> When sending a command to a device that can start a DMA transfer,
> the barrier is required to ensure that the DMA happens after previously
> written data has gone from the CPU write buffers into the memory that
> is used as the source for the transfer.
>
Agreed.

> A barrier after the writel() has no effect, as MMIO writes are posted
> on the bus.

I had two use cases in the original code. We are talking about the start
routine here; I was referring to the enable/reset/disable uses above.

1. Start routine
--------------
spin_lock
writel_relaxed
spin_unlock

and

2. enable/reset/disable
--------------
writel_relaxed
wmb

I changed writel_relaxed to writel in the start routine and submitted the
second version of the patchset yesterday; I hope you have received it.
Before that, I was relying on the spinlocks for ordering.
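
In case it helps review, here is a minimal sketch of the shape of the
start routine after that change. The struct, field, and register names
below are illustrative only, not the actual driver code:

  #include <linux/io.h>
  #include <linux/spinlock.h>

  /* Hypothetical stand-in for the real channel context. */
  struct hidma_lldev {
          spinlock_t lock;
          void __iomem *trca;       /* channel register space */
          u32 tre_write_offset;     /* next TRE offset for the HW */
  };

  #define TRCA_DOORBELL_REG 0x400   /* illustrative offset */

  static void hidma_ll_start(struct hidma_lldev *lldev)
  {
          unsigned long flags;

          spin_lock_irqsave(&lldev->lock, flags);
          /*
           * writel() = wmb() + writel_relaxed(): the barrier makes the
           * TREs already written to memory visible to the device before
           * the doorbell write that starts the transfer.
           */
          writel(lldev->tre_write_offset,
                 lldev->trca + TRCA_DOORBELL_REG);
          spin_unlock_irqrestore(&lldev->lock, flags);
  }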


>
>> However, after issuing the command, I still need to wait some amount of
>> time until the hardware acknowledges commands like reset/enable/disable.
>> These are relatively fast operations, completing in microseconds. That's
>> why I have mdelay there.
>>
>> I'll take a look at workqueues but it could turn out to be an overkill
>> for few microseconds.
>
> Most devices are able to provide an interrupt for long-running commands.
> Are you sure that yours is unable to do this? If so, is this a design
> mistake or an implementation bug?

I think I was not clear on how long these commands take. These commands
are really fast and get acknowledged in the status register within a few
microseconds. That's why I chose polling.

I was previously waiting up to 10ms, manually sleeping 1 millisecond
between each check using mdelay. I followed your suggestion and got rid
of the mdelay; I now use a polled read, which calls usleep_range as you
suggested.
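
Roughly, the polling now has this shape; readl_poll_timeout() from
linux/iopoll.h sleeps via usleep_range() between reads. The status
register and bit name below are made up for illustration:

  #include <linux/io.h>
  #include <linux/iopoll.h>

  #define CH_RESET_PENDING 0x1      /* illustrative status bit */

  /*
   * Wait for the reset bit to clear: sleep ~1ms between reads
   * (usleep_range under the hood), give up after 10ms. Returns 0 on
   * success or -ETIMEDOUT.
   */
  static int hidma_ll_wait_reset(void __iomem *status_reg)
  {
          u32 val;

          return readl_poll_timeout(status_reg, val,
                                    !(val & CH_RESET_PENDING),
                                    1000, 10000);
  }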

The hardware supports error interrupts, but this is a SW design
philosophy discussion. Why would you want to take an interrupt for a
delay of a few microseconds that only happens during first-time init
from probe?

>
>>>>> Reading the status probably requires a readl() rather than readl_relaxed()
>>>>> to guarantee that the DMA data has arrived in memory by the time that the
>>>>> register data is seen by the CPU. If using readl_relaxed() here is a valid
>>>>> and required optimization, please add a comment to explain why it works
>>>>> and how much you gain.
>>>>
>>>> I will add some description. This is a high-speed peripheral. I don't
>>>> like spreading barriers around like candy inside readl and writel
>>>> unless I have to.
>>>>
>>>> According to the barriers video I watched on YouTube, this should be
>>>> the rule for ordering:
>>>>
>>>> "if you do two relaxed reads and check the results of the returned
>>>> variables, ARM architecture guarantees that these two relaxed variables
>>>> will get observed during the check."
>>>>
>>>> This is called implied ordering, or something of that sort.
>>>
>>> My point was a bit different: while it is guaranteed that the
>>> result of the readl_relaxed() is observed in order, they do not
>>> guarantee that a DMA from device to memory that was started by
>>> the device before the readl_relaxed() has arrived in memory
>>> by the time that the readl_relaxed() result is visible to the
>>> CPU and it starts accessing the memory.
>>>
>> I checked with the hardware designers. The hardware guarantees that by
>> the time the interrupt is observed, all data transactions in flight have
>> been delivered to their respective places and are visible to the CPU.
>> I'll add a comment in the code about this.
>
> I'm curious about this. Does that mean the device is not meant for
> high-performance transfers and just synchronizes the bus before
> triggering the interrupt?

HIDMA, as you probably guessed, stands for high-performance DMA. We went
through several name iterations in the company and this is the one that
stuck.

I'm a SW person; I don't have the expertise to go deeper into the HW
design. I'm following the programming document, which promises coherency
and guaranteed interrupt ordering. "High performance" can mean how fast
you can move data from one location to another vs. how fast you can
queue up multiple requests and get acks in response.

I followed a simple design here. The HW can take multiple requests
simultaneously and gives me an ack via interrupt when it is finished.

If requests are in flight, new requests get queued up in SW and are not
serviced until the previous requests have been acknowledged. Then, as
soon as the HW stops processing, I queue a bunch of other requests and
kick-start it. The current SW design does not allow SW queuing to happen
concurrently with HW processing; I can try that in the next iteration.
This implementation, IMO, is good enough for now and has been working
reliably for a long time (since 2014).
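
A rough sketch of that scheme, with hypothetical names and fields (not
the actual driver code):

  #include <linux/list.h>
  #include <linux/spinlock.h>

  /* Hypothetical channel context, for illustration only. */
  struct hidma_chan {
          spinlock_t lock;
          bool hw_busy;             /* is a batch in flight? */
          struct list_head pending; /* queued in SW */
          struct list_head active;  /* handed to the HW */
  };

  static void hidma_kick(struct hidma_chan *chan)
  {
          spin_lock(&chan->lock);
          if (!chan->hw_busy && !list_empty(&chan->pending)) {
                  /* HW idle: hand over everything queued so far. */
                  list_splice_tail_init(&chan->pending, &chan->active);
                  chan->hw_busy = true;
                  /* ring the doorbell here to start the batch */
          }
          /*
           * If a batch is in flight, requests stay on ->pending until
           * the completion interrupt acks the batch; the handler clears
           * hw_busy and calls this function again.
           */
          spin_unlock(&chan->lock);
  }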

>
>>> In other words, when the hardware sends you data followed by an
>>> interrupt to tell you the data is there, your interrupt handler
>>> can tell the driver that is waiting for this data that the DMA
>>> is complete while the data itself is still in flight, e.g. waiting
>>> for an IOMMU to fetch page table entries.
>>>
>> There is a HW guarantee of ordering.
>>
>> On-demand paging for the IOMMU is only supported for PCIe via PRI (Page
>> Request Interface), not for HIDMA. All other hardware instances work on
>> pinned DMA addresses. I'll add a note about this to the code as well.
>
> I wasn't talking about paging, just fetching the IOTLB from the
> preloaded page tables in RAM. This can take several uncached memory
> accesses, so it would generally be slow.
>
I see.

HIDMA is not aware of the IOMMU's presence, since it follows the DMA API.
Any IOMMU latency is built into the data movement time: by the time the
interrupt happens, the IOMMU lookups and the data movement have already
taken place.
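
In other words, the driver only ever sees bus addresses handed back by
the DMA API; something like this (illustrative only):

  #include <linux/dma-mapping.h>

  /*
   * Illustrative: the driver maps a buffer through the DMA API and
   * programs the returned bus address into the descriptor. Any IOMMU
   * page table setup happens inside dma_map_single(), before the HW
   * can touch the buffer.
   */
  static int hidma_map_src(struct device *dev, void *buf, size_t len,
                           dma_addr_t *out)
  {
          *out = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
          if (dma_mapping_error(dev, *out))
                  return -ENOMEM;
          return 0;
  }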

>
> 	Arnd
>

-- 
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project
