[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5769440D.5030505@gmail.com>
Date: Tue, 21 Jun 2016 15:41:33 +0200
From: Matthias Brugger <matthias.bgg@...il.com>
To: Horng-Shyang Liao <hs.liao@...iatek.com>
Cc: Rob Herring <robh+dt@...nel.org>,
Daniel Kurtz <djkurtz@...omium.org>,
Sascha Hauer <s.hauer@...gutronix.de>,
devicetree@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org,
linux-mediatek@...ts.infradead.org, srv_heupstream@...iatek.com,
Sascha Hauer <kernel@...gutronix.de>,
Philipp Zabel <p.zabel@...gutronix.de>,
Nicolas Boichat <drinkcat@...omium.org>,
CK HU <ck.hu@...iatek.com>,
cawa cheng <cawa.cheng@...iatek.com>,
Bibby Hsieh <bibby.hsieh@...iatek.com>,
YT Shen <yt.shen@...iatek.com>,
Daoyuan Huang <daoyuan.huang@...iatek.com>,
Damon Chu <damon.chu@...iatek.com>,
Josh-YC Liu <josh-yc.liu@...iatek.com>,
Glory Hung <glory.hung@...iatek.com>,
Jiaguang Zhang <jiaguang.zhang@...iatek.com>,
Dennis-YC Hsieh <dennis-yc.hsieh@...iatek.com>,
Monica Wang <monica.wang@...iatek.com>,
jassisinghbrar@...il.com, jaswinder.singh@...aro.org
Subject: Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver
On 21/06/16 07:52, Horng-Shyang Liao wrote:
> On Fri, 2016-06-17 at 17:57 +0200, Matthias Brugger wrote:
>>
>> On 17/06/16 10:28, Horng-Shyang Liao wrote:
>>> Hi Matthias,
>>>
>>> On Tue, 2016-06-14 at 20:07 +0800, Horng-Shyang Liao wrote:
>>>> Hi Matthias,
>>>>
>>>> On Tue, 2016-06-14 at 12:17 +0200, Matthias Brugger wrote:
>>>>>
>>>>> On 14/06/16 09:44, Horng-Shyang Liao wrote:
>>>>>> Hi Matthias,
>>>>>>
>>>>>> On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:
>>>>>>>
>>>>>>> On 08/06/16 14:25, Horng-Shyang Liao wrote:
>>>>>>>> Hi Matthias,
>>>>>>>>
>>>>>>>> On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:
>>>>>>>>>
>>>>>>>>> On 08/06/16 07:40, Horng-Shyang Liao wrote:
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 03/06/16 15:11, Matthias Brugger wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + smp_mb(); /* modify jump before enable thread */
>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + cmdq_thread_writel(thread, task->pa_base +
>>>>>>>>>>>>>>>>>>>>> task->command_size,
>>>>>>>>>>>>>>>>>>>>> + CMDQ_THR_END_ADDR);
>>>>>>>>>>>>>>>>>>>>> + cmdq_thread_resume(thread);
>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>> + list_move_tail(&task->list_entry, &thread->task_busy_list);
>>>>>>>>>>>>>>>>>>>>> + spin_unlock_irqrestore(&cmdq->exec_lock, flags);
>>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +static void cmdq_handle_error_done(struct cmdq *cmdq,
>>>>>>>>>>>>>>>>>>>>> + struct cmdq_thread *thread, u32 irq_flag)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> + struct cmdq_task *task, *tmp, *curr_task = NULL;
>>>>>>>>>>>>>>>>>>>>> + u32 curr_pa;
>>>>>>>>>>>>>>>>>>>>> + struct cmdq_cb_data cmdq_cb_data;
>>>>>>>>>>>>>>>>>>>>> + bool err;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + if (irq_flag & CMDQ_THR_IRQ_ERROR)
>>>>>>>>>>>>>>>>>>>>> + err = true;
>>>>>>>>>>>>>>>>>>>>> + else if (irq_flag & CMDQ_THR_IRQ_DONE)
>>>>>>>>>>>>>>>>>>>>> + err = false;
>>>>>>>>>>>>>>>>>>>>> + else
>>>>>>>>>>>>>>>>>>>>> + return;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> + list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
>>>>>>>>>>>>>>>>>>>>> + list_entry) {
>>>>>>>>>>>>>>>>>>>>> + if (curr_pa >= task->pa_base &&
>>>>>>>>>>>>>>>>>>>>> + curr_pa < (task->pa_base + task->command_size))
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What are you checking here? It seems as if you make some implcit
>>>>>>>>>>>>>>>>>>>> assumptions about pa_base and the order of execution of
>>>>>>>>>>>>>>>>>>>> commands in the
>>>>>>>>>>>>>>>>>>>> thread. Is it save to do so? Does dma_alloc_coherent give any
>>>>>>>>>>>>>>>>>>>> guarantees
>>>>>>>>>>>>>>>>>>>> about dma_handle?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Check what is the current running task in this GCE thread.
>>>>>>>>>>>>>>>>>>> 2. Yes.
>>>>>>>>>>>>>>>>>>> 3. Yes, CMDQ doesn't use iommu, so physical address is continuous.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, physical addresses might be continous, but AFAIK there is no
>>>>>>>>>>>>>>>>>> guarantee that the dma_handle address is steadily growing, when
>>>>>>>>>>>>>>>>>> calling
>>>>>>>>>>>>>>>>>> dma_alloc_coherent. And if I understand the code correctly, you
>>>>>>>>>>>>>>>>>> use this
>>>>>>>>>>>>>>>>>> assumption to decide if the task picked from task_busy_list is
>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>> executing. So I think this mecanism is not working.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't use dma_handle address, and just use physical addresses.
>>>>>>>>>>>>>>>>> From CPU's point of view, tasks are linked by the busy list.
>>>>>>>>>>>>>>>>> From GCE's point of view, tasks are linked by the JUMP command.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In which cases does the HW thread raise an interrupt.
>>>>>>>>>>>>>>>>>> In case of error. When does CMDQ_THR_IRQ_DONE get raised?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> GCE will raise interrupt if any task is done or error.
>>>>>>>>>>>>>>>>> However, GCE is fast, so CPU may get multiple done tasks
>>>>>>>>>>>>>>>>> when it is running ISR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In case of error, that GCE thread will pause and raise interrupt.
>>>>>>>>>>>>>>>>> So, CPU may get multiple done tasks and one error task.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think we should reimplement the ISR mechanism. Can't we just read
>>>>>>>>>>>>>>>> CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
>>>>>>>>>>>>>>>> cmdq_handle_error_done to the thread_fn? You will need to pass
>>>>>>>>>>>>>>>> information from the handler to thread_fn, but that shouldn't be an
>>>>>>>>>>>>>>>> issue. AFAIK interrupts are disabled in the handler, so we should stay
>>>>>>>>>>>>>>>> there as short as possible. Traversing task_busy_list is expensive, so
>>>>>>>>>>>>>>>> we need to do it in a thread context.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Actually, our initial implementation is similar to your suggestion,
>>>>>>>>>>>>>>> but display needs CMDQ to return callback function very precisely,
>>>>>>>>>>>>>>> else display will drop frame.
>>>>>>>>>>>>>>> For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
>>>>>>>>>>>>>>> and CMDQ needs to call callback function in ISR.
>>>>>>>>>>>>>>> If we defer callback to workqueue, the time interval may be larger than
>>>>>>>>>>>>>>> 32 ms.sometimes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the problem is, that you implemented the workqueue as a ordered
>>>>>>>>>>>>>> workqueue, so there is no parallel processing. I'm still not sure why
>>>>>>>>>>>>>> you need the workqueue to be ordered. Can you please explain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The order should be kept.
>>>>>>>>>>>>> Let me use mouse cursor as an example.
>>>>>>>>>>>>> If task 1 means move mouse cursor to point A, task 2 means point B,
>>>>>>>>>>>>> and task 3 means point C, our expected result is A -> B -> C.
>>>>>>>>>>>>> If the order is not kept, the result could become A -> C -> B.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Got it, thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think a way to get rid of the workqueue is to use a timer, which gets
>>>>>>>>>>> programmed to the time a timeout in the first task in the busy list
>>>>>>>>>>> would happen. Everytime we update the busy list (e.g. because of task
>>>>>>>>>>> got finished by the thread), we update the timer. When the timer
>>>>>>>>>>> triggers, which hopefully won't happen too often, we return timeout on
>>>>>>>>>>> the busy list elements, until the time is lower then the actual time.
>>>>>>>>>>>
>>>>>>>>>>> At least with this we can reduce the data structures in this driver and
>>>>>>>>>>> make it more lightweight.
>>>>>>>>>>
>>>>>>>>>> From my understanding, your proposed method can handle timeout case.
>>>>>>>>>>
>>>>>>>>>> However, the workqueue is also in charge of releasing tasks.
>>>>>>>>>> Do you take releasing tasks into consideration by using the proposed
>>>>>>>>>> timer method?
>>>>>>>>>> Furthermore, I think the code will become more complex if we also use
>>>>>>>>>> timer to implement releasing tasks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can't we call
>>>>>>>>> clk_disable_unprepare(cmdq->clock);
>>>>>>>>> cmdq_task_release(task);
>>>>>>>>> after invoking the callback?
>>>
>>> After I put clk_disable_unprepare(cmdq->clock) into ISR, I encounter
>>> another BUG.
>>>
>>> (Quote some Linux 4.7 source code.)
>>>
>>> 605 void clk_unprepare(struct clk *clk)
>>> 606 {
>>> 607 if (IS_ERR_OR_NULL(clk))
>>> 608 return;
>>> 609
>>> 610 clk_prepare_lock(); // <-- Here
>>> 611 clk_core_unprepare(clk->core);
>>> 612 clk_prepare_unlock();
>>> 613 }
>>> 614 EXPORT_SYMBOL_GPL(clk_unprepare);
>>>
>>> 91 static void clk_prepare_lock(void)
>>> 92 {
>>> 93 if (!mutex_trylock(&prepare_lock)) { // <-- Here
>>> 94 if (prepare_owner == current) {
>>> 95 prepare_refcnt++;
>>> 96 return;
>>> 97 }
>>> 98 mutex_lock(&prepare_lock);
>>> 99 }
>>> 100 WARN_ON_ONCE(prepare_owner != NULL);
>>> 101 WARN_ON_ONCE(prepare_refcnt != 0);
>>> 102 prepare_owner = current;
>>> 103 prepare_refcnt = 1;
>>> 104 }
>>>
>>> So, 'unprepare' can sleep and cannot be put into ISR.
>>> I also try to put it into a timer, but the error is the same
>>> since timer callback is executed by softirq.
>>>
>>> We need clk_disable_unprepare() since it can save power consumption
>>> in idle.
>>
>> We can call clk_prepare in probe and then use clk_enable/clk_disable,
>> which don't sleep.
>>
>> Regards,
>> Matthias
>
> Hi Matthias,
>
> Because clock gate and MUX are controlled by clk_enable/clk_disable,
> and PLL is controlled by clk_prepare/clk_unprepare,
> I still need to call clk_unprepare.
>
> After I remove releasing buffer, releasing task, and timeout task from
> work, the work can be detached from task.
>
> Therefore, I can use the following flow to reduce the number of works.
>
> if task_busy_list from empty to non-empty
> clk_prepare_enable
> if task_busy_list from non-empty to empty
> in ISR, add work for clk_disable_unprepare
>
> What do you think of this solution?
Can't we just call clk_prepare in probe and clk_unprepare in remove? I
think this could be a good starting point, and if we see, that we need
to save more energy in the future, we can think of some other mechanism.
What do you think?
Regards,
Matthias
>
> Thanks,
> HS
>
>>> Therefore, I plan to
>>> (1) move releasing buffer and task into ISR,
>>> (2) move timeout into timer, and
>>> (3) keep workqueue for clk_disable_unprepare().
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> HS
>>>
>>>>>>>>
>>>>>>>> Do you mean just call these two functions in ISR?
>>>>>>>> My major concern is dma_free_coherent() and kfree() in
>>>>>>>> cmdq_task_release(task).
>>>>>>>
>>>>>>> Why do we need the dma calls at all? Can't we just calculate the
>>>>>>> physical address using __pa(x)?
>>>>>>
>>>>>> I prefer to use dma_map_single/dma_unmap_single.
>>>>>>
>>>>>
>>>>> Can you please elaborate why you need this. We don't do dma, so we
>>>>> should not use dma memory for this.
>>>>
>>>> We need a buffer to share between CPU and GCE, so we do need DMA.
>>>> CPU is in charge of writing GCE commands into this buffer.
>>>> GCE is in charge of reading and running GCE commands from this buffer.
>>>> When we chain CMDQ tasks, we also need to modify GCE JUMP command.
>>>> Therefore, I prefer to use dma_alloc_coherent and dma_free_coherent.
>>>>
>>>> However, if we want to use timer to handle timeout, we need to release
>>>> memory in ISR.
>>>> In this case, using kmalloc/kfree + dma_map_single/dma_unmap_single
>>>> instead of dma_alloc_coherent/dma_free_coherent is an alternative
>>>> solution, but taking care the synchronization between cache and memory
>>>> is the expected overhead.
>>>>
>>>>>>>> Therefore, your suggestion is to use GFP_ATOMIC for both
>>>>>>>> dma_alloc_coherent() and kzalloc(). Right?
>>>>>>>
>>>>>>> I don't think we need GFP_ATOMIC, the critical path will just free the
>>>>>>> memory.
>>>>>>
>>>>>> I tested these two functions, and kfree was safe.
>>>>>> However, dma_free_coherent raised BUG.
>>>>>> BUG: failure at
>>>>>> /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!
>>>>>
>>>>> Just a general hint. Please try to evaluate on a recent kernel. It looks
>>>>> like as if you tried this on a v3.18 based one.
>>>>
>>>> This driver should be backward compatible to v3.18 for a MTK project.
>>>>
>>>>> Best regards,
>>>>> Matthias
>>>>
>>>> Thanks,
>>>> HS
>>>>
>>>>>> 1512 void vunmap(const void *addr)
>>>>>> 1513 {
>>>>>> 1514 BUG_ON(in_interrupt()); // <-- here
>>>>>> 1515 might_sleep();
>>>>>> 1516 if (addr)
>>>>>> 1517 __vunmap(addr, 0);
>>>>>> 1518 }
>>>>>> 1519 EXPORT_SYMBOL(vunmap);
>>>>>>
>>>>>> Therefore, I plan to use kmalloc + dma_map_single instead of
>>>>>> dma_alloc_coherent, and dma_unmap_single + kfree instead of
>>>>>> dma_free_coherent.
>>>>>>
>>>>>> What do you think about the function replacement?
>>>>>>
>>>>>>>> If so, I can try to implement timeout by timer, and discuss with you
>>>>>>>> if I have further questions.
>>>>>>>>
>>>>>>>
>>>>>>> Sounds good :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthias
>>>>>>
>>>>>> Thanks,
>>>>>> HS
>>>>>>
>>>>>>>>> Regrading the clock, wouldn't it be easier to handle the clock
>>>>>>>>> enable/disable depending on the state of task_busy_list? I suppose we
>>>>>>>>> can't as we would need to check the task_busy_list of all threads, right?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> HS
>
>
Powered by blists - more mailing lists