linux-kernel - Re: [PATCH 1/2] mailbox: switch to hrtimer for tx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55B5FE53.8020005@arm.com>
Date:	Mon, 27 Jul 2015 10:48:03 +0100
From:	Sudeep Holla <sudeep.holla@....com>
To:	Jassi Brar <jassisinghbrar@...il.com>
CC:	Sudeep Holla <sudeep.holla@....com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Juri Lelli <Juri.Lelli@....com>
Subject: Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling



On 27/07/15 04:26, Jassi Brar wrote:
> On Fri, Jul 24, 2015 at 2:17 PM, Sudeep Holla <sudeep.holla@....com> wrote:
>> On 24/07/15 06:02, Jassi Brar wrote:
>>>
>>> On Wed, Jul 22, 2015 at 5:58 PM, Sudeep Holla <sudeep.holla@....com>
>>> wrote:
>>>
>>>> we might end-up waiting
>>>> for atleast a jiffy even though the response for that message from the
>>>> remote is received via interrupt and processed in relatively smaller
>>>> time granularity.
>>>>
>>> That is wrong.
>>
>> No see below.
>>
>>>    If the controller supports TX interrupt it should set txdone_irq,
>>> which prevents polling i.e, controller driver calls mbox_chan_txdone.
>>>
>>>    If the controller doesn't support TX interrupt but the client
>>> receives some ack packet, then the client should set knows_txdone and
>>> call mbox_client_txdone. Again you don't have to wait on polling.
>>>
>>
>> Sorry if I was not clear in the commit message, but I thought I did
>> mention TXDONE_BY_POLL. The case I am referring is definitely not
>> TXDONE_BY_IRQ or TXDONE_BY_ACK.
>>
> That statement is still wrong. The TXDONE_BY_POLL modifier does't make it right.
>

I am fine to modify/clarify that statement.

> Anyways, I see you meant the 3rd case of neither IRQ nor ACK.
>

Yes the remote indicates by setting a flag in status register.

>> Few reasons I have in mind as why we need that:
>>
>> 1. The remote processor is capable of dealing with requests in orders of
>>     milliseconds. E.g. on Juno it reports the DVFS transition latency is
>>     1.2ms. Now if we report that to CPUFreq governor, we need to ensure
>>     that. With current jiffy based timer we see latencies > 10ms.
>>
> Can I see the code somewhere?

Part of Juno SCPI support series[1].

> It seems your remote doesn't send some protocol level 'ack' packet
> replying if the command was successfully executed or not. That means
> Linux can't differentiate successful execution of the command from a
> silent failure (remote still has to set the TX_done flag to make way
> for next messages).

Agreed and again I confirm the remote processor in question just sets
the flag always and correctly and doesn't use a protocol ACK. So I am
not sure the context of the above statement for this particular issue
under discussion.

> From Linux' POV our best bet seems to be to submit
> the request and simply assume it done. So probably do async rather
> than blocking.
>

How if it's not Tx is not based on protocol ACK packet. The timer is
only solution IMO for the case where Tx done is indicated by a status flag.

>> 2. Because of that, under stress testing with multiple clients active at
>>     a time, I am seeing the mailbox buffer overflows quite easily just
>>     because it's blocked on Tx polling(almost 10x slower) and doesn't
>>     process new requests though the remote can handle.
>>
> Yeah this situation may arise. The following fix is needed regardless, I think.
>

IIUC it just triggers poll_txdone when chan->active_req is set, but that
will anyway happen through timer, no ?

Anyway it's not the main topic of discussion here and can be taken up
separately.

> ============================
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 07fd507..a1dded9 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -64,7 +64,12 @@ static void msg_submit(struct mbox_chan *chan)
>
>    spin_lock_irqsave(&chan->lock, flags);
>
> - if (!chan->msg_count || chan->active_req)
> + if (chan->active_req) {
> + err = 0;
> + goto exit;
> + }
> +
> + if (!chan->msg_count)
>    goto exit;
>
>    count = chan->msg_count;
> ============================
>
>> 3. Also there are efforts for scheduler-driven cpu frequency selection
>>     where the desired latency should be as low as possible. Juri(cc-ed)
>>     is working on that and reported this issue with mailbox core.
>>
> Sure, latency should be as low as possible. Sounds similar to (1). Was
> it reported on some mailing list?
>

No, I am reporting it now in the form of this patch (with the solution).

>> Hope this clarifies the reasons for switching to hrtimer.
>>
> I am not against using hrtimer, just need to make sure we don't simply
> suppress the symptoms of wrong implementation.

Agreed, and that's a valid concern. So far based on the testing and
benchmarking done so far, we don't think this patch is suppressing
anything incorrectly.

If you still have concerns with this solution, please explain them here
so that we can discuss and come to conclusion and the issue is fixed.

Regards,
Sudeep

[1] https://lkml.org/lkml/2015/7/23/270
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/