lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABb+yY1T03YLwiFvBykxsAHQ9Kpu=r1nRTuaP3Emf5dP=Upm0g@mail.gmail.com>
Date:   Wed, 10 Jun 2020 10:21:19 -0500
From:   Jassi Brar <jassisinghbrar@...il.com>
To:     Sudeep Holla <sudeep.holla@....com>
Cc:     linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Rob Herring <robh@...nel.org>,
        Frank Rowand <frowand.list@...il.com>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        "arnd@...db.de" <arnd@...db.de>,
        Jassi Brar <jaswinder.singh@...aro.org>
Subject: Re: [PATCH] firmware: arm_scmi: fix timeout value for send_message

On Wed, Jun 10, 2020 at 3:23 AM Sudeep Holla <sudeep.holla@....com> wrote:
>
> On Sun, Jun 07, 2020 at 02:30:23PM -0500, jassisinghbrar@...il.com wrote:
> > From: Jassi Brar <jaswinder.singh@...aro.org>
> >
> > Currently scmi_do_xfer() submits a message to mailbox api and waits
> > for an apparently very short time. This works if there are not many
> > messages in the queue already. However, if many clients share a
> > channel and/or each client submits many messages in a row, the
>
> The recommendation in such scenarios is to use multiple channel.
>
If SCMI is to be accepted as a standard (which I hope), it has to
support most kinds of controllers, but currently the implementation is
too myopic. It is only a matter of time, when someone sees value in
reusing firmware implementation (scmi) but does not have a MHU like
controller.

> > timeout value becomes too short and returns error even if the mailbox
> > is working fine according to the load. The timeout occurs when the
> > message is still in the api/queue awaiting its turn to ride the bus.
> >
> >  Fix this by increasing the timeout value enough (500ms?) so that it
> > fails only if there is an actual problem in the transmission (like a
> > lockup or crash).
> >
> > [If we want to capture a situation when the remote didn't
> > respond within expected latency, then the timeout should not
> > start here, but from tx_prepare callback ... just before the
> > message physically gets on the channel]
> >
>
> The bottle neck may not be in the remote. It may be mailbox serialising
> the requests even when it can parallelise.
>
Your logs show (in your test case), using 1 physical channel shows
better transfer (those that complete) rates than virtual channels.
The transfers that fail are purely because of this short timeout.

> >
> >       if (xfer->hdr.poll_completion) {
> > -             ktime_t stop = ktime_add_ns(ktime_get(), SCMI_MAX_POLL_TO_NS);
> > +             ktime_t stop = ktime_add_ns(ktime_get(), 500 * 1000 * NSEC_PER_USEC);
> >
>
> This is unacceptable delay for schedutil fast_switch. So no for this one.
>
Increasing timeout does not increase latency.
Also scmi_xfer() can not know if it was reached from the fast_switch path.

If a platform has many users over a channel such that it can not
guarantee low enough latency, then it must not set the
fast_switch_possible flag, which is optional for this reason.


> > @@ -313,7 +313,7 @@ int scmi_do_xfer(const struct scmi_handle *handle, struct scmi_xfer *xfer)
> >                       ret = -ETIMEDOUT;
> >       } else {
> >               /* And we wait for the response. */
> > -             timeout = msecs_to_jiffies(info->desc->max_rx_timeout_ms);
> > +             timeout = msecs_to_jiffies(500);
>
> In general, this hides issues in the remote.
>
If you want to uncover remote issues, start the timeout in
tx_prepare() because that is when the message is physically sent to
the remote.

> We are trying to move towards
> tops 1ms for a request and with MBOX_QUEUE at 20, I see 20ms is more that
> big enough. We have it set to 30ms now. 500ms is way too large and not
> required IMO.
>
Again, increasing timeout does not slow the system down. It is to
support more variety of platform setups.

Cheers!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ