Message-ID: <CAD=FV=XNw45=NGm9Qj4_Kyq9oQMF6zFaaWz33dfqqDC0A-fF+Q@mail.gmail.com>
Date: Tue, 28 Jul 2020 07:48:44 -0700
From: Doug Anderson <dianders@...omium.org>
To: Stanimir Varbanov <stanimir.varbanov@...aro.org>
Cc: Stephen Boyd <sboyd@...nel.org>, Andy Gross <agross@...nel.org>,
Bjorn Andersson <bjorn.andersson@...aro.org>,
Stephen Boyd <swboyd@...omium.org>,
LKML <linux-kernel@...r.kernel.org>,
linux-arm-msm <linux-arm-msm@...r.kernel.org>,
Maulik Shah <mkshah@...eaurora.org>,
Lina Iyer <ilina@...eaurora.org>,
Rajendra Nayak <rnayak@...eaurora.org>
Subject: Re: [PATCH v2] soc: qcom: rpmh-rsc: Sleep waiting for tcs slots to be free
Hi,
On Sun, Jul 26, 2020 at 2:44 AM Stanimir Varbanov
<stanimir.varbanov@...aro.org> wrote:
>
> Hi Stephen,
>
> On 7/25/20 12:17 AM, Stephen Boyd wrote:
> > From: Stephen Boyd <swboyd@...omium.org>
> >
> > The busy loop in rpmh_rsc_send_data() is written with the assumption
> > that the udelay will be preempted by the tcs_tx_done() irq handler when
> > the TCS slots are all full. This doesn't hold true when the calling
> > thread is an irqthread and the tcs_tx_done() irq is also an irqthread.
> > That's because kernel irqthreads are SCHED_FIFO and thus need to
> > voluntarily give up priority by calling into the scheduler so that other
> > threads can run.
> >
> > I see RCU stalls when I boot with irqthreads on the kernel commandline
> > because the modem remoteproc driver is trying to send an rpmh async
> > message from an irqthread that needs to give up the CPU for the rpmh
> > irqthread to run and clear out tcs slots.
> >
> > rcu: INFO: rcu_preempt self-detected stall on CPU
> > rcu: 0-....: (1 GPs behind) idle=402/1/0x4000000000000002 softirq=2108/2109 fqs=4920
> > (t=21016 jiffies g=2933 q=590)
> > Task dump for CPU 0:
> > irq/11-smp2p R running task 0 148 2 0x00000028
> > Call trace:
> > dump_backtrace+0x0/0x154
> > show_stack+0x20/0x2c
> > sched_show_task+0xfc/0x108
> > dump_cpu_task+0x44/0x50
> > rcu_dump_cpu_stacks+0xa4/0xf8
> > rcu_sched_clock_irq+0x7dc/0xaa8
> > update_process_times+0x30/0x54
> > tick_sched_handle+0x50/0x64
> > tick_sched_timer+0x4c/0x8c
> > __hrtimer_run_queues+0x21c/0x36c
> > hrtimer_interrupt+0xf0/0x22c
> > arch_timer_handler_phys+0x40/0x50
> > handle_percpu_devid_irq+0x114/0x25c
> > __handle_domain_irq+0x84/0xc4
> > gic_handle_irq+0xd0/0x178
> > el1_irq+0xbc/0x180
> > save_return_addr+0x18/0x28
> > return_address+0x54/0x88
> > preempt_count_sub+0x40/0x88
> > _raw_spin_unlock_irqrestore+0x4c/0x6c
> > ___ratelimit+0xd0/0x128
> > rpmh_rsc_send_data+0x24c/0x378
> > __rpmh_write+0x1b0/0x208
> > rpmh_write_async+0x90/0xbc
> > rpmhpd_send_corner+0x60/0x8c
> > rpmhpd_aggregate_corner+0x8c/0x124
> > rpmhpd_set_performance_state+0x8c/0xbc
> > _genpd_set_performance_state+0xdc/0x1b8
> > dev_pm_genpd_set_performance_state+0xb8/0xf8
> > q6v5_pds_disable+0x34/0x60 [qcom_q6v5_mss]
> > qcom_msa_handover+0x38/0x44 [qcom_q6v5_mss]
> > q6v5_handover_interrupt+0x24/0x3c [qcom_q6v5]
> > handle_nested_irq+0xd0/0x138
> > qcom_smp2p_intr+0x188/0x200
> > irq_thread_fn+0x2c/0x70
> > irq_thread+0xfc/0x14c
> > kthread+0x11c/0x12c
> > ret_from_fork+0x10/0x18
> >
> > This busy loop naturally lends itself to using a wait queue so that each
> > thread that tries to send a message will sleep waiting on the waitqueue
> > and only be woken up when a free slot is available. This should make
> > things more predictable too because the scheduler will be able to put
> > tasks that are waiting on a free TCS to sleep instead of having them
> > busy loop the way they do today.
> >
> > Cc: Douglas Anderson <dianders@...omium.org>
> > Cc: Maulik Shah <mkshah@...eaurora.org>
> > Cc: Lina Iyer <ilina@...eaurora.org>
> > Signed-off-by: Stephen Boyd <swboyd@...omium.org>
> > ---
> >
> > Changes in v2:
> > * Document tcs_wait
> > * Move wake_up() outside of the spinlock
> > * Document claim_tcs_for_req()
> >
> > drivers/soc/qcom/rpmh-internal.h | 4 ++
> > drivers/soc/qcom/rpmh-rsc.c | 115 +++++++++++++++----------------
> > 2 files changed, 58 insertions(+), 61 deletions(-)
>
> This also fixes an issue related to TCS busy, seen with Venus driver
> with these [1] patches applied.
>
> Tested-by: Stanimir Varbanov <stanimir.varbanov@...aro.org>
>
> --
> regards,
> Stan
>
> [1] https://lkml.org/lkml/2020/7/23/394
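(For context, the wait queue approach described in the quoted commit
message works out to roughly the sketch below. This is my own
paraphrase of the idea, not the literal patch; the names drv->tcs_wait
and claim_tcs_for_req() come from the v2 changelog notes above, and the
exact condition and locking in the real code may differ.)

  /* Needs <linux/wait.h> and <linux/spinlock.h>. */

  /* Sender path: sleep until a TCS slot can be claimed instead of
   * busy looping with udelay(). wait_event_lock_irq() drops and
   * re-takes drv->lock around the sleep.
   */
  spin_lock_irqsave(&drv->lock, flags);
  wait_event_lock_irq(drv->tcs_wait,
                      (tcs_id = claim_tcs_for_req(drv, tcs, msg)) >= 0,
                      drv->lock);
  /* ... program and trigger the claimed TCS ... */
  spin_unlock_irqrestore(&drv->lock, flags);

  /* Completion (tx done) path: after freeing the slot, wake up any
   * senders sleeping above.
   */
  wake_up(&drv->tcs_wait);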
It worries me a bit that you're saying this fixes an issue for you.
Specifically, I don't see how this patch could fix anything except:
1. Fix the problem with irqthreads, which is how Stephen originally found this.
2. Fix things to be a little more efficient.
3. Avoid the small handful of messages that show up during normal
usage that look like:
TCS Busy, retrying RPMH message send: addr=...
I'm guessing you're referring to #3. Is that correct?
If so, you might want to double-check that your patch series isn't
totally spamming the RPMh bus. In my experience, seeing a lot of
"TCS Busy, retrying RPMH message send" messages on the console was a
sign that the code was being really inefficient.
Specifically, the code that added interconnect bandwidth and OPP
voting to the SPI drivers ran at "runtime_suspend" and
"runtime_resume" time. For our SPI drivers that meant those votes
happened after every single transfer, and they were quite slow. If
your code is doing something similar then you probably have a
problem.
The problem was fixed by adding an autosuspend delay.
References (from linux-next):
8592eb959ad4 spi: spi-qcom-qspi: Set an autosuspend delay of 250 ms
cfdab2cd85ec spi: spi-geni-qcom: Set an autosuspend delay of 250 ms
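The autosuspend setup in those commits is just the standard runtime PM
pattern; roughly (my sketch, not the literal driver code, and 250 ms is
simply the value those commits chose):

  /* In probe(), when setting up runtime PM: */
  pm_runtime_set_autosuspend_delay(dev, 250);
  pm_runtime_use_autosuspend(dev);
  pm_runtime_enable(dev);

  /* Wherever the driver used to call pm_runtime_put(): */
  pm_runtime_mark_last_busy(dev);
  pm_runtime_put_autosuspend(dev);

With that in place runtime_suspend/runtime_resume (and the interconnect
and OPP votes done there) only run once the controller has actually
been idle for a while, instead of around every single transfer.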
Also, if your code is adjusting its clock rate too frequently it can
now trigger similar problems. References:
2124331411a1 spi: spi-qcom-qspi: Avoid clock setting if not needed
5f219524ec6e spi: spi-geni-qcom: Set the clock properly at runtime resume
68890e20942b spi: spi-geni-qcom: Avoid clock setting if not needed
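The "avoid clock setting if not needed" change boils down to caching
the last requested rate and skipping the work when it hasn't changed.
Something like this hand-written example (the struct and function
names here are made up for illustration, not the actual
spi-geni/qspi code):

  /* Only touch the clock (and thus RPMh) when the rate changes. */
  static int my_spi_set_clk(struct my_spi_ctlr *ctlr, unsigned long hz)
  {
          int ret;

          if (hz == ctlr->cur_speed_hz)
                  return 0;       /* nothing to do */

          ret = clk_set_rate(ctlr->se_clk, hz);
          if (ret)
                  return ret;

          ctlr->cur_speed_hz = hz;
          return 0;
  }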
-Doug