linux-kernel - Re: [PATCH] coresight: tmc-etr: Don't enable ETR when it's not ready

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALJ9ZPNa41rO3iG1WhWAF5fBTUGF1ULUG62_zMTA0EwzGnw=8w@mail.gmail.com>
Date:   Tue, 24 Jan 2023 12:09:29 -0800
From:   Yabin Cui <yabinc@...gle.com>
To:     Suzuki K Poulose <suzuki.poulose@....com>
Cc:     Mike Leach <mike.leach@...aro.org>,
        James Clark <james.clark@....com>, coresight@...ts.linaro.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        Mathieu Poirier <mathieu.poirier@...aro.org>,
        Leo Yan <leo.yan@...aro.org>
Subject: Re: [PATCH] coresight: tmc-etr: Don't enable ETR when it's not ready

Ping for review. And I still can't reproduce it, even if I reduced the
timeout to 2us and tried different workloads. Any suggestions for how
to reproduce it?

Thanks,
Yabin

On Tue, Jan 10, 2023 at 1:06 PM Yabin Cui <yabinc@...gle.com> wrote:
>
> > Do you have a reproducer for this or some more info?
> > For example is it a regression or has it always been there? And on which
> > platform.
>
> It happens on Pixel 6 and 7. We collect ETM data periodically from some
> internal dogfood devices. The problem has happened several times on
> dogfood devices. But I am still trying to reproduce it locally.
>
> We use the scatter-gather mode of ETR, and allocate a 4M buffer. In userspace,
> we use simpleperf in Android to collect system wide ETM data. What is special
> is, simpleperf disables and reenables perf events every 100ms to flush ETM
> data to perf aux buffer.
>
> Pixel 6 and 7 have hardware monitoring AXI traffic. The hardware finds ETR is
> trying to read from or write to a low invalid address (like 0x2E0000). The
> problem always happens right after the "tmc_etr: timeout while waiting for TMC
> to be Ready" message. And in almost all cases, I can find a "timeout while
> waiting for completion of Manual Flush" message from the previous session.
>
> One log history is below:
> [11484.610008][    C0] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [11484.610177][    C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [11484.615367][    C0] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [11484.615534][    C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [12089.486044][    C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> AXI error report reading from invalid address
>
> Another log history is below:
> [76709.382650][    C5] coresight tmc_etf1: timeout while waiting for
> TMC to be Ready
> [76709.382852][    C7] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [76709.382995][    C7] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [76709.384510][    C7] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [76709.384649][    C7] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [76709.384838][    C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> AIX error report writing to invalid address
>
> It seems if the previous manual flush doesn't finish gracefully, ETR may not be
> ready for the next enable (even after 10min as in the first log). And if we
> continue to enable ETR, an invalid AXI IO may happen.
>
> Thanks,
> Yabin
>
> On Tue, Jan 10, 2023 at 10:04 AM Suzuki K Poulose
> <suzuki.poulose@....com> wrote:
> >
> > On 10/01/2023 17:48, Mike Leach wrote:
> > > Hi,
> > >
> > > On Tue, 10 Jan 2023 at 09:30, James Clark <james.clark@....com> wrote:
> > >>
> > >>
> > >>
> > >> On 09/01/2023 23:43, Yabin Cui wrote:
> > >>> Otherwise, it may cause error in AXI bus and result in a kernel panic.
> > >>
> > >> Hi Yabin,
> > >>
> > >> Thanks for the fix. Do you have a reproducer for this or some more info?
> > >> For example is it a regression or has it always been there? And on which
> > >> platform.
> > >>
> > >> Thanks
> > >> James
> > >>
> > >>>
> > >>> Signed-off-by: Yabin Cui <yabinc@...gle.com>
> > >>> ---
> > >>>   .../hwtracing/coresight/coresight-tmc-core.c   |  4 +++-
> > >>>   .../hwtracing/coresight/coresight-tmc-etr.c    | 18 +++++++++++++++---
> > >>>   drivers/hwtracing/coresight/coresight-tmc.h    |  2 +-
> > >>>   3 files changed, 19 insertions(+), 5 deletions(-)
> > >>>
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-core.c b/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> index 07abf28ad725..c106d142e632 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> @@ -31,7 +31,7 @@ DEFINE_CORESIGHT_DEVLIST(etb_devs, "tmc_etb");
> > >>>   DEFINE_CORESIGHT_DEVLIST(etf_devs, "tmc_etf");
> > >>>   DEFINE_CORESIGHT_DEVLIST(etr_devs, "tmc_etr");
> > >>>
> > >>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>>   {
> > >>>        struct coresight_device *csdev = drvdata->csdev;
> > >>>        struct csdev_access *csa = &csdev->access;
> > >>> @@ -40,7 +40,9 @@ void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>>        if (coresight_timeout(csa, TMC_STS, TMC_STS_TMCREADY_BIT, 1)) {
> > >>>                dev_err(&csdev->dev,
> > >>>                        "timeout while waiting for TMC to be Ready\n");
> > >>> +             return -EBUSY;
> > >>>        }
> > >>> +     return 0;
> > >>>   }
> > >>>
> > >>>   void tmc_flush_and_stop(struct tmc_drvdata *drvdata)
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> index 867ad8bb9b0c..2da99dd41ed6 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> @@ -983,15 +983,21 @@ static void tmc_sync_etr_buf(struct tmc_drvdata *drvdata)
> > >>>        etr_buf->ops->sync(etr_buf, rrp, rwp);
> > >>>   }
> > >>>
> > >>> -static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>> +static int __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>>   {
> > >>>        u32 axictl, sts;
> > >>>        struct etr_buf *etr_buf = drvdata->etr_buf;
> > >>> +     int rc = 0;
> > >>>
> > >>>        CS_UNLOCK(drvdata->base);
> > >>>
> > >>>        /* Wait for TMCSReady bit to be set */
> > >>> -     tmc_wait_for_tmcready(drvdata);
> > >>> +     rc = tmc_wait_for_tmcready(drvdata);
> > >>> +     if (rc) {
> > >>> +             dev_err(&drvdata->csdev->dev, "not ready ETR isn't enabled\n");
> > >>> +             CS_LOCK(drvdata->base);
> > >>> +             return rc;
> > >>> +     }
> > >>>
> > >>>        writel_relaxed(etr_buf->size / 4, drvdata->base + TMC_RSZ);
> > >>>        writel_relaxed(TMC_MODE_CIRCULAR_BUFFER, drvdata->base + TMC_MODE);
> > >>> @@ -1032,6 +1038,7 @@ static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>>        tmc_enable_hw(drvdata);
> > >>>
> > >>>        CS_LOCK(drvdata->base);
> > >>> +     return rc;
> > >>>   }
> > >>>
> > >>>   static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
> > >>> @@ -1060,7 +1067,12 @@ static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
> > >>>        rc = coresight_claim_device(drvdata->csdev);
> > >>>        if (!rc) {
> > >>>                drvdata->etr_buf = etr_buf;
> > >>> -             __tmc_etr_enable_hw(drvdata);
> > >>> +             rc = __tmc_etr_enable_hw(drvdata);
> > >>> +             if (rc) {
> > >>> +                     drvdata->etr_buf = NULL;
> > >>> +                     coresight_disclaim_device(drvdata->csdev);
> > >>> +                     tmc_etr_disable_catu(drvdata);
> > >>> +             }
> > >>>        }
> > >>>
> > >>>        return rc;
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc.h b/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> index 66959557cf39..01c0382a29c0 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> @@ -255,7 +255,7 @@ struct tmc_sg_table {
> > >>>   };
> > >>>
> > >>>   /* Generic functions */
> > >>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
> > >>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
> > >>>   void tmc_flush_and_stop(struct tmc_drvdata *drvdata);
> > >>>   void tmc_enable_hw(struct tmc_drvdata *drvdata);
> > >>>   void tmc_disable_hw(struct tmc_drvdata *drvdata);
> > >
> > > There is no point in waiting for a timeout, and then carrying on even
> > > when it is exceeded. As such this patch seems reasonable.
> > > We should also apply the same principle to the ETF and ETB devices
> > > which use the same tmc_wait_for_tmcready() function.
> >
> > +1
> >
> > I am fine with pushing this change, as it is doing the right thing.
> >
> > >
> > > However - the concern is that this appears to be happening on starting
> > > the ETR - there should be no outstanding AXI operations that cause the
> > > system to not be ready - as we will either be using this the first
> > > time after reset, or we should have successfully stopped and flushed
> > > the ETR from the previous operation. This warrants further
> > > investigation - should we be extending the timeout - which is already
> > > at a rather generous 100uS, and do we also need to check the MemErr
> > > bit in the status register?
> >
> > It would be good to dump the value of TMC_STATUS to see what is going
> > on.
> >
> > >
> > > As James says, we need details of when and how the problem occurs  -
> > > as far as I know it has not been seen on any systems we currently use
> > > (though could have been missed given the current code)
> >
> > +1
> >
> > Kind regards
> > Suzuki
> >
> >
> > >
> > > Regards
> > >
> > > Mike
> > >
> > >
> > > --
> > > Mike Leach
> > > Principal Engineer, ARM Ltd.
> > > Manchester Design Centre. UK
> >