[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <417c4e3a-6fa8-5210-dd27-669a5d9e255c@arm.com>
Date: Wed, 25 Jan 2023 10:27:50 +0000
From: Suzuki K Poulose <suzuki.poulose@....com>
To: Yabin Cui <yabinc@...gle.com>
Cc: Mike Leach <mike.leach@...aro.org>,
James Clark <james.clark@....com>, coresight@...ts.linaro.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Mathieu Poirier <mathieu.poirier@...aro.org>,
Leo Yan <leo.yan@...aro.org>
Subject: Re: [PATCH] coresight: tmc-etr: Don't enable ETR when it's not ready
Folks,
On 24/01/2023 20:09, Yabin Cui wrote:
> Ping for review. And I still can't reproduce it, even if I reduced the
> timeout to 2us and tried different workloads. Any suggestions for how
> to reproduce it?
>
I think we should go ahead and fix this in the driver to handle flaky
hardware cases. But I would like this to be addressed for all the TMC
types, not just ETRs, as Mike pointed out.
Thanks
Suzuki
> Thanks,
> Yabin
>
> On Tue, Jan 10, 2023 at 1:06 PM Yabin Cui <yabinc@...gle.com> wrote:
>>
>>> Do you have a reproducer for this or some more info?
>>> For example is it a regression or has it always been there? And on which
>>> platform.
>>
>> It happens on Pixel 6 and 7. We collect ETM data periodically from some
>> internal dogfood devices. The problem has happened several times on
>> dogfood devices. But I am still trying to reproduce it locally.
>>
>> We use the scatter-gather mode of ETR, and allocate a 4M buffer. In userspace,
>> we use simpleperf in Android to collect system wide ETM data. What is special
>> is, simpleperf disables and reenables perf events every 100ms to flush ETM
>> data to perf aux buffer.
>>
>> Pixel 6 and 7 have hardware monitoring AXI traffic. The hardware finds ETR is
>> trying to read from or write to a low invalid address (like 0x2E0000). The
>> problem always happens right after the "tmc_etr: timeout while waiting for TMC
>> to be Ready" message. And in almost all cases, I can find a "timeout while
>> waiting for completion of Manual Flush" message from the previous session.
>>
>> One log history is below:
>> [11484.610008][ C0] coresight tmc_etr0: timeout while waiting for
>> completion of Manual Flush
>> [11484.610177][ C0] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> [11484.615367][ C0] coresight tmc_etr0: timeout while waiting for
>> completion of Manual Flush
>> [11484.615534][ C0] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> [12089.486044][ C0] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> AXI error report reading from invalid address
>>
>> Another log history is below:
>> [76709.382650][ C5] coresight tmc_etf1: timeout while waiting for
>> TMC to be Ready
>> [76709.382852][ C7] coresight tmc_etr0: timeout while waiting for
>> completion of Manual Flush
>> [76709.382995][ C7] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> [76709.384510][ C7] coresight tmc_etr0: timeout while waiting for
>> completion of Manual Flush
>> [76709.384649][ C7] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> [76709.384838][ C0] coresight tmc_etr0: timeout while waiting for
>> TMC to be Ready
>> AIX error report writing to invalid address
>>
>> It seems if the previous manual flush doesn't finish gracefully, ETR may not be
>> ready for the next enable (even after 10min as in the first log). And if we
>> continue to enable ETR, an invalid AXI IO may happen.
>>
>> Thanks,
>> Yabin
>>
>> On Tue, Jan 10, 2023 at 10:04 AM Suzuki K Poulose
>> <suzuki.poulose@....com> wrote:
>>>
>>> On 10/01/2023 17:48, Mike Leach wrote:
>>>> Hi,
>>>>
>>>> On Tue, 10 Jan 2023 at 09:30, James Clark <james.clark@....com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 09/01/2023 23:43, Yabin Cui wrote:
>>>>>> Otherwise, it may cause error in AXI bus and result in a kernel panic.
>>>>>
>>>>> Hi Yabin,
>>>>>
>>>>> Thanks for the fix. Do you have a reproducer for this or some more info?
>>>>> For example is it a regression or has it always been there? And on which
>>>>> platform.
>>>>>
>>>>> Thanks
>>>>> James
>>>>>
>>>>>>
>>>>>> Signed-off-by: Yabin Cui <yabinc@...gle.com>
>>>>>> ---
>>>>>> .../hwtracing/coresight/coresight-tmc-core.c | 4 +++-
>>>>>> .../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++---
>>>>>> drivers/hwtracing/coresight/coresight-tmc.h | 2 +-
>>>>>> 3 files changed, 19 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-core.c b/drivers/hwtracing/coresight/coresight-tmc-core.c
>>>>>> index 07abf28ad725..c106d142e632 100644
>>>>>> --- a/drivers/hwtracing/coresight/coresight-tmc-core.c
>>>>>> +++ b/drivers/hwtracing/coresight/coresight-tmc-core.c
>>>>>> @@ -31,7 +31,7 @@ DEFINE_CORESIGHT_DEVLIST(etb_devs, "tmc_etb");
>>>>>> DEFINE_CORESIGHT_DEVLIST(etf_devs, "tmc_etf");
>>>>>> DEFINE_CORESIGHT_DEVLIST(etr_devs, "tmc_etr");
>>>>>>
>>>>>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
>>>>>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
>>>>>> {
>>>>>> struct coresight_device *csdev = drvdata->csdev;
>>>>>> struct csdev_access *csa = &csdev->access;
>>>>>> @@ -40,7 +40,9 @@ void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
>>>>>> if (coresight_timeout(csa, TMC_STS, TMC_STS_TMCREADY_BIT, 1)) {
>>>>>> dev_err(&csdev->dev,
>>>>>> "timeout while waiting for TMC to be Ready\n");
>>>>>> + return -EBUSY;
>>>>>> }
>>>>>> + return 0;
>>>>>> }
>>>>>>
>>>>>> void tmc_flush_and_stop(struct tmc_drvdata *drvdata)
>>>>>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
>>>>>> index 867ad8bb9b0c..2da99dd41ed6 100644
>>>>>> --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
>>>>>> +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
>>>>>> @@ -983,15 +983,21 @@ static void tmc_sync_etr_buf(struct tmc_drvdata *drvdata)
>>>>>> etr_buf->ops->sync(etr_buf, rrp, rwp);
>>>>>> }
>>>>>>
>>>>>> -static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
>>>>>> +static int __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
>>>>>> {
>>>>>> u32 axictl, sts;
>>>>>> struct etr_buf *etr_buf = drvdata->etr_buf;
>>>>>> + int rc = 0;
>>>>>>
>>>>>> CS_UNLOCK(drvdata->base);
>>>>>>
>>>>>> /* Wait for TMCSReady bit to be set */
>>>>>> - tmc_wait_for_tmcready(drvdata);
>>>>>> + rc = tmc_wait_for_tmcready(drvdata);
>>>>>> + if (rc) {
>>>>>> + dev_err(&drvdata->csdev->dev, "not ready ETR isn't enabled\n");
>>>>>> + CS_LOCK(drvdata->base);
>>>>>> + return rc;
>>>>>> + }
>>>>>>
>>>>>> writel_relaxed(etr_buf->size / 4, drvdata->base + TMC_RSZ);
>>>>>> writel_relaxed(TMC_MODE_CIRCULAR_BUFFER, drvdata->base + TMC_MODE);
>>>>>> @@ -1032,6 +1038,7 @@ static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
>>>>>> tmc_enable_hw(drvdata);
>>>>>>
>>>>>> CS_LOCK(drvdata->base);
>>>>>> + return rc;
>>>>>> }
>>>>>>
>>>>>> static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
>>>>>> @@ -1060,7 +1067,12 @@ static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
>>>>>> rc = coresight_claim_device(drvdata->csdev);
>>>>>> if (!rc) {
>>>>>> drvdata->etr_buf = etr_buf;
>>>>>> - __tmc_etr_enable_hw(drvdata);
>>>>>> + rc = __tmc_etr_enable_hw(drvdata);
>>>>>> + if (rc) {
>>>>>> + drvdata->etr_buf = NULL;
>>>>>> + coresight_disclaim_device(drvdata->csdev);
>>>>>> + tmc_etr_disable_catu(drvdata);
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> return rc;
>>>>>> diff --git a/drivers/hwtracing/coresight/coresight-tmc.h b/drivers/hwtracing/coresight/coresight-tmc.h
>>>>>> index 66959557cf39..01c0382a29c0 100644
>>>>>> --- a/drivers/hwtracing/coresight/coresight-tmc.h
>>>>>> +++ b/drivers/hwtracing/coresight/coresight-tmc.h
>>>>>> @@ -255,7 +255,7 @@ struct tmc_sg_table {
>>>>>> };
>>>>>>
>>>>>> /* Generic functions */
>>>>>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
>>>>>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
>>>>>> void tmc_flush_and_stop(struct tmc_drvdata *drvdata);
>>>>>> void tmc_enable_hw(struct tmc_drvdata *drvdata);
>>>>>> void tmc_disable_hw(struct tmc_drvdata *drvdata);
>>>>
>>>> There is no point in waiting for a timeout, and then carrying on even
>>>> when it is exceeded. As such this patch seems reasonable.
>>>> We should also apply the same principle to the ETF and ETB devices
>>>> which use the same tmc_wait_for_tmcready() function.
>>>
>>> +1
>>>
>>> I am fine with pushing this change, as it is doing the right thing.
>>>
>>>>
>>>> However - the concern is that this appears to be happening on starting
>>>> the ETR - there should be no outstanding AXI operations that cause the
>>>> system to not be ready - as we will either be using this the first
>>>> time after reset, or we should have successfully stopped and flushed
>>>> the ETR from the previous operation. This warrants further
>>>> investigation - should we be extending the timeout - which is already
>>>> at a rather generous 100uS, and do we also need to check the MemErr
>>>> bit in the status register?
>>>
>>> It would be good to dump the value of TMC_STATUS to see what is going
>>> on.
>>>
>>>>
>>>> As James says, we need details of when and how the problem occurs -
>>>> as far as I know it has not been seen on any systems we currently use
>>>> (though could have been missed given the current code)
>>>
>>> +1
>>>
>>> Kind regards
>>> Suzuki
>>>
>>>
>>>>
>>>> Regards
>>>>
>>>> Mike
>>>>
>>>>
>>>> --
>>>> Mike Leach
>>>> Principal Engineer, ARM Ltd.
>>>> Manchester Design Centre. UK
>>>
Powered by blists - more mailing lists