Message-ID: <CAN6iL-QqZXsFDB=3yCfqQeF0H5QaS_Trm62FxvDF-+qPoQ-VNA@mail.gmail.com>
Date: Tue, 18 Jun 2024 21:17:16 +0530
From: Pranjal Shrivastava <praan@...gle.com>
To: Rob Clark <robdclark@...il.com>
Cc: Robin Murphy <robin.murphy@....com>, iommu@...ts.linux.dev,
linux-arm-msm@...r.kernel.org, Stephen Boyd <swboyd@...omium.org>,
Rob Clark <robdclark@...omium.org>, Will Deacon <will@...nel.org>, Joerg Roedel <joro@...tes.org>,
Jason Gunthorpe <jgg@...pe.ca>, Jerry Snitselaar <jsnitsel@...hat.com>,
Krzysztof Kozlowski <krzysztof.kozlowski@...aro.org>,
Dmitry Baryshkov <dmitry.baryshkov@...aro.org>,
"moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@...ts.infradead.org>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] iommu/arm-smmu: Pretty-print context fault related regs
On Tue, Jun 18, 2024 at 8:28 PM Rob Clark <robdclark@...il.com> wrote:
>
> On Mon, Jun 17, 2024 at 10:33 AM Robin Murphy <robin.murphy@....com> wrote:
> >
> > On 2024-06-17 5:18 pm, Rob Clark wrote:
> > > On Mon, Jun 17, 2024 at 6:07 AM Robin Murphy <robin.murphy@....com> wrote:
> > >>
> > >> On 04/06/2024 4:01 pm, Rob Clark wrote:
> > >>> From: Rob Clark <robdclark@...omium.org>
> > >>>
> > >>> Parse out the bitfields for easier-to-read fault messages.
> > >>>
> > >>> Signed-off-by: Rob Clark <robdclark@...omium.org>
> > >>> ---
> > >>> Stephen was wanting easier to read fault messages.. so I typed this up.
> > >>>
> > >>> Resend with the new iommu list address
> > >>>
> > >>> drivers/iommu/arm/arm-smmu/arm-smmu.c | 53 +++++++++++++++++++++++++--
> > >>> drivers/iommu/arm/arm-smmu/arm-smmu.h | 5 +++
> > >>> 2 files changed, 54 insertions(+), 4 deletions(-)
> > >>>
> > >>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > >>> index c572d877b0e1..06712d73519c 100644
> > >>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > >>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > >>> @@ -411,6 +411,8 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > >>> unsigned long iova;
> > >>> struct arm_smmu_domain *smmu_domain = dev;
> > >>> struct arm_smmu_device *smmu = smmu_domain->smmu;
> > >>> + static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> > >>> + DEFAULT_RATELIMIT_BURST);
> > >>> int idx = smmu_domain->cfg.cbndx;
> > >>> int ret;
> > >>>
> > >>> @@ -425,10 +427,53 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > >>> ret = report_iommu_fault(&smmu_domain->domain, NULL, iova,
> > >>> fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
> > >>>
> > >>> - if (ret == -ENOSYS)
> > >>> - dev_err_ratelimited(smmu->dev,
> > >>> - "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
> > >>> - fsr, iova, fsynr, cbfrsynra, idx);
> > >>> + if (ret == -ENOSYS && __ratelimit(&rs)) {
> > >>> + static const struct {
> > >>> + u32 mask; const char *name;
> > >>> + } fsr_bits[] = {
> > >>> + { ARM_SMMU_FSR_MULTI, "MULTI" },
> > >>> + { ARM_SMMU_FSR_SS, "SS" },
> > >>> + { ARM_SMMU_FSR_UUT, "UUT" },
> > >>> + { ARM_SMMU_FSR_ASF, "ASF" },
> > >>> + { ARM_SMMU_FSR_TLBLKF, "TLBLKF" },
> > >>> + { ARM_SMMU_FSR_TLBMCF, "TLBMCF" },
> > >>> + { ARM_SMMU_FSR_EF, "EF" },
> > >>> + { ARM_SMMU_FSR_PF, "PF" },
> > >>> + { ARM_SMMU_FSR_AFF, "AFF" },
> > >>> + { ARM_SMMU_FSR_TF, "TF" },
> > >>> + }, fsynr0_bits[] = {
> > >>> + { ARM_SMMU_FSYNR0_WNR, "WNR" },
> > >>> + { ARM_SMMU_FSYNR0_PNU, "PNU" },
> > >>> + { ARM_SMMU_FSYNR0_IND, "IND" },
> > >>> + { ARM_SMMU_FSYNR0_NSATTR, "NSATTR" },
> > >>> + { ARM_SMMU_FSYNR0_PTWF, "PTWF" },
> > >>> + { ARM_SMMU_FSYNR0_AFR, "AFR" },
> > >>> + };
> > >>> +
> > >>> + pr_err("%s %s: Unhandled context fault: fsr=0x%x (",
> > >>> + dev_driver_string(smmu->dev), dev_name(smmu->dev), fsr);
> > >>> +
> > >>> + for (int i = 0, n = 0; i < ARRAY_SIZE(fsr_bits); i++) {
> > >>> + if (fsr & fsr_bits[i].mask) {
> > >>> + pr_cont("%s%s", (n > 0) ? "|" : "", fsr_bits[i].name);
> > >>
> > >> Given that SMMU faults have a high likelihood of correlating with other
> > >> errors, e.g. the initiating device also reporting that it got an abort
> > >> back, this much pr_cont is a recipe for an unreadable mess. Furthermore,
> > >> just imagine how "helpful" this would be when faults in two contexts are
> > >> reported by two different CPUs at the same time ;)
> > >
> > > > It looks like arm_smmu_context_fault() is only used with non-threaded
> > > > IRQs. And this fallback is only used if the driver doesn't register its
> > > > own fault handler. So I don't think this will be a problem.
> >
> > You don't think two different IRQs can fire on two different CPUs at the
> > same time (or close to)? It's already bad enough when multiple CPUs
> > panic and one has to pick apart line-by-line interleaving of the
> > registers/stacktraces - imagine how much more utterly unusable that
> > would be with bits of different dumps randomly pr_cont'ed together onto
> > the same line(s)...
>
> _different_ IRQs, yes
>
> Anyways, the reason for pr_cont() was that there wasn't another
> reasonable way to decide where separator chars were needed with a
> single pr_err(). I could instead construct a string on stack and
> print that in a single call, but pr_cont() seemed like the more
> reasonable alternative.
>
> BR,
> -R
The string approach sounds good to me. If possible, let's break this
out into a helper function, something like `arm_smmu_log_ctx_fault`,
and gate it behind a module parameter. I'm not sure whether this
needs a new Kconfig option; I'd like Robin's opinion on that.
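To make the string approach concrete, here is a rough userspace sketch
of building the decoded bit names in a stack buffer so the whole fault
line can go out in a single print call. Everything named here
(`decode_bits`, `format_ctx_fault`, the `EX_*` masks and their bit
positions) is hypothetical and only for illustration; in the driver
this would use the ARM_SMMU_FSR_* masks from arm-smmu.h and end in one
dev_err_ratelimited() call. It has not been built against the kernel
tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical masks for illustration; the real ones are ARM_SMMU_FSR_*. */
#define EX_FSR_MULTI (1u << 31)
#define EX_FSR_SS    (1u << 30)
#define EX_FSR_PF    (1u << 3)
#define EX_FSR_TF    (1u << 1)

#define EX_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct bit_name {
	uint32_t mask;
	const char *name;
};

static const struct bit_name ex_fsr_bits[] = {
	{ EX_FSR_MULTI, "MULTI" },
	{ EX_FSR_SS,    "SS" },
	{ EX_FSR_PF,    "PF" },
	{ EX_FSR_TF,    "TF" },
};

/*
 * Write the names of the set bits into buf as "A|B|C". The result is
 * always NUL-terminated; decoding stops early if buf is too small.
 */
static const char *decode_bits(uint32_t val, const struct bit_name *tbl,
			       size_t n, char *buf, size_t len)
{
	size_t off = 0;

	assert(len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		int ret;

		if (!(val & tbl[i].mask))
			continue;

		/* Emit a separator only after the first name. */
		ret = snprintf(buf + off, len - off, "%s%s",
			       off ? "|" : "", tbl[i].name);
		if (ret < 0 || (size_t)ret >= len - off)
			break; /* out of space: keep what fits */
		off += (size_t)ret;
	}
	return buf;
}

/* Format the complete fault message into line with one snprintf(). */
static int format_ctx_fault(char *line, size_t len, uint32_t fsr,
			    unsigned long iova, int cb)
{
	char bits[64];

	return snprintf(line, len,
			"Unhandled context fault: fsr=0x%x (%s), iova=0x%08lx, cb=%d",
			fsr,
			decode_bits(fsr, ex_fsr_bits,
				    EX_ARRAY_SIZE(ex_fsr_bits),
				    bits, sizeof(bits)),
			iova, cb);
}
```

Since the separators are resolved while filling the buffer, the caller
can print the finished line in one call, and it cannot be interleaved
with output from another CPU the way a chain of pr_cont() calls can.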
Thanks,
Pranjal
>
> > Even when unrelated stuff gets interleaved because other CPUs just
> > happen to be calling printk() at the same time for unrelated reasons
> > it's still annoying, and pr_cont makes a bigger mess than not.
> > >> I'd prefer to retain the original message as-is, so there is at least
> > >> still an unambiguous "atomic" view of a fault's entire state, then
> > >> follow it with a decode more in the style of arm64's ESR logging. TBH I
> > >> also wouldn't disapprove of hiding the additional decode behind a
> > >> command-line/runtime parameter, since a fault storm can cripple a system
> > >> enough as it is, without making the interrupt handler spend even longer
> > >> printing to a potentially slow console.
> > >
> > > It _is_ ratelimited. But we could perhaps use a higher loglevel (pr_debug?)
> >
> > Yeah, I'd have no complaint with pr_debug/dev_dbg either, if that suits
> > your use case. True that the ratelimit may typically mitigate the
> > overall impact, but still in the worst case with a sufficiently slow
> > console and/or a sufficiently large amount to print per __ratelimit()
> > call, it can end up being slow enough to stay below the threshold. Don't
> > ask me how I know that :)
> >
> > Thanks,
> > Robin.