[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200507125357.GA31783@willie-the-truck>
Date: Thu, 7 May 2020 13:53:58 +0100
From: Will Deacon <will@...nel.org>
To: Robin Murphy <robin.murphy@....com>
Cc: Sai Prakash Ranjan <saiprakash.ranjan@...eaurora.org>,
Joerg Roedel <joro@...tes.org>,
Rob Clark <robdclark@...il.com>,
Jordan Crouse <jcrouse@...eaurora.org>,
iommu@...ts.linux-foundation.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-arm-msm@...r.kernel.org
Subject: Re: [PATCH] iomm/arm-smmu: Add stall implementation hook
On Thu, May 07, 2020 at 11:55:54AM +0100, Robin Murphy wrote:
> On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote:
> > On 2020-04-22 01:50, Sai Prakash Ranjan wrote:
> > > Add stall implementation hook to enable stalling
> > > faults on QCOM platforms which supports it without
> > > causing any kind of hardware mishaps. Without this
> > > on QCOM platforms, GPU faults can cause unrelated
> > > GPU memory accesses to return zeroes. This has the
> > > unfortunate result of command-stream reads from CP
> > > getting invalid data, causing a cascade of fail.
>
> I think this came up before, but something about this rationale doesn't add
> up - we're not *using* stalls at all, we're still terminating faulting
> transactions unconditionally; we're just using CFCFG to terminate them with
> a slight delay, rather than immediately. It's really not clear how or why
> that makes a difference. Is it a GPU bug? Or an SMMU bug? Is this reliable
> (or even a documented workaround for something), or might things start
> blowing up again if any other behaviour subtly changes? I'm not dead set
> against adding this, but I'd *really* like to have a lot more confidence in
> it.
Rob mentioned something about the "bus returning zeroes" before, but I agree
that we need more information so that we can reason about this and maintain
the code as the driver continues to change. That needs to be a comment in
the driver, and I don't think "but android seems to work" is a good enough
justification. There was some interaction with HUPCF as well.
As a template, I'd suggest:
> > > diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h
> > > index 8d1cd54d82a6..d5134e0d5cce 100644
> > > --- a/drivers/iommu/arm-smmu.h
> > > +++ b/drivers/iommu/arm-smmu.h
> > > @@ -386,6 +386,7 @@ struct arm_smmu_impl {
> > > int (*init_context)(struct arm_smmu_domain *smmu_domain);
> > > void (*tlb_sync)(struct arm_smmu_device *smmu, int page, int sync,
> > > int status);
/*
* Stall transactions on a context fault, where they will be terminated
* in response to the resulting IRQ rather than immediately. This should
* pretty much always be set to "false" as stalling can introduce the
* potential for deadlock in most SoCs, however it is needed on Qualcomm
* XXXX because YYYY.
*/
> > > + bool stall;
Hmm, the more I think about this, the more I think this is an erratum
workaround in disguise, in which case this could be better named...
Will
Powered by blists - more mailing lists