linux-kernel - Re: [PATCH] iomm/arm-smmu: Add stall implementation hook

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200507125357.GA31783@willie-the-truck>
Date:   Thu, 7 May 2020 13:53:58 +0100
From:   Will Deacon <will@...nel.org>
To:     Robin Murphy <robin.murphy@....com>
Cc:     Sai Prakash Ranjan <saiprakash.ranjan@...eaurora.org>,
        Joerg Roedel <joro@...tes.org>,
        Rob Clark <robdclark@...il.com>,
        Jordan Crouse <jcrouse@...eaurora.org>,
        iommu@...ts.linux-foundation.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-arm-msm@...r.kernel.org
Subject: Re: [PATCH] iomm/arm-smmu: Add stall implementation hook

On Thu, May 07, 2020 at 11:55:54AM +0100, Robin Murphy wrote:
> On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote:
> > On 2020-04-22 01:50, Sai Prakash Ranjan wrote:
> > > Add stall implementation hook to enable stalling
> > > faults on QCOM platforms which supports it without
> > > causing any kind of hardware mishaps. Without this
> > > on QCOM platforms, GPU faults can cause unrelated
> > > GPU memory accesses to return zeroes. This has the
> > > unfortunate result of command-stream reads from CP
> > > getting invalid data, causing a cascade of fail.
> 
> I think this came up before, but something about this rationale doesn't add
> up - we're not *using* stalls at all, we're still terminating faulting
> transactions unconditionally; we're just using CFCFG to terminate them with
> a slight delay, rather than immediately. It's really not clear how or why
> that makes a difference. Is it a GPU bug? Or an SMMU bug? Is this reliable
> (or even a documented workaround for something), or might things start
> blowing up again if any other behaviour subtly changes? I'm not dead set
> against adding this, but I'd *really* like to have a lot more confidence in
> it.

Rob mentioned something about the "bus returning zeroes" before, but I agree
that we need more information so that we can reason about this and maintain
the code as the driver continues to change. That needs to be a comment in
the driver, and I don't think "but android seems to work" is a good enough
justification. There was some interaction with HUPCF as well.

As a template, I'd suggest:

> > > diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h
> > > index 8d1cd54d82a6..d5134e0d5cce 100644
> > > --- a/drivers/iommu/arm-smmu.h
> > > +++ b/drivers/iommu/arm-smmu.h
> > > @@ -386,6 +386,7 @@ struct arm_smmu_impl {
> > >      int (*init_context)(struct arm_smmu_domain *smmu_domain);
> > >      void (*tlb_sync)(struct arm_smmu_device *smmu, int page, int sync,
> > >               int status);

/*
 * Stall transactions on a context fault, where they will be terminated
 * in response to the resulting IRQ rather than immediately. This should
 * pretty much always be set to "false" as stalling can introduce the
 * potential for deadlock in most SoCs, however it is needed on Qualcomm
 * XXXX because YYYY.
 */

> > > +    bool stall;

Hmm, the more I think about this, the more I think this is an erratum
workaround in disguise, in which case this could be better named...

Will