linux-kernel - Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one

Message-ID: <CAD=FV=WL9ZWz-nGzP0CWWNVFuG8fnSsA8R906B10J-X6_jwMLg@mail.gmail.com>
Date:   Wed, 16 Dec 2020 14:42:08 -0800
From:   Doug Anderson <dianders@...omium.org>
To:     Stephen Boyd <swboyd@...omium.org>
Cc:     Mark Brown <broonie@...nel.org>, msavaliy@....qualcomm.com,
        Akash Asthana <akashast@...eaurora.org>,
        Roja Rani Yarubandi <rojay@...eaurora.org>,
        Alok Chauhan <alokc@...eaurora.org>,
        Andy Gross <agross@...nel.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        linux-arm-msm <linux-arm-msm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-spi <linux-spi@...r.kernel.org>
Subject: Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer
 is done before new one

Hi,

On Tue, Dec 15, 2020 at 5:18 PM Stephen Boyd <swboyd@...omium.org> wrote:
>
> Quoting Doug Anderson (2020-12-15 15:34:59)
> > On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@...omium.org> wrote:
> > >
> > > Quoting Doug Anderson (2020-12-15 09:25:51)
> > > > In general when we're starting a new transfer we assume that we can
> > > > program the hardware willy-nilly.  If there's some chance something
> > > > else is happening (or our interrupt could go off) then it breaks that
> > > > whole model.
> > >
> > > Right. I thought this patch was making sure that the hardware wasn't in
> > > the process of doing something else when we set up the transfer. I'm
> > > saying that only checking the irq misses the fact that maybe the
> > > transfer hasn't completed yet or a pending irq hasn't come in yet, but
> > > the fifo status would tell us that the fifo is transferring something or
> > > receiving something. If an RX can't happen, then the code should clearly
> > > show that an RX irq isn't expected, and mask out that bit so it is
> > > ignored or explicitly check for it and call WARN_ON() if the bit is set.
> > >
> > > I'm wondering why we don't check the FIFO status and the irq bits to
> > > make sure that some previous cancelled operation isn't still pending
> > > either in the FIFO or as an irq. While this patch will fix the scenario
> > > where the irq is delayed but pending in the hardware it won't cover the
> > > case that the hardware itself is wedged, for example because the
> > > sequencer just decided to stop working entirely.
> >
> > It also won't catch the case where the SoC decided that all GPIOs are
> > inverted and starts reporting highs for lows and lows for highs, nor
> > does it handle the case where the CPU suddenly switches to Big Endian
> > mode for no reason.  :-P
> >
> > ...by that, I mean I'm not trying to catch the case where the hardware
> > itself is behaving in a totally unexpected way.  I have seen no
> > instances where the hardware wedges nor where the sequencer stops
> > working and until I see them happen I'm not inclined to add code for
> > them.  Without seeing them actually happen I'm not really sure what
> > the right way to recover would be.  We've already tried "cancel" and
> > "abort" and then waited at least 1 second.  If you know of some sort
> > of magic "unwedge" then we should add it into handle_fifo_timeout().
>
> I am not aware of an "unwedge" command. Presumably the cancel/abort
> stuff makes the FIFO state "sane" so there's nothing to see in the FIFO
> status registers. I wonder if we should keep around some "did we cancel
> last time?" flag and only check the isr if we canceled out and timed
> out to boot? That would be a cheap and easy check to make sure that we
> don't check this each transaction.

Sure.  I guess technically it would be a "did we fail to cancel last time".
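
To make the idea concrete, here is a minimal userspace sketch of such a flag. It is not the driver code and not what v2 does: the struct, the field names (m_irq_status, abort_failed) and both functions are hypothetical stand-ins, and the register is modeled as a plain variable. It only illustrates that the pending-interrupt check can be skipped on every transfer unless the previous cancel and abort both timed out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver state; not the real driver struct. */
struct fake_spi_geni {
	uint32_t m_irq_status;	/* models the main sequencer IRQ status register */
	bool abort_failed;	/* set when both cancel and abort timed out */
};

/*
 * Models the tail of the timeout path: cancel and abort have both been
 * attempted, and the caller reports whether either one completed.
 */
static void fake_handle_fifo_timeout(struct fake_spi_geni *mas,
				     bool cancel_done, bool abort_done)
{
	if (!cancel_done && !abort_done)
		mas->abort_failed = true;
}

/* Returns true if it is safe to program the hardware for a new transfer. */
static bool fake_ok_to_start(struct fake_spi_geni *mas)
{
	/* Common case: the last transfer ended cleanly, so skip the check. */
	if (!mas->abort_failed)
		return true;

	/* An interrupt from the failed abort is still pending; do not start. */
	if (mas->m_irq_status)
		return false;

	/* The stale interrupt has been serviced; clear the flag. */
	mas->abort_failed = false;
	return true;
}
```

With this shape the normal path costs one branch on a cached flag, and the status check only happens after a failed abort.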


> > However, super delayed interrupts due to software not servicing the
> > interrupt in time is something that really happens, if rarely.  Adding
> > code to account for that seems worth it and is easy to test...
> >
>
> Agreed. The function name is wrong then as the device is not "busy". So
> maybe spi_geni_isr_pending()? That would clearly describe what's being
> checked.

I changed this to just be about the abort.  See if v2 looks better to you.