Message-Id: <93E0CE98-FF69-46EE-A45F-C5C75A0673C4@gmail.com>
Date: Thu, 8 Oct 2015 11:10:56 -0400
From: Jean-Francois Dagenais <jeff.dagenais@...il.com>
To: Mark Brown <broonie@...nel.org>,
Michal Simek <michal.simek@...inx.com>
Cc: linux-kernel@...r.kernel.org, Michal Simek <monstr@...str.eu>,
Peter Crosthwaite <pcrost@...inx.com>,
Peter Crosthwaite <peter.crosthwaite@...alogix.com>,
Grant Likely <grant.likely@...aro.org>,
spi-devel-general@...ts.sourceforge.net,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 1/3] spi: spi-xilinx: Remove ISR race condition
> On Jun 4, 2013, at 1:32 PM, Mark Brown <broonie@...nel.org> wrote:
>
> Applied but this is a bit sad, having to defer the refill to process
> context means that we're adding extra latency which takes us further
> away from being able to saturate the bus. There ought to be a way to
> avoid the issue though I can't think of a non-racy one - I guess level
> triggered interrupts aren't an option?
Hi all,
We've been using an improved version of this driver in production for years
here, on the 3.4 tree. I had to intervene to improve performance, and managed
to double it (not bad).
Like the other threads I've seen about this, my strategy involved limiting
register reads, which are rather costly through a PCI-e link. Here's the
as-yet-unsubmitted commit message I carry in my clone:
=============
spi: xilinx - minimize iomem reads
If this IP core is accessed through bridges like PCI-e, reads are rather
costly. Doing many reads or read-modify-writes is thus slow and hard on the
CPU (busy waiting).
The transfer workflow of this driver allows some assumptions to be made and
exploited to minimize reads as much as possible.
These two assumptions are made:
- since we are in control of the CR register, cache it so we don't have to
read it back every time we modify it.
- the FIFOs (depth of either 1 or 16) are always filled to capacity, unless
fewer bytes than the FIFO depth remain. We can therefore predict the state
of the FIFOs without reading the status register to check whether they are
empty or full.
Reading a 32 Mb flash at 32 MHz on a core with 8 bits/word and the FIFOs
enabled (depth 16), the transfer time has dropped from ~60s to ~30s. Still
high, but better. Most of the remaining delay comes from emptying the RX
FIFO one "word" at a time.
===============
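For illustration, the shadow-CR idea from the first assumption could look
roughly like this. This is a minimal sketch, not the actual patch: register
offsets, bit positions, and helper names are all assumptions on my part.

#include <linux/io.h>
#include <linux/bitops.h>

#define XSPI_CR_OFFSET		0x60	/* assumed control reg offset */
#define XSPI_CR_TX_INHIBIT	BIT(8)	/* assumed TX inhibit bit */

struct xspi_priv {
	void __iomem *regs;
	u32 cr_shadow;	/* cached copy of the control register */
};

/* Modify CR through the cache: one posted write, no read-back
 * over the PCI-e bridge. */
static void xspi_cr_set(struct xspi_priv *xspi, u32 bits)
{
	xspi->cr_shadow |= bits;
	iowrite32(xspi->cr_shadow, xspi->regs + XSPI_CR_OFFSET);
}

static void xspi_cr_clear(struct xspi_priv *xspi, u32 bits)
{
	xspi->cr_shadow &= ~bits;
	iowrite32(xspi->cr_shadow, xspi->regs + XSPI_CR_OFFSET);
}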
Today I am merging from 3.4 to 3.14 and then 3.19 and am dealing with merge
conflicts.
I have to say we have never seen any race here, and I am trying to understand
how the race would have happened. The SPI master is locked against concurrent
users, and the only place the interrupt is enabled is the txrx_bufs method. As
I understand it, the interrupt is not declared as IRQF_SHARED, so the driver's
ISR only runs because of an actual TX empty event (the only interrupt flag we
enable).
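Schematically, the flow I mean is something like this (again purely a sketch:
the register offsets and helpers are made up, and tx_ptr/rx_ptr/
remaining_bytes plus the completion are assumed fields added to the xspi_priv
sketched above):

#include <linux/completion.h>

#define XSPI_IPIER_OFFSET	0x28	/* assumed: interrupt enable reg */
#define XSPI_INTR_TX_EMPTY	BIT(2)	/* assumed: TX empty event bit */

/* Transfer start: the only place the TX empty interrupt is enabled.
 * The SPI core serializes callers, so no other context runs this
 * while a transfer is in flight. */
static int xspi_txrx_bufs(struct xspi_priv *xspi, const void *tx,
			  void *rx, unsigned int len)
{
	xspi->tx_ptr = tx;
	xspi->rx_ptr = rx;
	xspi->remaining_bytes = len;
	reinit_completion(&xspi->done);

	xspi_fill_tx_fifo(xspi);		/* hypothetical helper */
	iowrite32(XSPI_INTR_TX_EMPTY, xspi->regs + XSPI_IPIER_OFFSET);
	xspi_cr_clear(xspi, XSPI_CR_TX_INHIBIT); /* start shifting */

	wait_for_completion(&xspi->done);
	return len;
}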
This is where I suspect the interrupt controller's behaviour *might* come into
play. I have not worked with many types of IRQ controllers, so please feel
free to educate me here. We are using level-triggered interrupts. While the
ISR runs, the IRQ line is masked, so any new interrupt event stays pending.
When the ISR returns IRQ_HANDLED, the IRQ is unmasked and the ISR is
re-invoked as needed.
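And the ISR side of the picture, again purely schematic and building on the
sketches above (the status register offset and the drain/fill helpers are
assumptions):

#include <linux/interrupt.h>

#define XSPI_IPISR_OFFSET	0x20	/* assumed: interrupt status reg */

static irqreturn_t xspi_irq(int irq, void *dev_id)
{
	struct xspi_priv *xspi = dev_id;
	u32 ipisr = ioread32(xspi->regs + XSPI_IPISR_OFFSET);

	/* Clear the event immediately upon entry (write-to-clear). */
	iowrite32(ipisr, xspi->regs + XSPI_IPISR_OFFSET);

	if (ipisr & XSPI_INTR_TX_EMPTY) {
		xspi_cr_set(xspi, XSPI_CR_TX_INHIBIT);	/* pause TX */
		xspi_drain_rx_fifo(xspi);	/* hypothetical helper */
		if (!xspi->remaining_bytes) {
			complete(&xspi->done);
			return IRQ_HANDLED;
		}
		xspi_fill_tx_fifo(xspi);
		/* Re-enable TX last. If TX empties again before we
		 * return, the level-triggered line is simply left
		 * pending and the handler runs again afterwards. */
		xspi_cr_clear(xspi, XSPI_CR_TX_INHIBIT);
	}

	return IRQ_HANDLED;
}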
In this context, I fail to see how the sudden interrupt event caused by
disabling the TX inhibitor, i.e. enabling TX, could cause a race. The ISR
invocation would only be deferred until the current ISR completes.
I guess if you were to use edge-triggered interrupts without a latched
register you could miss the event, but why would anyone use such an interrupt
controller? (if you can call it that :-P)
So if we go back to the commit message about the race:
> On Jun 4, 2013, at 10:02 AM, Michal Simek <michal.simek@...inx.com> wrote:
>
> The ISR currently consumes the rx buffer data and re-enables transmission
> from within interrupt context. This is bad because if the interrupt
> occurs again before the ISR exits, the new interrupt will be erroneously
> cleared by the still completing ISR.
So regardless of the interrupt controller, I can't see how this event is even
possible, given that the ISR clears the interrupt immediately upon entry, and
the only code left to execute by the ISR after the TX FIFO is filled and the
TX inhibitor is disabled is "return IRQ_HANDLED"... ???
Thank you for your help!
/jfd