Message-ID: <159244122443.62212.12188134609709828571@swboyd.mtv.corp.google.com>
Date:   Wed, 17 Jun 2020 17:47:04 -0700
From:   Stephen Boyd <swboyd@...omium.org>
To:     Doug Anderson <dianders@...omium.org>
Cc:     Mark Brown <broonie@...nel.org>,
        Alok Chauhan <alokc@...eaurora.org>, skakit@...eaurora.org,
        Andy Gross <agross@...nel.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Dilip Kota <dkota@...eaurora.org>,
        linux-arm-msm <linux-arm-msm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-spi <linux-spi@...r.kernel.org>
Subject: Re: [PATCH v3 2/5] spi: spi-geni-qcom: Mo' betta locking

Quoting Doug Anderson (2020-06-17 14:19:29)
> On Wed, Jun 17, 2020 at 1:53 PM Stephen Boyd <swboyd@...omium.org> wrote:
> >
> > Quoting Douglas Anderson (2020-06-16 03:40:47)
> > > If you added a bit of a delay (like a trace_printk) into the ISR for
> > > the spi-geni-qcom driver, you would suddenly start seeing some errors
> > > spit out.  The problem was that, though the ISR itself held a lock,
> > > other parts of the driver didn't always grab the lock.
> > >
> > > One example race was this:
> > > a) Driver queues off a command to set a Chip Select (CS).
> > > b) ISR fires indicating the CS is done.
> > > c) Done bit is set, so we complete().
> > > d) Second CPU gallops off and starts a transfer.
> > > e) Second CPU starts messing with hardware / software state (not under
> > >    spinlock).
> > > f) ISR now does things like set "mas->cur_mcmd" to CMD_NONE, prints
> > >    errors if "tx_rem_bytes" / "rx_rem_bytes" have been set, and also
> > >    Acks all interrupts it handled.
> >
> > Can we get a CPU0/CPU1 diagram here? At point e) I got sort of lost. And
> > maybe it's not even a dual CPU problem? i.e. it can happen on one CPU?
> >
> >     CPU0
> >     ----
> >  a) spi_geni_set_cs()
> >      mas->cur_mcmd = CMD_CS
> >      wait_for_completion_timeout(&xfer_done)
> >  b)  <INTERRUPT>
> >      geni_spi_isr()
> >  c)   complete(&xfer_done);
> >      <END INTERRUPT>
> >      pm_runtime_put(mas->dev);
> >  d) galloping?
> >
> > I got lost... Sorry!
> 
> I think you need two CPUs, at least for the race I'm thinking of.
> Maybe this is clearer?

With threaded irqs I think you only need one CPU, but that's just a
minor detail. Drawing it with two CPUs is clearer and easier to
understand.

> 
> CPU1:
> => spi_geni_set_cs() starts
> => spi_geni_set_cs() calls wait_for_completion_timeout(&xfer_done)
> CPU0:
> => geni_spi_isr() starts
> => geni_spi_isr() calls complete(&xfer_done)
> => geni_spi_isr() stalls
> CPU1:
> => spi_geni_set_cs() call to wait_for_completion_timeout() finishes
> => spi_geni_set_cs() exits.
> => spi_geni_transfer_one() is called
> => spi_geni_transfer_one() calls setup_fifo_xfer()
> => setup_fifo_xfer() sets "cur_mcmd"
> => setup_fifo_xfer() sets "tx_rem_bytes"
> => setup_fifo_xfer() sets "rx_rem_bytes"
> => setup_fifo_xfer() kicks off a transfer
> CPU0:
> => geni_spi_isr() finishes stalling
> => geni_spi_isr() sets "cur_mcmd" to NULL
> => geni_spi_isr() checks "tx_rem_bytes" to confirm it's 0.
> => geni_spi_isr() checks "rx_rem_bytes" to confirm it's 0.
> => geni_spi_isr() clears any "DONE" interrupt that is pending
> 
> I can update the commit message to have that if it's helpful and makes
> sense.  In the above example I have a fake "stall" that wouldn't
> really happen, but in general if adding a delay somewhere creates a
> race condition then the race condition was there anyway.  Also, with
> weakly ordered memory it's possible that a write on one CPU could
> clobber a write made by another CPU even if they happened in opposite
> orders.
> 
> The race is fixed by my patch because when CPU1 starts
> setup_fifo_xfer() it won't be able to grab the spinlock until the ISR
> is totally done.

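Right. In code terms the race is over fields that both paths write
with no lock held. Condensed down (this is a sketch, not the actual
driver source -- the signatures follow mainline spi-geni-qcom, but the
bodies and byte counts are simplified for illustration):

  /* CPU1: runs as soon as wait_for_completion_timeout() returns */
  static void setup_fifo_xfer(struct spi_transfer *xfer,
                              struct spi_geni_master *mas,
                              u16 mode, struct spi_master *spi)
  {
          mas->cur_mcmd = CMD_XFER;
          mas->tx_rem_bytes = xfer->len;          /* simplified */
          mas->rx_rem_bytes = xfer->len;          /* simplified */
          geni_se_setup_m_cmd(&mas->se, SPI_FULL_DUPLEX, 0);
  }

  /* CPU0: still inside the interrupt for the earlier CS command */
  static irqreturn_t geni_spi_isr(int irq, void *data)
  {
          struct spi_geni_master *mas = data;     /* simplified */

          complete(&mas->xfer_done);      /* CPU1 wakes up here... */
          /* ...and can run all of setup_fifo_xfer() in this window */
          mas->cur_mcmd = CMD_NONE;       /* clobbers CPU1's CMD_XFER */
          /* ...and the final irq ack can eat the new command's irq */
          return IRQ_HANDLED;
  }
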
Ok. Then the diagram would look like this:

  CPU0                                         CPU1
  ----                                         ----
  spi_geni_set_cs()
   mas->cur_mcmd = CMD_CS;
   geni_se_setup_m_cmd(...)
   wait_for_completion_timeout(&xfer_done);
                                              <INTERRUPT>
                                               geni_spi_isr()
                                                complete(&xfer_done);
   <wakeup>
   pm_runtime_put(mas->dev);
  ... // back to SPI core
  spi_geni_transfer_one()
   setup_fifo_xfer()  
    mas->cur_mcmd = CMD_XFER;
                                                mas->cur_mcmd = CMD_NONE; // bad!
                                                return IRQ_HANDLED;

Time flows down, as it usually does in these diagrams. No need to put
stalls in. Reading all those prose lines while keeping track of which
CPU each one runs on makes it harder for me to see the two sides
running in parallel the way the diagram above shows them.
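
For the archives, the fix amounts to both paths taking the same
spinlock, so the ISR's state teardown and setup_fifo_xfer()'s state
setup can no longer interleave; the lock's acquire/release semantics
also give the cross-CPU write ordering you mention. Again only a
sketch against the driver's existing mas->lock, not the actual patch:

  static irqreturn_t geni_spi_isr(int irq, void *data)
  {
          struct spi_geni_master *mas = data;     /* simplified */
          unsigned long flags;

          spin_lock_irqsave(&mas->lock, flags);
          /*
           * Ack the interrupt, update tx_rem_bytes / rx_rem_bytes,
           * set cur_mcmd = CMD_NONE and call complete() -- all of
           * the ISR's state handling sits inside this critical
           * section now.
           */
          spin_unlock_irqrestore(&mas->lock, flags);
          return IRQ_HANDLED;
  }

  static void setup_fifo_xfer(struct spi_transfer *xfer,
                              struct spi_geni_master *mas,
                              u16 mode, struct spi_master *spi)
  {
          unsigned long flags;

          spin_lock_irqsave(&mas->lock, flags);
          mas->cur_mcmd = CMD_XFER;
          mas->tx_rem_bytes = xfer->len;          /* simplified */
          mas->rx_rem_bytes = xfer->len;          /* simplified */
          geni_se_setup_m_cmd(&mas->se, SPI_FULL_DUPLEX, 0);
          spin_unlock_irqrestore(&mas->lock, flags);
  }

With that, CPU1 sits at the spin_lock_irqsave() in setup_fifo_xfer()
until the ISR drops the lock, which is exactly the serialization you
describe.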
