linux-kernel - Re: can/should a disabled irq become pending?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d8b985a830b3cd0a01607740966c11a40c0a696d.camel@gmail.com>
Date: Mon, 25 Nov 2024 10:08:02 +0100
From: Nuno Sá <noname.nuno@...il.com>
To: Uwe Kleine-König <u.kleine-koenig@...libre.com>
Cc: Jonathan Cameron <jic23@...nel.org>, Thomas Gleixner
 <tglx@...utronix.de>, 	linux-kernel@...r.kernel.org
Subject: Re: can/should a disabled irq become pending?

On Mon, 2024-11-25 at 09:50 +0100, Uwe Kleine-König wrote:
> Hello,
> 
> On Sun, Nov 24, 2024 at 02:18:27PM +0100, Nuno Sá wrote:
> > On Sat, 2024-11-23 at 11:28 +0000, Jonathan Cameron wrote:
> > > On Thu, 14 Nov 2024 13:04:58 +0100
> > > Nuno Sá <noname.nuno@...il.com> wrote:
> > > 
> > > > On Thu, 2024-11-14 at 11:59 +0100, Uwe Kleine-König wrote:
> > > > > On Thu, Nov 14, 2024 at 08:49:34AM +0100, Nuno Sá wrote:  
> > > > > > On Wed, 2024-11-13 at 16:50 +0100, Thomas Gleixner wrote:  
> > > > > > > On Wed, Nov 13 2024 at 11:34, Nuno Sá wrote:  
> > > > > > > > On Wed, 2024-11-13 at 04:40 +0100, Thomas Gleixner wrote:  
> > > > > > > > > The interrupt does not get to the device handler even in the
> > > > > > > > > lazy
> > > > > > > > > disable case. Once the driver invoked disable_irq*() the low
> > > > > > > > > level
> > > > > > > > > flow
> > > > > > > > > handlers (edge, level ...) mask the interrupt line and marks
> > > > > > > > > the
> > > > > > > > > interrupt pending. enable_irq() retriggers the interrupt when
> > > > > > > > > the
> > > > > > > > > pending bit is set, except when the interrupt line is level
> > > > > > > > > triggered. 
> > > > > > > > 
> > > > > > > > There's something that I'm still trying to figure... For IRQ
> > > > > > > > controllers
> > > > > > > > that not
> > > > > > > > disable edge detection, can't we get the device handler called
> > > > > > > > twice if
> > > > > > > > we
> > > > > > > > don't set
> > > > > > > > unlazy?
> > > > > > > > 
> > > > > > > > irq_enable() - > check_irq_resend()
> > > > > > > > 
> > > > > > > > and then
> > > > > > > > 
> > > > > > > > handle_edge_irq() raised by the controller  
> > > > > > > 
> > > > > > > You're right. We should have a flag which controls the replay
> > > > > > > requirements of an interrupt controller. So far it only skips for
> > > > > > > level
> > > > > > > triggered interrupts, but for those controllers it should skip for
> > > > > > > edge
> > > > > > > too. Something like IRQCHIP_NO_RESEND ...  
> > > > > 
> > > > > Agreed, if the irq gets pending while disabled in both hardware and
> > > > > software, that shouldn't result in two invokations. Is this an issue
> > > > > for
> > > > > level irqs only? For edge irqs this only happens with lazy disable
> > > > > and  
> > > > 
> > > > Resending is already ignore for level...
> > > > 
> > > > > if two events happen. Hm, I guess in that case we still only want a
> > > > > single
> > > > > invokation of the irq handler?
> > > > >   
> > > > > > > > Or is the core handling this somehow? I thought IRQS_REPLAY
> > > > > > > > could be
> > > > > > > > doing the trick but I'm not seeing how...  
> > > > > > > 
> > > > > > > IRQS_REPLAY is just internal state to avoid double replay.
> > > > > > >   
> > > > > > > > > On controllers which suffer from the #2 problem UNLAZY should
> > > > > > > > > indeed
> > > > > > > > > be
> > > > > > > > > ignored for edge type interrupts. That's something which the
> > > > > > > > > controller
> > > > > > > > > should signal via a irqchip flag and the core code can act
> > > > > > > > > upon it and
> > > > > > > > > ignore UNLAZY for edge type interrupts.
> > > > > > > > > 
> > > > > > > > > But that won't fix the problem at hand. Let's take a step back
> > > > > > > > > and
> > > > > > > > > look
> > > > > > > > > at the larger picture whether this can be reliably "fixed" at
> > > > > > > > > all.
> > > > > > > > >   
> > > > > > > > 
> > > > > > > > Yeah, I'm still trying to figure when it's correct for a device
> > > > > > > > to do
> > > > > > > > UNLAZY? If I'm
> > > > > > > > understanding things, devices that rely on disable_irq*() should
> > > > > > > > set
> > > > > > > > it?  
> > > > > > > 
> > > > > > > Not necessarily. In most cases devices are not re-raising
> > > > > > > interrupts
> > > > > > > before the previous one has been handled and acknowledged in the
> > > > > > > device.  
> > > > > 
> > > > > Usage of UNLAZY should never affect correctness. It's "only" a
> > > > > performance optimisation which has a positive effect if it's expected
> > > > > that an irq event happens while it's masked.
> > > > >   
> > > > > > > > Because problem #2 is something that needs to be handled at the
> > > > > > > > controller and core level if I got you right.  
> > > > > > > 
> > > > > > > Yes. We need a irqchip flag for that.
> > > > > > >   
> > > > > > > > > > Ack. If there is no way to read back the line state and it's
> > > > > > > > > > unknown
> > > > > > > > > > if
> > > > > > > > > > the irq controller suffers from problem #2, the only way to
> > > > > > > > > > still
> > > > > > > > > > benefit from the irq is to not use IRQ_DISABLE_UNLAZY and
> > > > > > > > > > only act
> > > > > > > > > > on
> > > > > > > > > > each 2nd irq; or ignore irqs based on timing. That doesn't
> > > > > > > > > > sound
> > > > > > > > > > very
> > > > > > > > > > robust though, so maybe the driver has to fall back on
> > > > > > > > > > polling the
> > > > > > > > > > status register and not use irqs at all in that case.  
> > > > > > > > > 
> > > > > > > > > Actually ignoring the first interrupt after a SPI transfer and
> > > > > > > > > waiting
> > > > > > > > > for the next conversion to raise the interrupt again should be
> > > > > > > > > robust
> > > > > > > > > enough. The ADC has to be in continous conversion mode for
> > > > > > > > > that
> > > > > > > > > obviously.
> > > > > > > > >   
> > > > > > > > Might be the only sane option we have, Uwe? If we do this, we
> > > > > > > > could be
> > > > > > > > dropping valid samples but only with controllers that suffer
> > > > > > > > from
> > > > > > > > #2.  
> > > > > > > 
> > > > > > > No. You have the same problem with the controllers which do not
> > > > > > > disable
> > > > > > > the edge detection logic.
> > > > > > > 
> > > > > > > The interrupt controller raises the interrupt on unmask
> > > > > > > (enable_irq()).
> > > > > > > Depending on timing the device handler might be invoked _before_
> > > > > > > the
> > > > > > > sample is ready, no?  
> > > > > > 
> > > > > > For those controllers, I think it's almost always guaranteed that
> > > > > > the first
> > > > > > IRQ
> > > > > > after enable is not really a valid sample. We'll always have some
> > > > > > SPI
> > > > > > transfer
> > > > > > (that should latch an IRQ on the controller) before enable_irq().  
> > > > > 
> > > > > The first irq isn't a valid sample unless the driver is preempted
> > > > > between the spi transfer and the following enable_irq() such that the
> > > > > irq event triggered by the SPI transfer doesn't result in calling the
> > > > > irq handler before the sample is ready. I guess that's what you ruled 
> > > > 
> > > > I guess that race we could prevent by disabling IRQs...
> > > > 
> > > > > out by saying "almost always"? I'd recommend to not rely on that.
> > > > > Chips
> > > > > become faster (and so conversion time shorter) which widens the race
> > > > > window and if you become unsynchronized and ignore every wrong second
> > > > > irq all samples become bogous.  
> > > > 
> > > > Right now we set UNLAZY and that brings this difference in behavior
> > > > depending on
> > > > the IRQ controller we have. But if we remove that change and make sure
> > > > there can
> > > > be no race between enable_irq() and the last spi_transfer, it should be
> > > > safe to
> > > > assume the first time we get in the handler is not for a valid sample.
> > > > Not sure
> > > > synchronization could be an issue to the point where you ignore all
> > > > samples. If
> > > > you ignore one IRQ, then the next one needs to be a valid sample (as
> > > > there
> > > > should be no spi_transfer in between). But not sure if it can affect
> > > > performance...
> > > 
> > > I think it is overly optimistic to assume that an interrupt controller
> > > will
> > > definitely catch both edges.  IIRC some of them have an interesting
> > > acknowledge
> > > dance before they can see an other edge at all.
> 
> Plus there is also the (probably only theoretic) race condition in
> combination with a controller suffering from #2 that the irq_enable()
> only happens after the conversion was done. Then the irq would be
> missed.
> 
> > > So it's also possible we only see one interrupt even though there were
> > > loads
> > > from the spi transfer (which can also trigger multiple if slow enough)
> 
> Ack, so I think we all agree now that the rdy-gpio is needed for
> reliable operation with irq. If that isn't available due to already
> finalized hardware, the only option is polling.
> 
> > > > I think right now, unless the IRQ controller suffers from #2, every time
> > > > we get
> > > > in the device handler after enable_irq() is not because of DRDY and
> > > > having a
> > > > valid sample or not is pure luck. 
> > > > 
> > > > > 
> > > > > So I still think the extra GPIO read should be implemented (as I
> > > > > proposed in
> > > > > https://lore.kernel.org/linux-iio/20241028160748.489596-9-u.kleine-koenig@baylibre.com/
> > > > > )
> > > > > to guarantee reliable operation. If that isn't possible the only
> > > > > really
> > > > > robust way to operate is using polling.  
> > > > 
> > > > My only issue with the gpio approach and your conversation with Thomas
> > > > seems to
> > > > prove it is that we're not guaranteed to be able to read the line. I
> > > > guess your
> > > > reasoning is that if we can't do that for a platform, then don't give
> > > > the gpio
> > > > in DT? But in that case, are we left with a device that might or might
> > > > not work?
> > > Now we have some input from Thomas, I'm happier that we basically have no
> > > choice
> > > for at least some controllers :(
> > 
> > Agreed. I'm not opposing to the GPIO change (even though not perfect) since
> > it's
> > better that what we have today and from this whole discussion, it also looks
> > like
> > there's not perfect solution anyways.
> 
> What is your concern that lets you say "not perfect"? Is it just that it
> won't work on every hardware?

Pretty much :)...

- Nuno Sá