linux-kernel - Re: [PATCH] ad_sigma_delta: fix race between IRQ and completion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20221231142832.49cb47c7@jic23-huawei>
Date:   Sat, 31 Dec 2022 14:28:32 +0000
From:   Jonathan Cameron <jic23@...nel.org>
To:     Daniel Beer <dlbeer@...il.com>
Cc:     linux-iio@...r.kernel.org, Lars-Peter Clausen <lars@...afoo.de>,
        Michael Hennerich <Michael.Hennerich@...log.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ad_sigma_delta: fix race between IRQ and completion

On Sat, 24 Dec 2022 15:31:58 +1300
Daniel Beer <dlbeer@...il.com> wrote:

> On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote:
> > > ad_sigma_delta waits for a conversion which terminates with the firing
> > > of a one-shot IRQ handler. In this handler, the interrupt is disabled
> > > and a completion is set.
> > > 
> > > Meanwhile, the thread that initiated the conversion is waiting on the
> > > completion to know when the conversion happened. If this wait times out,
> > > the conversion is aborted and IRQs are disabled. But the IRQ may fire
> > > anyway between the time the completion wait times out and the disabling
> > > of interrupts. If this occurs, we get a double-disabled interrupt.  
> > 
> > Ouch and good work tracking it down.  just to check, did you see this
> > bug happen in the wild or spotted by code inspection?  
> 
> Hi Jonathan,
> 
> Thanks for reviewing. It was by inspection -- I'd originally thought
> about it and fixed in in a similar way in this patch:
> 
>     https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@mx.google.com/
> 
> But since that's not applied, I thought I'd better put together a
> separate fix for the time being.
> 
> > Given that timeout generally indicates hardware failure, I'm not sure
> > how critical this is to fix.  
> 
> Probably not very critical. I think you'd have to be pretty unlucky to
> encounter it.
> 
> > Is this fix sufficient?  If the interrupt is being handled on a different
> > CPU to the caller of this function, I think we can still race enough that
> > this fails to fix it up.  Might need a spinlock to prevent that.
> > 
> >   CPU 0                                        CPU 1
> > ad_sd_data_rdy_trig_poll()               ad_sd_wait_and_disable()
> >                                        
> >                                          //wait_for_completion ends
> > 					
> > Interrupt
> >                                           disable_irq()
> > 					  if (sigma-delta->irq_dis) !true	
> > 					  else
> > 						sigma_delta->irq_dis = true
> > 
> > disable_irq_nosync(irq)
> > sigma_delta->irq_dis = true;
> > 
> > So we still end up with a doubly disabled irq.  Add a spinlock to make the
> > disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.                  
> 
> My understanding is that the suffix-less version of disable_irq would
> wait for all running handlers on other CPUs (i.e.
> ad_sd_data_rdy_trig_poll) to finish before proceeding, which would
> prevent this from happening. Is that not the case?

Ah. I'd missed that - it is indeed documented as waiting for
pending irq handlers so that race doesn't exist.


> 
> But now that you mention it, there is another small problem: in the case
> where the conversion doesn't time out, the interrupt handler will call
> complete() and then perform some operations on the struct
> ad_sigma_delta.
> 
> This is always ok on a single processor, but if there are multiple CPUs
> there is possibly a brief period where both the interrupt handler and
> the waiting thread are accessing the ad_sigma_delta struct without
> synchronization between them.
> 
> Not sure if that's really a problem in practice, but I think an easy way
> to rule it out would just be to move the complete() call to the bottom
> of the handler and make sure it doesn't touch the structure again after
> that.

Hmm. At first glance, you are correct that moving completion later probably
makes sense. There may be some subtleties in that ordering though so I definitely want
some feedback from those more familiar with this driver than I am before
taking this patch or that further change.

Jonathan


> 
> Cheers,
> Daniel
>