linux-kernel - Re: [PATCH] PHYLIB: IRQ event workqueue handling fixes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20071018143752.GA3630@ff.dom.local>
Date:	Thu, 18 Oct 2007 16:37:52 +0200
From:	Jarek Poplawski <jarkao2@...pl>
To:	"Maciej W\. Rozycki" <macro@...ux-mips.org>
Cc:	Andy Fleming <afleming@...escale.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jeff Garzik <jgarzik@...ox.com>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] PHYLIB: IRQ event workqueue handling fixes

On Thu, Oct 18, 2007 at 12:30:35PM +0100, Maciej W. Rozycki wrote:
> On Wed, 17 Oct 2007, Jarek Poplawski wrote:
> 
> > I'm not sure free_irq() should maintain the depth count - rather warn
> > on not zero. But, IMHO, any activity on freed irq seems suspicious to
> > me (and doesn't look like very common), even if it's safe with current
> > implementation.
> 
>  No way to avoid it with DEBUG_SHIRQ.
> 
> > Yes, these DEBUG_SHIRQ checks are suspicious to me too, but they seem
> > to be reasonable only in the case of possible resent irqs (so not for
> > all irqs). On the other hand, it seems, proper irq handler with proper
> > hardware shouldn't have any problems with such a check.
> 
>  What do you mean by "proper irq handler with proper hardware"?  Using
> softirqs (they used to be called bottom-halves) is actually a natural way 
> of handling any interrupt which requires extensive processing.

Technically until free_irq returns a handler should respond to calls
and with proper hardware it should have no problem with checking if
it's its interrupt, even after disabling this hardware, because of
possible races.

But with a scenario like this:

- disable_irq()
- disable_my_hadrware_irq()
- clear_my_hardware_irq()
- free_irq()
- enable_irq()

it seems the handler should respond even after free_irq because there
could be still interrupts to resend, originally generated by its
hardware, so such behavior looks very suspicious, at least with some
type of interrupts.

So, I think, the idea of DEBUG_SHIRQ is generally right and very
useful - but, of course, there could be exceptions, which btw. could
try some hacks under DEBUG_SHIRQ too. And my opinion about
'properness' was very general (not about phy) too, just like my
'knowledge' of drivers.

> 
> > 1) phy_change() checks PHY_HALTED flag without lock; I think it's
> > racy: eg. if it's done during phy_stop() it can check just before
> > the flag is set and reenable interrupts just after phy_stop() ends.
> 
>  I remember having a look into it, but it was long ago and I cannot 
> immediately recall the conclusion.  Which means it is either broken or 
> deserves a comment as non-obvious.  I will have a look into it again, but 
> I am resource-starved a little at the moment, sorry.
> 
> > 2) phy_change() doesn't reenable irq line after it sees returns
> > with errors; IMHO it should at least write some warning, but maybe
> > try some safety plan, so enable_irq() and try to disable interrupts
> > and free_irq() on the next call (if it happens). (But, I can be very
> > wrong with this - maybe it's OK and official way.)
> 
>  No way to do this safely -- at this point the device probably still has 
> its interrupt output asserted and the register to clear it is 
> inaccessible, so enabling the line will enter an infinite loop.  At this 
> point the system is no longer stable, so it is better to keep at least 
> some functionality, so that it may be attempted to be shut down cleanly, 
> rather than make it completely irresponsive.  The alternative is panic().

Right! But then some warning could be useful, I presume.

> 
> > 3) phy_interrupt() checks PHY_HALTED flag without lock too, but I'm
> > not sure now if it could be dangerous after fixing #1; on the other
> > hand even if we know it's not our regular interrupt, with current
> > DEBUG_SHIRQ it could be easier to call schedule_work() anyway since
> > we are sure it's before/in free_irq() yet.
> 
>   See #1 above.
> 
> > 4) phy_interrupt() should check return value from schedule_work() and
> > enable irq on 0.
> 
>  No -- the work already pending will do that.

How? If schedule_work won't succeed because there is a pending one,
we did disable_irq_nosync 2x, so we would stay at least 1x too deep!

> 
> > 5) phy_stop_interrupts(): maybe I miss something, but it seems
> > phy_stop() is required before this, so maybe there should be a
> > comment on this?
> 
>  The API is documented in: Documentation/networking/phy.txt -- you are 
> welcome to improve.  If you do not want to get into the gory details, just 
> use the cooked interface and phy_disconnect() will do the dirty work for 
> you.
> 
> > 6) phy_stop_interrupts(): if I'm not wrong with #3 calling
> > phy_disable_interrupts() looks like we are not sure this phy_stop()
> > really works; than maybe a WARN_ON?
> 
>  WARN_ON what?

I've thought that phy_stop() could be needed before
phy_stop_interrupt() to set PHY_HALTED, but since it disables and
clears interrupts too, then there should be no need to repeat this,
maybe only check it's done. But if there is no such dependency, then
no point at all.

> 
> > 7) phy_stop_interrupts(): after above mentioned changes in
> > phy_interrupt(), and phy_changes() (always enable_irq()) I can't see
> > any reason why there should be more than one skipped enable_irq(),
> > so checking return from cancel_work_sync() shouldn't be enough
> > instead of this atomic counter.
> 
>  CONFIG_DEBUG_SHIRQ.  Barring this option, cancel_work_sync() could have 
> been moved to the front of free_irq().  I think I have seen this in 
> reality, with the interrupt line left stuck disabled afterwards, but I 
> will double check when I have an opportunity.  The approach implemented 
> with this patch does work, which is the important bit, and if 
> simplification is possible, then it may be applied later.
> 
> > 8) phy_stop_interrupts(): I'm not sure this additional call from
> > DEBUG_SHIRQ should be so dangerous, eg.:
> > 
> > 	/*
> > 	 * status == PHY_HALTED &&
> > 	 * interrupts are stopped after phy_stop()
> > 	 */
> > 	if (cancel_work_sync(...))
> > 		enable_irq();
> > 
> > 	free_irq(...);
> > 	/*
> > 	 * possible schedule_work() from DEBUG_SHIRQ only,
> > 	 * but proper check for PHY_HALTED is done;
> > 	 * so, let's flush after this too:
> > 	 */
> > 	cancel_work_sync();
> 
>  Well, if there is another handler registered on this line, you'll get 
> your interrupt line stuck disabled.

I'll rethink this yet...

> 
> > Of course, I don't know phy.c enough, so most of this can be terribly
> > wrong, then feel free to forget about this - I don't expect you should
> > waste any time for explaining me these things - after all they are
> > doubts only.
> 
>  I am not sure I know phy.c well enough either, ;-) and your concerns are 
> appreciated as interesting conclusions may develop.  If someone disagrees 
> with what I have written here, they are welcome to speak out too.

But, I've enough of other concerns too, so nothing urgent here...

Many thanks,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/