netdev - Re: Mark IPW2100 as BROKEN: Fatal interrupt. Scheduling firmware restart.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20080926055651.GA19560@2ka.mipt.ru>
Date:	Fri, 26 Sep 2008 09:56:52 +0400
From:	Evgeniy Polyakov <johnpol@....mipt.ru>
To:	Alan Cox <alan@...rguk.ukuu.org.uk>
Cc:	Arjan van de Ven <arjan@...radead.org>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, ipw2100-devel@...ts.sourceforge.net,
	linux-wireless@...r.kernel.org, yi.zhu@...el.com,
	reinette.chatre@...el.com, jgarzik@...ox.com,
	linville@...driver.com, davem@...emloft.net,
	holtmann@...ux.intel.com, mjg59@...f.ucam.org
Subject: Re: Mark IPW2100 as BROKEN: Fatal interrupt. Scheduling firmware restart.

On Sun, Sep 21, 2008 at 08:57:44PM +0100, Alan Cox (alan@...rguk.ukuu.org.uk) wrote:
> > Getting the fact, that rmmod/insmod does not always fix the problem (but
> > most of the time for a short period of time), I again want to point,
> > that it looks like a firmware problem related to some inner timings. You
> > ask me to fix the driver and do not even listen to what I said
> > previously and do not get that into account and analyze.
> 
> Try putting it into D3 counting to 10 and powering it back up. Thats
> about as close as you can get to pulling the plug when it hangs.

I made several experimetns with power states in reset handler,
like put to d3 (hot), disable device, save/resetore states.
Fatal interrupts continue to fire with essentially the same rate.

The same error address does not always contain the same error value, but
frequently it is finit small set.

Here are some data:
[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501,
	inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11
	(level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at
	offset 0x1 (was 0x2900002, writing 0x2900006)

That is quite harmless, since interrupt handler just sees that device is
dissapearing. This brought me to think more about interrupt processing
(irq handler and related tasklet), and I found races between interrupt
tasklet, ipw2100_wx_event_work() handler, reset task and probably
others. Register access in some cases are proteceted by lock (interrupt
handler), and in some cases is not (all others). Although every user
first disables interrupts, but it can be handled right now and scheduled
tasklet already. Also priv->status field is frequently accessed and
modified with and without locks. This may be harmless, but still a red
flag.

Another data about the same failed address:
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta:
0x40000000

values are quite limited, but I saw at different address wider set of
different data, but still limited. Addresses and values of the fatal
interrupt do not follow some immediately obvious law, so this looks more
like firmware losts its mind. The reason for this actually may be a
register access races described above. I will continue experiments with
this.

-- 
	Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html