Date:	Fri, 17 May 2013 10:36:22 +0200
From:	Jean Delvare <khali@...ux-fr.org>
To:	Robert Norris <robn@...ra.com>, Daniel Kurtz <djkurtz@...omium.org>
Cc:	linux-kernel@...r.kernel.org, Linux I2C <linux-i2c@...r.kernel.org>
Subject: Re: PROBLEM: modprobe hang at startup (3.8.x, 3.9.x, IBM x3550)

Hi Robert,

On Thu, 16 May 2013 13:44:55 +1000, Robert Norris wrote:
> On Wed, May 15, 2013 at 09:49:23PM +0200, Jean Delvare wrote:
> > >     Interrupt: pin B routed to IRQ 0
> > 
> > Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
> > reason for this hang. Was it with the i2c-i801 driver loaded, or
> > blacklisted? Please check if it makes a difference.
> 
> That was without the driver loaded (blacklisted). After loading (with
> interrupts enabled) we get:
> 
>     Interrupt: pin B routed to IRQ 20

For the record, I also see the IRQ value change after loading the
i2c-i801 driver on my system (with an ICH10 south bridge): from 14 to
22 in my case. So it's a bit different (no IRQ 0) but still somewhat
similar, so I'm not sure whether this has anything to do with your
issue.

> 
> > Do you see the same (and more generally, this issue) on one, some or
> > all of your x3550 servers?
> 
> The issue has occurred on at least three x3550s (we have 11). I haven't
> tested more, because knowingly crashing production machines sucks.

Yes of course, I understand, I did not expect you to do that ;) 

> This appears to be the case on other machines. With the module
> blacklisted (never loaded), lspci shows IRQ 0. After load, IRQ 20.
> (tested on 3.4 and 3.9).

OK.

> > Are you using IPMI on these machines?
> 
> Yes, but only for monitoring/sensors, if that makes a difference.

IPMI is still likely to access the SMBus controller. If there's a BMC
in the machine, it can also access the SMBus slave with its own
controller. It would be good to rule this out by disabling IPMI
completely, removing the BMC from the machine if it has one, and
checking if it makes the issue go away or not.

> > I would appreciate if you could test the following:
> > * Blacklist i2c-i801 and ics932s401 so that none of them get
> >   auto-loaded.
> 
> Done.
> 
> > * Manually load i2c-i801 with interrupts enabled, and see what
> >   happens.
> 
> Returned immediately:
> 
> [   60.527140] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt

This confirms that the i2c-i801 driver loading itself isn't the problem.

> > * If no hang happens, load i2c-dev, find the i801 bus number with
> >   i2cdetect -l (from the i2c-tools package - it should be 4 according
> >   to what you reported so far but there is no guarantee that it won't
> >   change across reboots.)
> 
> $ i2cdetect -l
> i2c-0   i2c         Radeon i2c bit bus DVI_DDC          I2C adapter
> i2c-1   i2c         Radeon i2c bit bus VGA_DDC          I2C adapter
> i2c-2   i2c         Radeon i2c bit bus MONID            I2C adapter
> i2c-3   i2c         Radeon i2c bit bus CRT2_DDC         I2C adapter
> i2c-4   smbus       SMBus I801 adapter at 0440          SMBus adapter
> 
> > Then do a simple read from a random address
> >   with:
> >   # i2cget 4 0x50 0x00
> >   (Adjust the bus number as needed.)
> >   I am curious if this will hang as well or only when accessing the
> >   clock chip at address 0x69.
> 
> Yep, that one hangs. The hung task handler picked it up after a few
> minutes.

OK, this means that any transaction request to the SMBus controller
causes the hang.

The i2c-i801 driver is optimistically using wait_event() when waiting
for an interrupt to arrive. I suppose that the interrupt is never
delivered in your case (all 0 in /proc/interrupts.)
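
One way to check the "all 0" symptom without reading the file by eye is
to sum the per-CPU counters on the relevant /proc/interrupts line. Below
is a minimal user-space sketch in C. The function name irq_line_total is
hypothetical, and the code assumes the usual layout: a label ending in
":" followed by whitespace-separated per-CPU counts, then the chip and
device names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one line of /proc/interrupts.
 * Hypothetical helper, not part of any kernel or i2c-tools API.
 * Returns -1 if the line has no "label:" prefix.
 */
long long irq_line_total(const char *line)
{
    const char *p;
    long long total = 0;

    assert(line != NULL);
    p = strchr(line, ':');
    if (!p)
        return -1;
    p++;

    for (;;) {
        char *end;
        unsigned long long n;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;              /* reached the chip/device names */
        n = strtoull(p, &end, 10);
        /* A counter is followed by whitespace; "20-fasteoi" is not one. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}
```

Feeding it the i801_smbus line after a hung i2cget would be expected to
give 0 if the interrupt is indeed never delivered.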

Daniel, shouldn't we use wait_event_timeout() instead to catch issues
like this and fail cleanly? Maybe even fall back to polling
automatically?
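
To illustrate the idea only, here is a user-space analogy in C using
pthread_cond_timedwait(). It is not i2c-i801 code and not the kernel
wait_event_timeout() API. The struct, the enum and xfer_wait() are
hypothetical names, and hw_done merely stands in for a hardware status
bit. The waiter gives up after a deadline, then checks the status once
to tell a lost notification from a transfer that really timed out.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical transfer state, for illustration only. */
struct xfer {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    bool irq_seen;  /* set by the notification ("interrupt") path */
    bool hw_done;   /* stand-in for a hardware status bit */
};

enum xfer_result { XFER_IRQ, XFER_POLLED, XFER_TIMEOUT };

/* Wait for the notification, but never longer than timeout_ms. */
enum xfer_result xfer_wait(struct xfer *x, long timeout_ms)
{
    struct timespec deadline;
    enum xfer_result res;
    int rc = 0;

    /* Condition variables use CLOCK_REALTIME deadlines by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&x->lock);
    /* Stop on notification, on timeout, or on any other error. */
    while (!x->irq_seen && rc == 0)
        rc = pthread_cond_timedwait(&x->done_cv, &x->lock, &deadline);

    if (x->irq_seen)
        res = XFER_IRQ;       /* normal, notification-driven completion */
    else if (x->hw_done)
        res = XFER_POLLED;    /* work finished, notification was lost */
    else
        res = XFER_TIMEOUT;   /* fail cleanly instead of hanging */
    pthread_mutex_unlock(&x->lock);
    return res;
}
```

With an unbounded wait, the second and third cases would both block
forever, which matches the hang reported here.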

-- 
Jean Delvare