linux-kernel - Re: PROBLEM: modprobe hang at startup (3.8.x, 3.9.x, IBM x3550)

Message-ID: <20130515214923.036dabdb@endymion.delvare>
Date:	Wed, 15 May 2013 21:49:23 +0200
From:	Jean Delvare <khali@...ux-fr.org>
To:	Robert Norris <robn@...ra.com>
Cc:	linux-kernel@...r.kernel.org, Linux I2C <linux-i2c@...r.kernel.org>
Subject: Re: PROBLEM: modprobe hang at startup (3.8.x, 3.9.x, IBM x3550)

Robert,

On Wed, 15 May 2013 21:27:41 +1000, Robert Norris wrote:
> On Wed, May 15, 2013 at 11:20:44AM +0200, Jean Delvare wrote:
> > Can you share the full output of lspci -s 00:1f.3 -vv?
> 
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
>     Subsystem: IBM Device 02dd
>     Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>     Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>     Interrupt: pin B routed to IRQ 0

Hmm, this "IRQ 0" is quite odd. I wonder whether it could be the
cause of the hang. Was this output taken with the i2c-i801 driver
loaded, or blacklisted? Please check whether it makes a difference.
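One way to answer that question is to read the IRQ number the kernel exposes in sysfs (normally the value lspci reports as "routed to IRQ n") together with the driver currently bound to the device. Below is a minimal Python sketch. It assumes the usual sysfs layout (/sys/bus/pci/devices/<address>/irq and a "driver" symlink that exists only while a driver is bound); irq_report is a hypothetical helper name.

```python
from pathlib import Path


def irq_report(bdf, sysfs_root="/sys"):
    """Return (irq, driver) for the PCI device at address `bdf`.

    irq is the number found in <device>/irq; driver is the name of the
    bound driver, or None when nothing is bound (e.g. i2c-i801 blacklisted).
    """
    dev = Path(sysfs_root, "bus", "pci", "devices", bdf)
    irq = int((dev / "irq").read_text().strip())
    driver_link = dev / "driver"  # assumption: symlink present only while bound
    driver = driver_link.resolve().name if driver_link.exists() else None
    return irq, driver


if __name__ == "__main__":
    # 00:1f.3 is the SMBus controller from the lspci output (domain 0000 assumed).
    print(irq_report("0000:00:1f.3"))
```

Running it once with i2c-i801 blacklisted and once with it loaded shows whether the "IRQ 0" only appears while no driver has enabled the device.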

Do you see the same thing (and, more generally, this hang) on one,
some or all of your x3550 servers?

Are you using IPMI on these machines?

>     Region 4: I/O ports at 0440 [size=32]
> 
> > I'm also curious if the SMBus controller shares its interrupt line
> > with another chip. /proc/interrupts should tell but you'll have to
> > make one of your systems hang again.
> 
> I'm not sure how to read it, so here it is (3.9.2, immediately after
> boot, no options to i2c_i801):
> 
>            CPU0       CPU1       CPU2       CPU3       
> (...)
>  20:          0          0          0          0   IO-APIC-fasteoi   i801_smbus

Here the IRQ looks correct, and it isn't shared. But I am surprised
that the counters are all 0. If an SMBus transaction had been
attempted, there should be a 1 somewhere, even if the transaction
ultimately failed.
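To check the counters without reading the table by eye, the per-CPU columns of /proc/interrupts can simply be summed for the line that names i801_smbus. The following Python function is a minimal sketch, assuming the layout shown above (a header row with one "CPUn" column per CPU, then "<irq>: <counts> <chip> <device>[, <device>...]"); irq_counts is a hypothetical name. A nonzero total means at least one SMBus interrupt was delivered.

```python
def irq_counts(text, name):
    """Return {irq: total count over all CPUs} for each line listing `name`."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= ncpus:
            continue  # rows such as "ERR:" carry a single counter and no device
        # Shared IRQs list several devices separated by ", ".
        devices = [f.rstrip(",") for f in fields[1 + ncpus:]]
        if name in devices:
            irq = fields[0].rstrip(":")
            totals[irq] = sum(int(c) for c in fields[1:1 + ncpus])
    return totals


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        print(irq_counts(f.read(), "i801_smbus"))
```

Because the device list is split on commas, the same check also reveals whether the SMBus controller shares its line with another device.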

> (...)
> I went with blacklisting for now because this driver doesn't appear to
> be doing anything useful for us (sensors etc are working without it).
> I'll confess to not really knowing much about its purpose though.

It all depends on which I2C/SMBus slaves are connected to the SMBus.
Often these are the SPD EEPROMs of your memory modules, sometimes
with integrated thermal sensors (DDR3 only; the driver is jc42). In
your case there is also a clock chip, for which IBM contributed a
driver.

> > (...)
> > As far as debugging goes, please tell me if you have any I2C/SMBus
> > slave device driver loaded (check in /sys/bus/i2c/drivers.) Loading the
> > i2c-i801 driver doesn't do much on its own if there are no slave device
> > drivers using it.
> 
> $ modprobe i2c-i801 disable_features=0x10
> $ dmesg | tail
> ...
> [28876.193408] i801_smbus 0000:00:1f.3: Interrupt disabled by user
> [28876.201168] ics932s401 4-0069: ics932s401 chip found
> $ ls /sys/bus/i2c/drivers
> dummy  ics932s401

The dummy driver is a helper stub for i2c-core; it doesn't actually
access the SMBus. ics932s401 is for the clock chip, and I know clock
chips can be tricky and error-prone. On the other hand, I can only
assume that IBM had a good reason to contribute the driver and make it
auto-load on the x3550.
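To repeat the /sys/bus/i2c/drivers check on other machines, and to see which bus and address each slave driver is actually bound to, something like the following Python sketch could be used. It assumes the usual sysfs layout, where each bound client appears in its driver's directory as "<bus>-<4-digit hex address>" (e.g. 4-0069 for the clock chip found above), and it skips the dummy stub since that never touches the bus. bound_i2c_clients is a hypothetical helper name.

```python
import re
from pathlib import Path

# Assumption: client devices are named "<bus>-<address as 4 hex digits>".
CLIENT_RE = re.compile(r"(\d+)-([0-9a-f]{4})")


def bound_i2c_clients(sysfs_root="/sys"):
    """Return {driver: [(bus, address), ...]} for every real I2C slave driver."""
    drivers_dir = Path(sysfs_root, "bus", "i2c", "drivers")
    result = {}
    for drv in sorted(drivers_dir.iterdir()):
        if drv.name == "dummy":  # i2c-core helper stub, no bus access
            continue
        clients = []
        for entry in drv.iterdir():  # also holds bind/unbind/uevent files
            m = CLIENT_RE.fullmatch(entry.name)
            if m:
                clients.append((int(m.group(1)), int(m.group(2), 16)))
        result[drv.name] = sorted(clients)
    return result


if __name__ == "__main__":
    for driver, clients in bound_i2c_clients().items():
        print(driver, [f"bus {b}, 0x{a:02x}" for b, a in clients])
```

On the system described above this would be expected to list only ics932s401 on bus 4 at 0x69, which makes it easy to spot machines where other slave drivers are also in play.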

I would appreciate it if you could test the following:
* Blacklist both i2c-i801 and ics932s401 so that neither gets
  auto-loaded.
* Manually load i2c-i801 with interrupts enabled (that is, without
  disable_features=0x10), and see what happens.
* If no hang happens, load i2c-dev and find the i801 bus number with
  i2cdetect -l (from the i2c-tools package). It should be 4 according
  to what you reported so far, but there is no guarantee that it won't
  change across reboots. Then do a simple read from some other
  address, for example:
  # i2cget 4 0x50 0x00
  (Adjust the bus number as needed.)
  I am curious whether this will hang as well, or whether only
  accesses to the clock chip at address 0x69 hang.
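Since the bus number may change across reboots, it can also be looked up from sysfs instead of being read off the i2cdetect -l listing. The Python sketch below assumes that each adapter exposes a "name" attribute under /sys/bus/i2c/devices/i2c-N, and that i2c-i801 names its adapter "SMBus I801 adapter at <iobase>" (0440 on this machine, going by Region 4 in the lspci output). Both the prefix and the helper name find_i801_buses are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumption: the i2c-i801 adapter name starts with this prefix.
I801_PREFIX = "SMBus I801 adapter"


def find_i801_buses(sysfs_root="/sys"):
    """Return the sorted bus numbers of adapters whose name starts with I801_PREFIX."""
    buses = []
    for entry in Path(sysfs_root, "bus", "i2c", "devices").iterdir():
        m = re.fullmatch(r"i2c-(\d+)", entry.name)  # skip clients like "4-0069"
        if not m:
            continue
        name_file = entry / "name"
        if name_file.is_file() and name_file.read_text().startswith(I801_PREFIX):
            buses.append(int(m.group(1)))
    return sorted(buses)


if __name__ == "__main__":
    print(find_i801_buses())
```

The number it prints is the one to use in place of 4 in the i2cget command.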

Thanks,
-- 
Jean Delvare