netdev - Re: Driver SFC: Possible bug in LM87 temperature XFP detection code


Message-Id: <1240929844.10689.35.camel@localhost.localdomain>
Date:	Tue, 28 Apr 2009 14:44:04 +0000
From:	Jesper Dangaard Brouer <hawk@...x.dk>
To:	Ben Hutchings <bhutchings@...arflare.com>
Cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Driver SFC: Possible bug in LM87 temperature XFP detection code

On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> > 
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation.  The NICs use a Solarflare chip and the SFC driver.
> > 
> > If I unplug the fiber cable, I start getting these errors:
> > 
> > +--------
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > 
> >  sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> >  sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> > 
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
> 
> Right.  And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.

In my situation the link does not go down due to the temperature issue.
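
For reference, the status bytes in the log above can be read as a bit
mask.  The following is only a minimal sketch, not code from
drivers/net/sfc/boards.c: the bit values 0x10 and 0x20 are an assumption
inferred from the log lines (30:00 prints INTERNAL EXTERNAL, 10:00 prints
INTERNAL), and decode_alarms1() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Alarm bits in the first status byte.  These values are an assumption
 * inferred from the log (0x30 -> INTERNAL EXTERNAL, 0x10 -> INTERNAL). */
#define ALARM_TEMP_INT 0x10 /* assumed: internal temperature sensor */
#define ALARM_TEMP_EXT 0x20 /* assumed: external temperature sensor */

/* Hypothetical helper: format the sensor names flagged in alarms1 into
 * buf, using the same " INTERNAL" / " EXTERNAL" suffixes as the log. */
static const char *decode_alarms1(unsigned int alarms1, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%s%s",
		 (alarms1 & ALARM_TEMP_INT) ? " INTERNAL" : "",
		 (alarms1 & ALARM_TEMP_EXT) ? " EXTERNAL" : "");
	return buf;
}
```

Under that assumption, decode_alarms1(0x30, buf, sizeof(buf)) yields
" INTERNAL EXTERNAL" and decode_alarms1(0x10, buf, sizeof(buf)) yields
" INTERNAL", matching the two kinds of log lines quoted above.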


> > The real issue is that I cannot get the device up and running again
> > after lowering the temperature.  Only if I unload and load the sfc
> > driver, then I can get the device running again.
> > 
> > I'm thinking perhaps a PHY power-up is missing after the
> > temperature alarm has gone?
> 
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.

In my case, putting the PHY in PHY_MODE_LOW_POWER does not help lower
the temperature.  The errors are continuous until I apply "manual"
airflow ;-)
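
The recovery asked about above (powering the PHY up again once the alarm
has cleared) could be sketched roughly as below.  This is a standalone
illustration under stated assumptions, not a patch against efx_monitor():
struct board_monitor, monitor_init(), monitor_poll() and the enum values
are all hypothetical names, and the "N consecutive clear polls" rule is an
assumed policy chosen to avoid flapping while the board is still hot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PHY power states; they only mirror the idea of
 * PHY_MODE_LOW_POWER and are not the driver's definitions. */
enum phy_power { PHY_ON, PHY_LOW_POWER };

struct board_monitor {
	enum phy_power phy;
	unsigned int clear_polls;  /* consecutive polls without an alarm */
	unsigned int clear_needed; /* clear polls required before power-up */
};

static void monitor_init(struct board_monitor *m, unsigned int clear_needed)
{
	assert(m != NULL);
	m->phy = PHY_ON;
	m->clear_polls = 0;
	m->clear_needed = clear_needed;
}

/* Process one sensor poll.  Returns true if the PHY state changed. */
static bool monitor_poll(struct board_monitor *m, bool alarm)
{
	assert(m != NULL);
	if (alarm) {
		/* Any alarm restarts the cool-down count. */
		m->clear_polls = 0;
		if (m->phy == PHY_ON) {
			m->phy = PHY_LOW_POWER; /* shut down to limit damage */
			return true;
		}
		return false;
	}
	/* No alarm: power up only after enough consecutive clear polls. */
	if (m->phy == PHY_LOW_POWER && ++m->clear_polls >= m->clear_needed) {
		m->phy = PHY_ON;
		m->clear_polls = 0;
		return true;
	}
	return false;
}
```

With monitor_init(&m, 3), the first alarm moves the PHY to low power, and
it returns to PHY_ON only after three polls in a row report no alarm; an
alarm in between resets the count.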


> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> > 
> > 
> > To Ben: do you have anything you want me to try?  Do you want to fix this
> > yourself, or can you give me some code hints or patches to try out?
> 
> I don't intend to fix this myself.  If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c.  However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.

I see your point, I don't want to damage the board... not sure I want to
fix it then... Although in a production environment, I think the driver
should support exchanging a failed XFP without rebooting the server.

I also think that we should make the error message a bit more
explicit, in order to warn people before the board is permanently
damaged.  I'll post a patch proposal as a reply to this message...

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer
