Message-Id: <1240929844.10689.35.camel@localhost.localdomain>
Date: Tue, 28 Apr 2009 14:44:04 +0000
From: Jesper Dangaard Brouer <hawk@...x.dk>
To: Ben Hutchings <bhutchings@...arflare.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Driver SFC: Possible bug in LM87 temperature XFP detection code
On Tue, 2009-04-28 at 14:36 +0100, Ben Hutchings wrote:
> On Tue, 2009-04-28 at 11:36 +0200, Jesper Dangaard Brouer wrote:
> > Hi Ben,
> >
> > I have borrowed some SMC10GPCIe-XFP NICs directly from SMC for
> > evaluation. The NICs use a Solarflare chip and the SFC driver.
> >
> > If I unplug the fiber cable, I start getting these errors:
> >
> > +--------
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 30:00) INTERNAL EXTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> >
> > sfc 0000:12:00.0: ERR: eth88 LM87 detected a hardware failure (status 10:00) INTERNAL
> > sfc 0000:12:00.0: ERR: eth88 Board sensor reported fault; shutting down PHY
> > +---------
> >
> > Reading through the driver code (drivers/net/sfc/boards.c), this problem
> > is related to temperature.
>
> Right. And the sensors are not polled while the link is up, on the
> assumption that a temperature or voltage fault will cause the link to go
> down, and because bit-banged I2C will reduce throughput slightly.
In my situation the link does not go down due to the temperature issue.
> > The real issue is that I cannot get the device up and running again
> > after lowering the temperature. Only if I unload and reload the sfc
> > driver can I get the device running again.
> >
> > I'm thinking that perhaps a PHY power-up is missing after the
> > temperature alarm has gone?
>
> We considered it most important to shut down the board to prevent or
> mitigate damage, and did not implement any recovery beyond that.
In my case, putting the PHY in PHY_MODE_LOW_POWER does not help lower
the temperature. The errors are continuous until I apply "manual"
airflow ;-)
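
For reference, the status bytes in the log quoted above can be read as a
bitmask. The sketch below is only an illustration: the bit values are
inferred from the log lines themselves (0x30 prints INTERNAL EXTERNAL,
0x10 prints INTERNAL), and the macro and function names are hypothetical,
not copied from the driver.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed bit values, inferred from the log output in this thread:
 * status 30:00 -> INTERNAL EXTERNAL, status 10:00 -> INTERNAL.
 * The names are illustrative, not the driver's own. */
#define ALARM_TEMP_INTERNAL 0x10
#define ALARM_TEMP_EXTERNAL 0x20

/* Write the names of the temperature alarms set in the first status
 * byte into buf, each preceded by a space (as in the log lines). */
static void decode_temp_alarms(unsigned int status1, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (status1 & ALARM_TEMP_INTERNAL) ? " INTERNAL" : "",
		 (status1 & ALARM_TEMP_EXTERNAL) ? " EXTERNAL" : "");
}
```

Read this way, the first two errors report both the internal and the
external temperature sensor over their limit, and the third only the
internal one.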
> > I'm using kernel 2.6.30-rc1-net-next-00664-gd93fe1a.
> >
> >
> > To Ben: do you have anything you want me to try? Do you want to fix this
> > yourself, or can you give me some code hints or patches to try out?
>
> I don't intend to fix this myself. If you want to try implementing this
> then you should start by looking at efx_monitor() in efx.c. However, I
> think your time might be better spent in fixing the air flow in the
> computer before the board is permanently damaged.
I see your point, I don't want to damage the board... not sure I want to
fix it then... Although in a production environment, I think the driver
should support exchanging a failed XFP without rebooting the server.
I also think that we should make the error message a bit more
explicit, in order to warn people before the board is permanently
damaged. I'll post a patch proposal as a reply to this message...
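
To make the recovery idea concrete, here is a standalone sketch of one
possible shape for it: shut the PHY down on a sensor fault, and only
bring it back after several consecutive fault-free polls. This is not
driver code and not the promised patch. All names, and the threshold of
three clean polls, are assumptions made for illustration; a real version
would have to hook into efx_monitor().

```c
#include <stdbool.h>

/* Hypothetical states and bookkeeping; none of this is from the driver. */
enum phy_state { PHY_RUNNING, PHY_SHUT_DOWN };

struct board_monitor {
	enum phy_state state;
	unsigned int clear_polls;	/* consecutive fault-free polls */
};

/* Assumed threshold: require this many clean polls before power-up,
 * so a sensor hovering around its limit does not make the PHY flap. */
#define CLEAR_POLLS_REQUIRED 3

/* Process one sensor poll. Returns true if the PHY state changed. */
static bool monitor_tick(struct board_monitor *m, bool fault)
{
	if (fault) {
		m->clear_polls = 0;
		if (m->state == PHY_RUNNING) {
			m->state = PHY_SHUT_DOWN;	/* protect the board */
			return true;
		}
		return false;
	}

	if (m->state == PHY_SHUT_DOWN &&
	    ++m->clear_polls >= CLEAR_POLLS_REQUIRED) {
		m->state = PHY_RUNNING;		/* alarm gone: power up */
		m->clear_polls = 0;
		return true;
	}
	return false;
}
```

The delay before power-up is the design choice here: shutting down
immediately but recovering slowly keeps the protective behaviour while
avoiding the need to reload the module.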
--
Med venlig hilsen / Best regards
Jesper Brouer
ComX Networks A/S
Linux Network developer
Cand. Scient Datalog / MSc.
Author of http://adsl-optimizer.dk
LinkedIn: http://www.linkedin.com/in/brouer