Message-Id: <1220429845.32688.182.camel@bodhitayantram.eng.vmware.com>
Date: Wed, 03 Sep 2008 01:17:25 -0700
From: Zachary Amsden <zach@...are.com>
To: "Brandeburg, Jesse" <jesse.brandeburg@...el.com>
Cc: Qicheng Christopher Li <chrisl@...are.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"arvidjaar@...l.ru" <arvidjaar@...l.ru>,
"Allan, Bruce W" <bruce.w.allan@...el.com>,
"jeff@...zik.org" <jeff@...zik.org>,
"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>,
"Ronciak, John" <john.ronciak@...el.com>,
"Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@...el.com>,
Pratap Subrahmanyam <pratap@...are.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"mm-commits@...r.kernel.org" <mm-commits@...r.kernel.org>
Subject: RE: + e1000e-prevent-corruption-of-eeprom-nvm.patch added to -mm tree
On Tue, 2008-09-02 at 17:32 -0700, Brandeburg, Jesse wrote:
> Zachary Amsden wrote:
> >>> The EEPROM corruption is triggered by concurrent access to the
> >>> EEPROM read/write paths. Putting a lock around them solves the problem.
> >>>
> >> <snip> I'd like to know how this actually solves a problem outside
> >> vmware.
> >
> > Yes. This was observed on a physical E1000 device.
>
> Hi Zach, thanks for the reply, can you give me a reproduction test so
> that we can verify this fix internally? We are extremely interested.
>
> <long explanation follows>
This was encountered by a customer and fixed by a developer who no
longer works for us. We can't give any better explanation than -
1) It looks like there is a massive catastrophic failure that causes
working hardware to DIE.
2) The E1000 driver is really crazily complicated and has to deal with
many different hardware versions, each with its own idiosyncrasies
around EEPROM locking, and it isn't really clear how the hardware
locking protects multiple racing kernel threads in all of these cases.
3) Casual inspection by several people who are not networking experts
shows several plausible paths where EEPROM corruption could reasonably
occur.
4) Adding locking around the EEPROM read and write paths resulted in no
further problem reports from our customers.
5) It seems users on LKML are hitting catastrophic failures where
EEPROMs are getting corrupted.
Thus, the obvious answer to me is ... LOCK AROUND EEPROM READS AND
WRITES now and figure out if there are deeper issues due to network
layer and E1000 complexities later, but ELIMINATE THE CATASTROPHIC
FAILURES ASAP.
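To make the suggestion concrete, the change we carried amounts to the
kind of serialization sketched below. This is only an illustration of
the idea; the function and lock names (e1000_eeprom_mutex,
e1000_read_eeprom_locked, and so on) are placeholders I'm using here,
not the actual symbols from our fix or from the upstream driver.

/*
 * Sketch only: serialize every EEPROM/NVM access behind one mutex.
 * The names below are illustrative placeholders, not real e1000/e1000e
 * symbols.
 */
#include <linux/mutex.h>

static DEFINE_MUTEX(e1000_eeprom_mutex);	/* one lock for all NVM paths */

static s32 e1000_read_eeprom_locked(struct e1000_hw *hw, u16 offset,
				    u16 words, u16 *data)
{
	s32 ret;

	mutex_lock(&e1000_eeprom_mutex);
	ret = e1000_read_eeprom(hw, offset, words, data);  /* existing read path */
	mutex_unlock(&e1000_eeprom_mutex);
	return ret;
}

static s32 e1000_write_eeprom_locked(struct e1000_hw *hw, u16 offset,
				     u16 words, u16 *data)
{
	s32 ret;

	mutex_lock(&e1000_eeprom_mutex);
	ret = e1000_write_eeprom(hw, offset, words, data); /* existing write path */
	mutex_unlock(&e1000_eeprom_mutex);
	return ret;
}

Callers in process context would go through the *_locked wrappers;
whether that is safe in every context the driver touches the NVM from
(reset paths, ethtool, error handling) is exactly the deeper question
that can be sorted out later.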
> We are definitely investigating, please help us reproduce it if you can.
I'm sorry, but we have no way to reproduce this, we don't know anything
about the hardware the customer was using, and we have almost zero
context on this bug because the developer who investigated and fixed it
is long gone. We only have statistical data showing 0% reproduction
since we applied a fix very similar to the one you just NACKed.
I know it's not a very useful thing to say.
But seriously, if people's hardware is being hosed, you must take every
precaution in the world to stop it and worry about 'logical' arguments
regarding invariants in the network and driver layers later.
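For what it's worth, if someone at Intel wants to try to tickle the
suspected race, a crude user-space hammer like the one below, run in
several parallel copies while the driver is otherwise busy (resets,
checksum reads), is the sort of thing I would start with. This is only
a guess at a stress test, not a confirmed reproducer; the interface
name and loop count are arbitrary.

/*
 * Crude stress test: hammer the read-only EEPROM dump ioctl in a tight
 * loop. Run several copies concurrently against the same interface
 * (root may be required). Purely a guess at a reproducer, not
 * something we have verified.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	const char *ifname = argc > 1 ? argv[1] : "eth0";
	struct {
		struct ethtool_eeprom ee;
		unsigned char data[128];
	} req;
	struct ifreq ifr;
	int fd, i;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	for (i = 0; i < 100000; i++) {
		memset(&req, 0, sizeof(req));
		req.ee.cmd = ETHTOOL_GEEPROM;	/* read-only EEPROM dump */
		req.ee.offset = 0;
		req.ee.len = sizeof(req.data);
		ifr.ifr_data = (void *)&req;
		if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
			perror("ETHTOOL_GEEPROM");
			break;
		}
	}
	close(fd);
	return 0;
}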
Zach
--