netdev - Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug

Message-ID: <20140124064411.GC4361@voom.redhat.com>
Date:	Fri, 24 Jan 2014 17:44:11 +1100
From:	David Gibson <david@...son.dropbear.id.au>
To:	Manish Chopra <manish.chopra@...gic.com>
Cc:	Sony Chacko <sony.chacko@...gic.com>,
	Rajesh Borundia <rajesh.borundia@...gic.com>,
	netdev <netdev@...r.kernel.org>,
	"snagarka@...hat.com" <snagarka@...hat.com>,
	"tcamuso@...hat.com" <tcamuso@...hat.com>,
	"vdasgupt@...hat.com" <vdasgupt@...hat.com>
Subject: Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?)
 bug

On Thu, Dec 19, 2013 at 09:11:33AM +0000, Manish Chopra wrote:
> >> >From: David Gibson [mailto:david@...son.dropbear.id.au]
> >> >Sent: Tuesday, December 17, 2013 10:53 AM
> >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >> >Cc: netdev; snagarka@...hat.com; tcamuso@...hat.com;
> >> >vdasgupt@...hat.com
> >> >Subject: [0/2] netxen: bug fix and diagnostics for possible
> >> >(hardware?) bug
> >> >
> >> >At Red Hat, we've hit a couple of customer cases with crashes in the
> >> >netxen driver due to list corruption.  This seems to be very rarely
> >> >triggered, and unfortunately the dumps we have don't have enough
> >> >information to be certain of the cause, although we have a possible theory.
> >> >
> >> >I'm suggesting, therefore a patch to add some sanity checking which
> >> >should help to at least localize and mitigate the problem when someone
> >> >hits it in future.
> >> >Please let me know if there's a better approach to doing this.
> >> >
> >> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way,
> >> >but not one that could cause the symptoms we've seen.
> >>
> >> David,
> >>
> >> Having these checks in data path(Rx path) may have some performance
> >> impact. It's better to root cause it instead of putting some sanity
> >> checks.
> >
> >Obviously, but this was the best way I could think of to try narrowing down the
> >root cause (at least trying to eliminate driver vs. firmware bug).
> 
> David, instead of making permanent changes in the driver, can you please
> run your modified driver in selected customer environments where
> this issue is seen?

Yeah, the difficulty with that is that the problem has never triggered
twice for a single customer.  Well, technically there is one customer
that's hit it twice, but I'm pretty sure it was on entirely unrelated
systems in different parts of a large organisation.  The only reason I
can see enough cases to suspect a pattern to these problems is from
looking across Red Hat's whole case history.

> Which may give some data point that what's the issue exactly and then we go by that.
> 
> >
> >> We will get back to you on this.
> >
> >If you have a better idea for locating the root cause, please let me know.  I have
> >access to a vmcore which I can poke around in.
> 
> We will also try to reproduce the problem in our environment and debug this.
> Can you please give some details?

Apologies for the long delay, I'd been hoping for some more
confirmation of things, but it hasn't happened.  I'll give you what I
can.

> 1) what's the driver and firmware version used?

I'm not sure what the most useful way of giving a driver version is.
I've given the kernel version below, but it's an RH kernel, so I'm not
sure how much has been backported.

As to firmware, the driver reports:

netxen_nic 0000:04:00.0: Gen2 strapping detected
netxen_nic 0000:04:00.0: using 64-bit dma mask
netxen_nic: NX3031 Gigabit Ethernet Board S/N
<FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF>NX3031 Gigabit Ethernet  Chip rev 0x42
netxen_nic 0000:04:00.0: firmware v4.0.585 [legacy]

> 2) which operating system and kernel version?

RHEL5:

Linux hostname 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

> 3) please send the vmcore also with backtrace if available which can
> give some idea what can trigger this issue.

I can't send the vmcore itself, since it will include customer data.
I can give you the backtrace below, and look up specific things if you
can give me an idea of what you need:

crash> bt
PID: 0      TASK: ffff81207f8bd7e0  CPU: 44  COMMAND: "swapper"
 #0 [ffff81107fd33b40] crash_kexec at ffffffff800b0938
 #1 [ffff81107fd33c00] __die at ffffffff80065137
 #2 [ffff81107fd33c40] die at ffffffff8006c789
 #3 [ffff81107fd33c70] do_invalid_op at ffffffff8006cd49
 #4 [ffff81107fd33d30] error_exit at ffffffff8005dde9
    [exception RIP: list_del+71]
    RIP: ffffffff8015a793  RSP: ffff81107fd33de0  RFLAGS: 00010286
    RAX: 0000000000000058  RBX: 0000000000000427  RCX: ffffffff80323028
    RDX: ffffffff80323028  RSI: 0000000000000000  RDI: ffffffff80323020
    RBP: ffff81407f4e8680   R8: ffffffff80323028   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffc200104494a0
    R13: 0000000000000002  R14: ffff81107a2cf500  R15: ffff81407e1bf400
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff81107fd33de8] netxen_process_rcv_ring at ffffffff8830b050 [netxen_nic]
 #6 [ffff81107fd33eb8] netxen_nic_poll at ffffffff88306e71 [netxen_nic]
 #7 [ffff81107fd33ef8] net_rx_action at ffffffff8000c9b9
 #8 [ffff81107fd33f38] __do_softirq at ffffffff80012551
 #9 [ffff81107fd33f68] call_softirq at ffffffff8005e2fc
#10 [ffff81107fd33f80] do_softirq at ffffffff8006d646
#11 [ffff81107fd33f90] do_IRQ at ffffffff8006d4d6
--- <IRQ stack> ---
#12 [ffff81307fe2fe38] ret_from_intr at ffffffff8005d615
    [exception RIP: mwait_idle_with_hints+102]
    RIP: ffffffff8006b9cf  RSP: ffff81307fe2fee8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000000ff  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007f319bfb6f27   R8: ffff81307fe2e000   R9: 0000000000000013
    R10: ffff8110b8288510  R11: 00000000ffffffff  R12: ffff81306273d040
    R13: ffff81207f8bd7e0  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff64  CS: 0010  SS: 0018
#13 [ffff81307fe2fee8] mwait_idle at ffffffff80056c65
#14 [ffff81307fe2fef0] cpu_idle at ffffffff80048f92
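
For anyone reading the trace without the source to hand: the trap is an
invalid-op raised inside list_del, called from netxen_process_rcv_ring,
which is what you would expect if a consistency check on the list
pointers failed.  The following is only a minimal userspace sketch of
that kind of check, not the kernel's or the driver's code.  The names
struct node, list_init, list_add_tail_node and list_del_checked are
hypothetical, made up for illustration, and unlike the kernel this
version returns an error rather than crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a circular doubly linked list node. */
struct node {
    struct node *next;
    struct node *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
static void list_add_tail_node(struct node *entry, struct node *head)
{
    assert(entry != NULL && head != NULL);
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Unlink entry, but only if both neighbours still point back at it.
 * Returns 0 on success and -1 if the list looks corrupted (including
 * the case where entry was already removed by this function).
 */
static int list_del_checked(struct node *entry)
{
    if (entry->prev == NULL || entry->next == NULL)
        return -1;
    if (entry->prev->next != entry || entry->next->prev != entry)
        return -1;

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;

    /* Poison the pointers so a second delete is caught above. */
    entry->next = NULL;
    entry->prev = NULL;
    return 0;
}
```

A caller on a receive path could treat a -1 return as "log the buffer
index and skip it" rather than taking the machine down, which is the
sort of localize-and-mitigate behaviour the 2/2 patch is aiming for.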

> 4) Test case details:- what type of test is running on the system?
> Just to make sure we also try the same test cases in our
> environment.

No particular type of test, it's an Oracle server in production.

> 5) Server details (Number of CPus, memory etc.) if available.

64 x Xeon X7550 CPUs, 256G RAM

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
