netdev - RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug


Message-ID: <31AFFC7280259C4184970ABA9AFE8B938CF868E9@avmb3.qlogic.org>
Date:	Thu, 19 Dec 2013 09:11:33 +0000
From:	Manish Chopra <manish.chopra@...gic.com>
To:	David Gibson <david@...son.dropbear.id.au>
CC:	Sony Chacko <sony.chacko@...gic.com>,
	Rajesh Borundia <rajesh.borundia@...gic.com>,
	netdev <netdev@...r.kernel.org>,
	"snagarka@...hat.com" <snagarka@...hat.com>,
	"tcamuso@...hat.com" <tcamuso@...hat.com>,
	"vdasgupt@...hat.com" <vdasgupt@...hat.com>
Subject: RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug

>> >From: David Gibson [mailto:david@...son.dropbear.id.au]
>> >Sent: Tuesday, December 17, 2013 10:53 AM
>> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
>> >Cc: netdev; snagarka@...hat.com; tcamuso@...hat.com;
>> >vdasgupt@...hat.com
>> >Subject: [0/2] netxen: bug fix and diagnostics for possible
>> >(hardware?) bug
>> >
>> >At Red Hat, we've hit a couple of customer cases with crashes in the
>> >netxen driver due to list corruption.  This seems to be very rarely
>> >triggered, and unfortunately the dumps we have don't have enough
>> >information to be certain of the cause, although we have a possible theory.
>> >
>> >I'm suggesting, therefore, a patch to add some sanity checking which
>> >should help to at least localize and mitigate the problem when someone
>> >hits it in future.
>> >Please let me know if there's a better approach to doing this.
>> >
>> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way,
>> >but not one that could cause the symptoms we've seen.
>>
>> David,
>>
>> Having these checks in the data path (Rx path) may have some
>> performance impact. It's better to root-cause it instead of adding
>> sanity checks.
>
>Obviously, but this was the best way I could think of to try narrowing down the
>root cause (at least trying to eliminate driver vs. firmware bug).

David, instead of making permanent changes in the driver, can you please run your modified driver in selected customer environments where this issue is seen?
That may give us some data points on what exactly the issue is, and we can proceed from there.

>
>> We will get back to you on this.
>
>If you have a better idea for locating the root cause, please let me know.  I have
>access to a vmcore which I can poke around in.

We will also try to reproduce the problem in our environment and debug this.
Can you please give some details?

1) What driver and firmware versions are used?
2) Which operating system and kernel version?
3) Please also send the vmcore, with a backtrace if available, which may give some idea of what triggers this issue.
4) Test case details: what type of test is running on the system? We want to make sure we try the same test cases in our environment.
5) Server details (number of CPUs, memory, etc.), if available.

Thanks,
Manish
  
