Message-ID: <6380451.UzJpC4z4kH@rofl>
Date: Thu, 05 Nov 2015 09:25:30 +0100
From: Patrick Schaaf <netdev@....de>
To: Yuval Mintz <Yuval.Mintz@...gic.com>
Cc: netdev <netdev@...r.kernel.org>,
Greg KH <gregkh@...uxfoundation.org>
Subject: Re: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD
Hi Yuval,
thanks for your notes.
> 4. The patch you've listed merely removes the MCP SCPAD prints, as they're
> unavoidable in certain scenarios; It doesn't actually solve anything.
I also thought so, thanks for confirming. Do you know whether the messages
might have hidden earlier messages pointing to the real problem?
> Having said that, do you know if anything happened to the setup that
> triggered this? I.e., some configuration change, new utility, etc.?
> Alternatively, did the log show anything else in addition to the MCP SCPAD?
There was no update or configuration activity on the box; it was just running
along as usual, operating some virtual machines. Uptime was about 22 days. I
have a second, practically identical server, running pretty much the same
workload, which is still up + running nicely.
I was a bit overeager to reboot the server (power reset) and didn't even check
whether I could still log in (shame on me). After the reset the virtual
machines all came up fine, so at least filesystem flushing was still working
properly during the network breakage event.
The systemd journal logged a vast number of the messages I've shown (with lots
of "missed kernel messages" too) over a span of about 8 seconds. In total,
including the suppressed ones, there would have been over 1 million messages in
those 8 seconds. Running "sort | uniq -c" over the visibly logged messages, I see:
   8786 kernel: bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
   8768 kernel: bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
   1583 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
   1743 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth1)]LATCHED attention 0x80000000 (masked)
  36092 kernel: MCP SCPAD:
      1 kernel: RAX: 0000000000000000 RBX: 000000198111fb67 RCX: 00000000ffffffff
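For anyone wanting to repeat this kind of tally without the shell, here is a minimal Python sketch of the same sort-and-count pass (comparable to `sort | uniq -c`). It assumes the timestamp and hostname prefix have already been stripped from each journal line so that identical messages compare equal; the function names and sample lines are purely illustrative and are not the commands actually used on the box.

```python
from collections import Counter


def count_messages(lines):
    """Return (message, count) pairs, sorted by message text.

    Assumes each line is already reduced to its message part
    (no timestamp, no hostname), so duplicates are exact matches.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(counts.items())


def format_counts(pairs):
    """Render pairs in a count-first layout similar to `uniq -c`."""
    return [f"{count:7d} {message}" for message, count in pairs]


# Hypothetical sample input; real input would come from the journal export.
sample = [
    "kernel: MCP SCPAD:",
    "kernel: bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:",
    "kernel: MCP SCPAD:",
]

for row in format_counts(count_messages(sample)):
    print(row)
```

Note that this only counts lines that actually reached the journal; messages dropped by rate limiting are not recoverable this way.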
I'll now backport that "MCP SCPAD" logging suppression patch to the latest
3.14 kernel and reboot the boxes with it, hoping to learn more if the
situation recurs.
best regards
Patrick