netdev - Re: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD


Message-ID: <6380451.UzJpC4z4kH@rofl>
Date:	Thu, 05 Nov 2015 09:25:30 +0100
From:	Patrick Schaaf <netdev@....de>
To:	Yuval Mintz <Yuval.Mintz@...gic.com>
Cc:	netdev <netdev@...r.kernel.org>,
	Greg KH <gregkh@...uxfoundation.org>
Subject: Re: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD

Hi Yuval,

thanks for your notes.

> 4. The patch you've listed merely removes the MCP SCPAD prints, as they're
> unavoidable in certain scenarios; It doesn't actually solve anything.

I also thought so, thanks for confirming. Do you know whether the messages 
might have hidden earlier messages pointing to the real problem?

> Having said that, do you know if anything happened to the setup that
> triggered this? I.e., so configuration change, new utility, etc.?
> Alternatively, did the log show anything else in addition to the MCP SCPAD?

There was no update or configuration activity on the box, it was just running 
along as usual, operating some virtual machines. Uptime was about 22 days. I 
have a second, practically identical server, running pretty much the same 
workload, which is still up + running nicely.

I was a bit overeager to reboot the server (power reset) and didn't even check 
whether I could still log in (shame on me). After the reset the virtual 
machines all came up fine, so at least filesystem flushing was still working 
properly during the network breakage event.

The systemd journal logged a vast number of the messages I've shown (with lots 
of "missed kernel messages" too), over a period of about 8 seconds. In total, 
including the suppressed ones, there would have been over 1 million messages 
in those 8 seconds. Running "sort | uniq -c" over the visibly logged messages I see:

   8786 kernel: bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
   8768 kernel: bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
   1583 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
   1743 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth1)]LATCHED attention 0x80000000 (masked)
  36092 kernel: MCP SCPAD:
      1 kernel: RAX: 0000000000000000 RBX: 000000198111fb67 RCX: 00000000ffffffff
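
For anyone who wants to produce the same kind of tally from an exported journal, here is a minimal Python sketch. It assumes each line has the form "<timestamp> <host> kernel: <message>" and keeps only the part from "kernel:" onward, so that timestamps do not make identical messages look distinct. The function names and the line layout are my assumptions for illustration, not something taken from the actual logs.

```python
from collections import Counter


def tally_messages(lines):
    """Count identical kernel message bodies, similar to `sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        # Assumed layout: "<timestamp> <host> kernel: <message>".
        # Drop everything before "kernel:"; skip non-kernel lines entirely.
        idx = line.find("kernel:")
        if idx == -1:
            continue
        counts[line[idx:].strip()] += 1
    return counts


def format_tally(counts):
    """Render counts sorted by message text, with a right-aligned count column."""
    return [f"{n:7d} {msg}" for msg, n in sorted(counts.items())]


if __name__ == "__main__":
    import sys

    # Read an exported journal from stdin and print the tally.
    for row in format_tally(tally_messages(sys.stdin)):
        print(row)
```

Saved under a hypothetical name such as tally.py, it could be fed with something like "journalctl -k | python3 tally.py".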

I'll now backport that "MCP SCPAD" logging suppression patch to the latest 
3.14 kernel and reboot the boxes with it, hoping to learn more if the 
situation recurs.

best regards
  Patrick