Message-ID: <6380451.UzJpC4z4kH@rofl>
Date:	Thu, 05 Nov 2015 09:25:30 +0100
From:	Patrick Schaaf <netdev@....de>
To:	Yuval Mintz <Yuval.Mintz@...gic.com>
Cc:	netdev <netdev@...r.kernel.org>,
	Greg KH <gregkh@...uxfoundation.org>
Subject: Re: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD

Hi Yuval,

thanks for your notes.

> 4. The patch you've listed merely removes the MCP SCPAD prints, as they're
> unavoidable in certain scenarios; It doesn't actually solve anything.

I also thought so, thanks for confirming. Do you know whether the messages 
might have hidden earlier messages pointing to the real problem?

> Having said that, do you know if anything happened to the setup that
> triggered this? I.e., some configuration change, new utility, etc.?
> Alternatively, did the log show anything else in addition to the MCP SCPAD?

There was no update or configuration activity on the box, it was just running 
along as usual, operating some virtual machines. Uptime was about 22 days. I 
have a second, practically identical server, running pretty much the same 
workload, which is still up + running nicely.

I was a bit overeager to reboot the server (power reset) and didn't even check 
whether I could still log in (shame on me). After the reset, the virtual 
machines all came up fine, so at least filesystem flushing was still working 
properly during the network breakage event.

The systemd journal logged a vast number of the messages I've shown (with lots 
of "missed kernel messages" too), over a period of about 8 seconds. In total, 
including the suppressions, it would have been over 1 million messages during 
those 8 seconds. Running "sort | uniq -c" over the visibly logged messages I see:

   8786 kernel: bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
   8768 kernel: bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
   1583 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
   1743 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth1)]LATCHED attention 0x80000000 (masked)
  36092 kernel: MCP SCPAD:
      1 kernel: RAX: 0000000000000000 RBX: 000000198111fb67 RCX: 00000000ffffffff
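
For anyone wanting to repeat this kind of tally on their own journal export, here is a rough Python sketch of the same idea. It is not the exact command I ran; the assumed line layout ("<Mon DD HH:MM:SS> <host> kernel: <message>") and the function name count_messages are illustrative assumptions only.

```python
from collections import Counter
import re

# Assumption: each journal line starts with "<Mon> <DD> <HH:MM:SS> <host> ",
# followed by "kernel: <message>". Adjust the pattern for other export formats.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+")


def count_messages(lines):
    """Count identical messages after stripping the timestamp/host prefix."""
    counts = Counter()
    for line in lines:
        msg = PREFIX.sub("", line.rstrip("\n"))
        if msg:  # skip empty lines
            counts[msg] += 1
    return counts
```

Feeding it the lines of a saved log (for example, an open file object) and printing the items sorted by message should give a table shaped like the one above, with the count first and the message text after it.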

I'll now backport the "MCP SCPAD" logging suppression patch to the latest 3.14 
kernel and reboot the boxes with it, hoping to learn more if the situation 
recurs.

best regards
  Patrick
