Date: Fri, 21 Nov 2003 17:54:41 -0500
From: "Russ" <Russ.Cooper@...on.ca>
To: "Geoff Shively" <gshively@...x.com>
Cc: <bugtraq@...urityfocus.com>
Subject: RE: DOE Releases Interim Report on Blackouts/Power Outages, Focus on Cyber Security


Well, they did specifically discount both the Internet worm activity current at the time and terrorist activity as having any part in the blackout. As for the RTU failures, FE told investigators they believed those occurred because the RTUs "started queuing and overloading the terminals buffers". Given that the EMS Alarm program had already crashed at this stage, it's feasible to see a real-time reporting terminal not knowing what to do (other than to page an FE IT person) when its host can't accept its input. Since they refer to the RTUs' connectivity both as "dial-ups" and as "data links", it's hard to say what they were. Nothing else in the EMS system failed until 14:54, when both primary and backup EMS servers were down, so it's unlikely that any "network" connectivity between the RTUs and the EMS was interrupted by the problems with the EMS systems until that time. Ergo we're left with "comms" problems between RTU and EMS that led some FE personnel to describe them as "network" problems. It may all simply have been that the RTUs had stalled and were sending no "comms".

Interesting that a page went to FE IT folks when the RTUs stopped, but nothing went to them when the EMS Alarm program "stalled".

I think the refresh rate of the EMS consoles isn't actually a factor. The alarm function "stalled", or "froze", and did not produce any alarms. That the EMS consoles were being refreshed after 59 seconds didn't alter the fact that operators weren't seeing new alarms. The lack of alarms, coupled with the arrogance of staff who insisted that reports by others were mistaken, led to critical failures in line loading which ultimately left them unable to recover.

During the same period of time, MISO's State Estimating system, which was receiving telemetry from much of FE's network, experienced a normal mismatch in its load calculations. A manual process is used to correct this, and it was done within ~30 minutes of its first occurrence near the FE problem time-frame. An operator at MISO, however, left the estimating system in manual mode and went to lunch. It was put back into automatic mode 93 minutes later, at which time it again produced a mismatched solution... so it had to be manually corrected again. It wasn't back in automatic mode until 16:04. Hard to say whether it would have made a big difference had it been running in automatic mode during this whole time. Probably yes, but given FE's adamancy that they had good data, they may have spent an equal amount of time arguing over who knew what.

If either of these events had occurred independently, it's likely the blackout could have been avoided.

If FE's operators had not been so sure of themselves, it's likely the blackout could have been avoided.

Finally, FE's IT staff took 54 minutes to complete their first attempt at recovering the alarm process, this after both primary and backup servers had failed (14 minutes after both had failed). They were obviously relying on the failure not transferring from the primary to the backup. 34 minutes after the first warm reboot, and 4 minutes before the EMS crashed again, they discussed with FE operators the possibility of doing a complete cold boot, because only then were they informed that the alarm function (still) wasn't running. FE operators dissuaded the IT staff from doing so, fearing they'd have less data than they already had (arrogance again; they had already demonstrated their inability to perform adequately with the "less" data).

Unfortunately, nobody tells us how long a cold boot would actually have taken, and FE's IT staff say they didn't find out that it was the only way to recover the alarm system until after the blackout (meaning the warm boot was a useless effort in the first place).

And during all this time there were those damn trees!!!

MISO failed to adequately warn, and FE failed to adequately control its security space (physically and electronically). And it all happened on a hot August afternoon.

Cheers,
Russ - NTBugtraq Editor

