Near misses don’t teach the right lessons
In the 12 years since the start of 2000, we have seen huge numbers of embedded systems connect to the internet. And we are faced with the prospect of every car on the road having its own IP address – and using it to receive real-time information about road conditions from other computers on the network. These networks are gradually coalescing into ever larger and more complex distributed systems that combine computers and machinery.
Despite being highly distributed, these systems are potentially vulnerable to single points of failure because care is not being taken to analyse how these growing systems really fit together, safety consultant Martyn Thomas warned in his keynote at the Safety-Critical Systems Symposium this morning in Bristol, UK.
Part of the problem lies in the successes of the past, with one major culprit being the Y2K ‘bug’ – a bug that failed to bite.
Systems did fail, such as the runway visual range system operated by National Air Traffic Services (NATS). Thomas said the system failed at 4am on New Year’s Day 2000 – but few outside the industry know about it because, with no aircraft in the sky, the failure had no consequences, and all the system needed was a reboot. Even so, he admitted, it was “one we missed”.
Thomas said many more catastrophic failures were avoided because organisations hired engineers to go in and look at the code. But no-one was ever obliged to talk publicly about the problems they found.
“Y2K should have been a warning about tightly coupled systems that could have caused a cascade failure,” Thomas said.
But it wasn’t. In fact, the reverse happened.
Thomas argues that society drew the wrong conclusions from the Y2K situation. Because nothing really bad happened, “public opinion was that it was just money for consultants and the thing to do next time is ignore the warnings”.
No-one wants to spend money on backup systems that never get used, but that reluctance can compromise systems in unexpected ways. Navigation is a case in point.
Thomas worked on the Royal Academy of Engineering (RAE) report on the robustness – or, rather, the lack of it – of navigational aids such as the Global Positioning System (GPS).
“Despite all the risks, almost nothing is being done,” said Thomas. Not only that, the one technology that could act as a backup for GPS or Glonass is being passed around as a political hot potato that no-one wants to hold.
“We don’t pay attention to the big risks with the attention that we pay to smaller risks,” said Thomas.
As with a world-killing asteroid strike, the problem perhaps seems too big to tackle and practically unavoidable. However, some events are far more frequent and potentially massively damaging to society, yet they are ignored because they just aren’t frequent enough to stay in people’s minds.
One event that could disrupt GPS is a repeat of the Carrington Event: a massive coronal mass ejection from the Sun that bathed the Earth in charged particles. Because it happened in 1859, there was little electrical infrastructure to suffer beyond the telegraph network, which was badly disrupted. If it were repeated today, it would probably knock out a host of satellites, particularly those in high orbits, such as the GPS constellation in medium Earth orbit, where the Earth’s magnetic field offers far less protection.
“A Carrington Event happens with the same frequency as the type of tsunami that wrecked Fukushima,” Thomas claimed.
“Consider nothing impossible. Consider all possibilities,” Thomas stressed, quoting Charles Dickens’ David Copperfield.
“We need to watch out for ‘accidental systems’, a phrase I picked up from a meeting of the National Academy of Sciences study group on dependable systems and was used by [SRI International researcher] John Rushby,” Thomas explained.
These are systems that, though designed independently, are linked in such a way that the combination cannot survive the failure of a single key component. They may, for example, all depend in some way on GPS signals. Even though a GPS failure might not wipe any one of them out, the interdependencies between the systems are such that the loss of performance triggers a cascade that brings them all down.
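The failure mode is easier to see in a toy model. The short Python sketch below is purely illustrative – the component names, the dependency graph and the quality thresholds are invented for this article, not taken from Thomas’s talk – but it shows how a degraded GPS signal can drag down every part of an ‘accidental system’ even though no other component has failed outright.

    # Toy model of an 'accidental system' (illustrative only; all names
    # and thresholds are invented). Each component lists the inputs it
    # depends on and the minimum quality it needs from them to keep working.
    COMPONENTS = {
        "gps":      {"deps": [],                  "min_quality": 0.0},
        "timing":   {"deps": ["gps"],             "min_quality": 0.8},
        "comms":    {"deps": ["timing"],          "min_quality": 0.7},
        "dispatch": {"deps": ["comms", "timing"], "min_quality": 0.6},
    }

    def propagate(quality):
        """Degrade components whose inputs fall below their threshold,
        repeating until the system reaches a fixed point."""
        quality = dict(quality)
        changed = True
        while changed:
            changed = False
            for name, spec in COMPONENTS.items():
                if not spec["deps"]:
                    continue
                worst_input = min(quality[d] for d in spec["deps"])
                if worst_input < spec["min_quality"] and quality[name] > worst_input:
                    quality[name] = worst_input  # degradation cascades downstream
                    changed = True
        return quality

    # GPS drops to 50 per cent quality; no other component has failed,
    # yet everything downstream collapses with it:
    print(propagate({"gps": 0.5, "timing": 1.0, "comms": 1.0, "dispatch": 1.0}))
    # {'gps': 0.5, 'timing': 0.5, 'comms': 0.5, 'dispatch': 0.5}

No single component in such a system is badly engineered; the fragility exists only in the combination – which is exactly why it goes unanalysed.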
GPS is not just vulnerable to extreme solar activity but also to relatively simple sabotage. A balloon-borne transmitter with an RF power output of just 5W could, for example, jam every receiver in an area the size of the south of England. Such a jammer would be relatively easy to locate, although it would take time to reach it and stop the transmissions.
But something much more problematic would be a collection of 0.5W transmitters that switched on and off randomly – they would be almost untraceable.
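A rough link budget makes these numbers plausible. The Python sketch below is a back-of-envelope illustration, not Thomas’s own arithmetic: it assumes free-space, line-of-sight propagation from the balloon, a received GPS L1 signal of about -130dBm at the surface, and loss of lock once the jammer exceeds the satellite signal by roughly 30dB (on the order of the C/A code’s processing gain).

    import math

    GPS_SIGNAL_DBM = -130.0                    # approx. received GPS L1 power at the surface
    JAMMER_POWER_DBM = 10 * math.log10(5e3)    # 5W expressed in dBm (~37dBm)
    FREQ_MHZ = 1575.42                         # GPS L1 carrier frequency
    JS_THRESHOLD_DB = 30.0                     # assumed jam-to-signal ratio that breaks tracking

    def fspl_db(distance_km):
        """Free-space path loss in dB for a distance in km and frequency in MHz."""
        return 32.44 + 20 * math.log10(distance_km) + 20 * math.log10(FREQ_MHZ)

    for d_km in (1, 10, 50, 100, 200):
        jammer_at_rx = JAMMER_POWER_DBM - fspl_db(d_km)   # jammer power reaching the receiver
        js = jammer_at_rx - GPS_SIGNAL_DBM                # jam-to-signal ratio
        verdict = "jammed" if js > JS_THRESHOLD_DB else "marginal"
        print(f"{d_km:>4}km: jammer {jammer_at_rx:6.1f}dBm, J/S {js:5.1f}dB -> {verdict}")

Under these assumptions the jammer still overwhelms the satellite signal at 100km, which is consistent with a single balloon-borne 5W transmitter blanketing a region on the scale of southern England.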
The engineers who design these systems might have considered a backup but failed to “overcome the resistance of accountants to invest in redundancy because, to them, that equates to waste,” said Thomas. “‘What do you mean we need another system?’ they ask.”
“eLoran is a no-brainer. It only needs one or two million in funding a year. But it seems not to be happening because no-one wants to put their hand up and say they will take responsibility for it,” Thomas claimed.
But it won’t happen unless a lot of people decide they want it – no single user is going to push governments into funding the backup on its own. And very few see the risks involved.