Engineering Lessons

In today’s web-centric world, where everything from smartphones and televisions to video game systems is connected to the internet, it is easy to forget that an entire specialty of software development is devoted to safety critical systems. Such systems, from the airbags and anti-lock brakes in our cars to large industrial plant control systems, are expected to work without issue and to jump into action to save us from unsafe conditions. Safety critical systems perform their functions through a combination of hardware and software. Even if the systems we develop are not themselves safety critical, the lessons of the safety critical field are best learned once, from the experience of others, rather than relearned the hard way.

The Therac-25 was one such project that provided important lessons to the field of software engineering.

A radiation therapy machine, the Therac-25 was a computer controlled safety critical system built on the foundations of older radiation therapy machines. Numerous software bugs and user interface problems led to dangerous and deadly situations involving the machine.

How could such events happen?

Concurrent programming errors, of which race conditions are the best-known example, are extremely difficult to debug. They can appear seemingly at random and then disappear just as unpredictably. Novice and experienced programmers alike find it difficult to get to the bottom of race conditions.
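To make the idea concrete, here is a minimal Python sketch of a lost-update race and one common way to close it. It has nothing to do with the actual Therac-25 code; the `Counter` class and the `run` helper are hypothetical names invented for this illustration.

```python
import threading


class Counter:
    """Shared counter used to illustrate a lost-update race (hypothetical)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read-modify-write with no synchronization: two threads can read the
        # same old value, and one of the two updates is then lost.
        current = self.value
        self.value = current + 1

    def safe_increment(self):
        # The lock makes the read-modify-write step atomic with respect to
        # every other thread that takes the same lock.
        with self._lock:
            self.value += 1


def run(increment, threads=4, per_thread=10_000):
    """Call `increment` from several threads and wait for all of them."""
    def worker():
        for _ in range(per_thread):
            increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


counter = Counter()
run(counter.safe_increment)
print(counter.value)  # 40000
```

Passing `counter.unsafe_increment` to `run` instead may or may not lose updates on any given run, which is exactly what makes this class of bug so hard to reproduce and confirm.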

The Therac-25 software contained numerous concurrency-related bugs in a software stack reused from older radiation therapy machines. In those older machines, the bugs were masked by hardware interlocks: if a machine detected a condition that would endanger a patient or operator, the interlocks prevented its beam from activating. For the Therac-25, engineering over-confidence in the correctness of the software led to a design without any hardware interlocks. Suddenly, the preexisting race conditions were no longer masked.

Such a situation has many institutional and engineering causes. At the root was simple engineering over-confidence: the assumption that the reused software stack was free of bugs. This seemingly small and straightforward error led to other, more serious ones. Because the software was assumed safe, it seemed logical to remove the hardware interlocks as unnecessary. Because the software was assumed safe, it did not seem necessary to test the system until it was fully assembled at the destination hospital. Because the software was assumed safe, the Therac-25 was presumed not to be the cause of any incident, and any problem was attributed to operator error.

Much of the world has changed since the days of the Therac-25. Engineering standards have grown in complexity, and medical and safety critical systems are significantly more mature problem domains than they were in the 1980s. Devices that fall under regulatory umbrellas adhere to numerous standards, such as IEC 62304, to ensure that safety critical software is developed using a known and trusted methodology. What may never change are the lessons taught by the Therac-25 incidents:

  • Haphazard concurrent programming can lead to problems that are difficult to diagnose, or even to confirm exist.
  • The correctness of software for one application can depend on the hardware that runs it, meaning that software and hardware are often more closely linked than we may initially believe.
  • If a software failure condition is detected, the software should signal errors to the user in a manner appropriate for the detected problem. It should never fail silently.
  • Review and testing are natural steps in the development of any system and should never be skipped.
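The third lesson, never failing silently, can be sketched in a few lines of Python. This is an illustrative example only, under the assumption of a simple range check; `SafetyCheckError`, `validate_setting`, `start_operation`, and the `dose` limits are all hypothetical names and values, not taken from any real device.

```python
class SafetyCheckError(Exception):
    """Raised when a safety check fails; the message says what failed and why."""


def validate_setting(name, value, low, high):
    """Return `value` if it lies in [low, high]; otherwise raise loudly."""
    if not (low <= value <= high):
        raise SafetyCheckError(
            f"{name}={value} is outside the allowed range [{low}, {high}]; "
            "operation not started"
        )
    return value


def start_operation(settings, limits):
    """Validate every setting before doing anything else."""
    for name, value in settings.items():
        low, high = limits[name]
        validate_setting(name, value, low, high)
    return "started"


limits = {"dose": (0, 200)}  # hypothetical limits, for illustration only
print(start_operation({"dose": 150}, limits))  # started

try:
    start_operation({"dose": 25_000}, limits)
except SafetyCheckError as err:
    # The caller is told what went wrong in terms it can act on.
    print(f"refused: {err}")
```

The point of the design is that a failed check stops the operation and produces a message a person can act on, instead of a cryptic code or no signal at all.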

Much of the new software written today is indeed not safety critical. However, the problems that we face in the non-safety critical world could benefit from these lessons. At Baseline Softworks, I build software with the following in mind:

  • Concurrent architectures stick to well-known paradigms, such as the process-thread model or the actor model.
  • The execution environment is defined as part of requirements gathering, and every bug report must document the operating environment in which the bug was observed.
  • All unexpected situations are communicated to calling code via exception semantics, when available, and program state is rolled back to a safe point during an exceptional condition. If there is no way to roll back program execution to the last safe state, then continued execution is prevented in a manner consistent with requirements.
  • Work products are reviewed as often, and in as much depth, as the problem domain at hand requires.
  • Both manual and automated testing strategies are defined for the project, from unit testing through integration testing and user acceptance testing. This extends into regression testing and the use of continuous integration tool sets to ensure a healthy, long-lived code base.
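As one possible illustration of the exception-and-rollback practice in the list above, the following Python sketch restores a snapshot of program state when a block fails and then re-raises the error so the caller still hears about it. The `rollback_on_error` helper and the `config` dictionary are hypothetical, and a real system would need a strategy suited to its own state and requirements.

```python
import copy
from contextlib import contextmanager


@contextmanager
def rollback_on_error(state):
    """Restore `state` (a dict) to its entry snapshot if the block raises."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        # Roll back to the last known-safe state, then re-raise so the
        # calling code still learns that something went wrong.
        state.clear()
        state.update(snapshot)
        raise


config = {"mode": "idle", "power": 0}  # hypothetical program state

try:
    with rollback_on_error(config) as cfg:
        cfg["mode"] = "active"
        cfg["power"] = 9000
        raise ValueError("power exceeds hypothetical limit")
except ValueError as err:
    print(f"rolled back after: {err}")

print(config)  # {'mode': 'idle', 'power': 0}
```

If no safe snapshot can be restored, the equivalent move is to stop: let the exception propagate and prevent further execution in whatever way the requirements call for.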