The Delicate Software Supply Chain

One of the largest trust obstacles facing software developers today is the mitigation and prevention of software supply chain attacks. A software supply chain attack is any compromise of the tools, dependencies, or distribution channels used to build and deliver software that enables malicious behavior to take place. Beyond being a simple security issue, a supply chain attack is particularly dangerous because its effects can be deliberate, clandestine, and able to go unnoticed for a very long time. Supply chain attacks on computer code or executable files are nothing new, but their frequency and severity are dramatically increasing.

Securing software from supply chain attacks is a genuinely hard problem. According to a report published by the Director of National Intelligence:

“Attackers may seek to exploit tools, dependencies, shared libraries, and third-party code in addition to compromising the personnel and infrastructure of developers and distributors.”

There are many different attack surfaces and methods used to compromise the software supply chain. A single solution will not comprehensively solve this growing problem.

Developers have adopted code commit signing as an important step in mitigating the supply chain attack surface. Using cryptography, commit signing attests to the origin of changes within source control systems. Code signing can additionally help end users determine whether a binary or executable is as the developer intended and has not been altered. If a recipient trusts a particular cryptographic key or certificate, then software and code signed with that key or certificate carry certain provenance assurances. Combined with proper key or certificate identity verification, much stronger trust guarantees are possible.
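As a concrete example, Git can be configured to sign commits and tags with a GPG key. The key ID below is a placeholder; substitute the ID of your own (ideally hardware-backed) key:

```shell
# Tell Git which key to sign with (0xDEADBEEF is a placeholder key ID).
git config --global user.signingkey 0xDEADBEEF

# Sign every commit and tag by default.
git config --global commit.gpgsign true
git config --global tag.gpgSign true

# Or sign explicitly, per commit and per tag.
git commit -S -m "Signed change"
git tag -s v1.0 -m "Signed release tag"

# Verify signatures on the receiving end.
git verify-commit HEAD
git log --show-signature -1
```

Verification only means something if the verifier has obtained and trusts the signer's public key through an out-of-band channel.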

I have used hardware token PGP keys for some time to sign project artifacts, invoices, receipts, and other communications. I have also recently begun signing my source code commits and tags using that same PGP key. I believe that signing my work with hardware token-generated cryptographic keys, combined with other workflow changes, will lessen the total attack surface for software supply chain attacks. It is a small but important step for all developers to take, and it will make a meaningful difference if enough developers take it.

For information on specific PGP keys that Baseline Softworks, LLC uses to sign software or source code, check out the End-To-End Encryption page. If you are interested in creating your own hardened PGP key, then read the tutorial by Eric Severance (esev.com) for a good starting point.

Engineering Lessons

In today’s web-centric world, where everything from smartphones to televisions to video game systems is connected to the internet, it can be easy to forget about an entire specialty of software development devoted to safety critical systems. Such systems, from the airbags and anti-lock brakes in our cars to large industrial plant control systems, are expected to work without issue and jump into action to save us from unsafe conditions. Safety critical systems perform their functions through a combination of hardware and software. Even if the systems we develop are not themselves safety critical, any lessons learned from the safety critical field are best learned once.

The Therac-25 was one such project that provided important lessons to the field of software engineering.

A radiation therapy machine, the Therac-25 was a computer-controlled safety critical system built on the foundations of older radiation therapy machines. Numerous software bugs and user interface problems led to dangerous and deadly situations involving the machine.

How could such events happen?

Race conditions, a class of concurrency programming errors, are extremely difficult to debug. These errors can appear seemingly at random, then disappear just the same. Both novice and experienced programmers find it difficult to get to the bottom of race conditions.
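A minimal Python sketch of why such bugs are slippery: several threads performing an unsynchronized read-modify-write on shared state can silently lose updates, while the same work under a lock is always correct. The lost-update count varies from run to run, which is exactly what makes the bug hard to reproduce.

```python
import threading

N_THREADS = 8
N_INCREMENTS = 100_000

# Unsafe: read-modify-write on shared state with no synchronization.
unsafe_total = 0
def unsafe_worker():
    global unsafe_total
    for _ in range(N_INCREMENTS):
        tmp = unsafe_total      # read
        unsafe_total = tmp + 1  # write: another thread may have updated in between

# Safe: the same loop, guarded by a lock.
safe_total = 0
lock = threading.Lock()
def safe_worker():
    global safe_total
    for _ in range(N_INCREMENTS):
        with lock:
            safe_total += 1

def run(worker):
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(unsafe_worker)
run(safe_worker)
print(unsafe_total)  # often less than 800000, and different from run to run
print(safe_total)    # always 800000
```

Worse, adding logging or a debugger changes the timing and can make the unsafe version appear to work, which is why race conditions so often survive testing.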

The software of the Therac-25 had numerous concurrency-related bugs present within a software stack reused from older radiation therapy machines. These bugs were masked in the older machines through the use of hardware interlocks: if an older machine detected a condition that would place a patient or operator in danger, its hardware interlocks would prevent the beam from activating. Engineering over-confidence in the correctness of the software led to the Therac-25 being designed without any hardware interlocks, and suddenly these preexisting race conditions were no longer masked.

Such a situation has many institutional and engineering causes. Simple engineering over-confidence is to blame for assuming that the reused software stack was free of bugs. This seemingly simple and straightforward error led to other, more serious errors. Since the software was assumed safe, it was logical to remove the hardware interlocks, which were no longer needed. Since the software was assumed safe, it wasn’t necessary to test the machine until it was fully assembled at the destination hospital. Since the software was assumed safe, it was assumed that the Therac-25 was not the cause of any incident and that any problems must be operator error.

Much of the world has changed since the days of the Therac-25. Engineering standards have grown in complexity, and medical and safety critical systems are significantly more mature problem domains than they were in the early 1980s. Devices which fall under regulatory umbrellas adhere to numerous standards, such as IEC 62304, to ensure that safety critical software is developed using a known and trusted methodology. What may never change are the lessons taught by the Therac-25 incident:

  • Haphazardly done concurrent programming can lead to problems that are difficult to diagnose or even confirm the existence of.
  • The correctness of software for one application can depend on the hardware that runs it, meaning that software and hardware are often more closely linked than we may initially believe.
  • If a software failure condition is detected, the software should signal errors to the user in a manner appropriate for the detected problem. It should never fail silently.
  • Review and test are natural steps in the development of any system and should never be skipped.

Much of the new software written today is indeed not safety critical. However, the problems that we face in the non-safety critical world could benefit from these lessons. At Baseline Softworks, I build software with the following in mind:

  • Any concurrent architectures will stick to well-known paradigms, like the process-thread model or the actor model.
  • The execution environment is defined as part of requirements gathering, and every bug report must document the operating environment in which the bug occurred.
  • All unexpected situations are communicated to calling code via exception semantics, when available, and program state is rolled back to a safe point during an exceptional condition. If there is no way to roll back program execution to the last safe state, then continued execution is prevented in a manner consistent with requirements.
  • Reviews of work products are performed as often as is necessary for the problem domain at hand, and the reviews are also of sufficient depth.
  • Both manual and automated testing strategies are defined for the project, from unit testing through integration testing and user acceptance testing. This extends into regression testing and the use of continuous integration tool sets to ensure a healthy, long-lived code base.
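The exception-and-rollback point above can be sketched in Python. The names and the simulated fault are illustrative, not taken from any real system:

```python
import copy

class SafeState:
    """Mutable state whose updates either complete fully or roll back."""
    def __init__(self, initial):
        self._state = initial

    @property
    def value(self):
        return self._state

    def apply(self, update):
        # Snapshot the last known-safe state before attempting the update.
        snapshot = copy.deepcopy(self._state)
        try:
            update(self._state)
        except Exception:
            # Restore the snapshot, then re-raise so the caller sees the failure.
            self._state = snapshot
            raise

state = SafeState({"beam_energy": 0})

def faulty_update(s):
    s["beam_energy"] = 25_000                      # partial change applied...
    raise ValueError("consistency check failed")   # ...then a fault is detected

try:
    state.apply(faulty_update)
except ValueError:
    print("update rejected")  # the failure is loud, never silent

print(state.value)  # {'beam_energy': 0}
```

The key property is that the partial change never becomes visible: the caller either sees the update complete or sees the last safe state together with a raised exception.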

Crashing into the Technological Wall

Imagine finding the perfect security solution for your corporate VPN. It is low cost or free, it has been vetted through the process of nearly two decades of open source review, and it is supported on all of your company’s operating systems. You can integrate an internal certificate authority, complete with certificate revocation lists, and include PAM single sign-on authentication modules for a full solution that appears to satisfy your information security policy’s confidentiality, integrity, and availability requirements.

Imagine, then, that your beta test discovers that it can’t reliably scale beyond a dozen active users.

OpenVPN was developed over many years through the efforts of talented contributors. It is one of only a handful of reputable free or open source solutions addressing the VPN market: a virtual private network, often tunneled over the internet, links remote workstations to a corporate network in a secure manner. VPN software, much like other security software, benefits strongly from long-term community acceptance and review. The more review security software has received, the more robust it is thought to be at resisting known types of intrusion. The OpenVPN project fits this bill handily. However, other important considerations were neglected throughout its development.

In the mid-2010s, industry migrated to a minimum of 2048-bit RSA public key cryptography, along with a strong preference for 256-bit symmetric encryption. The expansion of data-intensive and latency-sensitive use cases, such as video, teleconferencing, and the transfer of large data sets, vastly increased the load that VPN server solutions are required to handle. Parallel to the tightening of minimum security requirements, state-of-the-art computer architectures continued a decade-long trend toward multi-threaded and multi-processor systems in pursuit of increased performance. Raw single-processor performance ceased to be a meaningful metric, as gains in single-threaded workloads paled in comparison to gains offered by heterogeneous and homogeneous multi-processor systems. The OpenVPN server, a monolithic C application, runs within a single process context and is not multi-threaded. The advances in multi-processor systems cannot be realized by an OpenVPN server without running multiple OpenVPN instances (often one per end user, each hosted on a different TCP or UDP port) and setting up complex routing rules within the operating system.

Session key negotiation and the transfer of large volumes of data are the two main load drivers on a VPN server. As security requirements tighten, more and more of a single processor core’s execution time is spent on these tasks in an OpenVPN instance. The monolithic design means that key renegotiation for one user impacts the performance of every other logged-in user. While this may not be noticeable with two or three end users, at some point an OpenVPN server instance will hit performance and scaling problems that render it unfit for certain use cases. Such issues may become the norm in the near future as industry further tightens security expectations to a minimum of 3072-bit RSA public keys.

Addressing the shortcomings of a legacy design can be expensive or even impossible. This is true whether that legacy design is due to a lack of discernible architecture, the desire to retain tried-and-tested designs for security reasons, or not having a unified set of standards and practices among contributors. Community driven development excels in small, incremental steps, as described on the OpenVPN community wiki. Large coordinated architectural evolution, on the other hand, is extremely difficult under a development model that lacks a strong central authority.

The OpenVPN project contributors are addressing this. Efforts on the OpenVPN 3.0 rework have been ongoing since at least the May 2010 road map meeting. Their task is neither small nor trivial, but given enough resources and time they will certainly reach feature parity with the legacy code. Their burden is also not one to carelessly ignore, as it extends well beyond mere technical challenge. The reluctance of developers to change tried-and-tested code is a triviality compared to larger market forces. Who is to say that industry would embrace a rework at all? Wouldn’t the vast majority of end users stick with the older, more widely tested version? Without a plan to convincingly demonstrate the security of a redesigned version, market rejection of the new product is more likely than not. Once the old monolithic OpenVPN application reaches the absolute end of its market usefulness, what edge would a new but unvetted OpenVPN application have over any other more mature, better-tested solution?

Projects expected to live far into the future must be built to scale and evolve along as many potential paths as computer hardware and software development might reasonably take. The long-lived project need not immediately make use of features like multi-threading. Rather, it must be designed and built around the possibility of multi-threading becoming the only way to increase performance or alleviate scaling issues. Tried-and-tested software development truths, such as maintaining high cohesion and low coupling between components, adequate separation of concerns between modules, proper documentation of system architecture, and enforcement of adequate standards, are often enough to make transitions like the one faced by the OpenVPN project significantly less painful.
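One way to keep that flexibility open, sketched here in Python under the assumption of an executor-style boundary: call sites written against a small task-submission interface today can adopt a thread pool later without modification. The names here are illustrative.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class SerialExecutor:
    """Runs tasks immediately on the calling thread, yet exposes the same
    submit() interface as the concurrent.futures executors."""
    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

def handle_client(client_id):
    # Stand-in for per-client work such as a session key negotiation.
    return client_id * client_id

def serve(executor, clients):
    # Call sites depend only on the executor interface, not on the
    # concurrency model behind it.
    futures = [executor.submit(handle_client, c) for c in clients]
    return [f.result() for f in futures]

clients = list(range(8))
serial_results = serve(SerialExecutor(), clients)
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_results = serve(pool, clients)

print(serial_results == threaded_results)  # True: same behavior, swappable concurrency
```

The concurrency decision is confined to one construction site; swapping the executor is a one-line change rather than a rework of every caller, which is precisely the kind of transition a monolithic design makes painful.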

That holds, of course, only when this guidance is observed right from the project start, as is the case for all Baseline Softworks projects: this lesson is sincerely taken to heart, and every project is approached with future-proofing and industry changes in mind.