Software Complexity comes from reification of complex domains

January 10, 2024

Has this ever happened to you: You come across some piece of software and want to take a look at its source code. You open its codebase and it is sprawling - many layers of abstraction and much more complicated than what you expected. You think you can implement the concept in much fewer lines of code and in a much simpler fashion. If you attempt to do so, you may get the core concept (the “kernel”) implemented pretty quickly. But more often than not, as your software molds to occupy the space of the software it is replacing, it’ll reach a similar level of complexity.

Why is that? I think the core-reason is that complexity arises from fleshing out little abstract concepts into full fledged systems. What does that even mean?

A lot of this stuff is obvious - ultimately your software is a realization of an idea - if the idea/domain it represents is complex - the software becomes complex. However, this is not always the case - sometimes the engineering aspect makes things more complex. The core idea may be simple, but as you flesh out the different engineering corners, you face the complexity. For instance, your software probably has certain engineering goals. These goals may be:

It should be lock free
It prioritizes low-latency
It should have good error-reporting
It should be a good “citizen” (relevant when your software interacts with other computers - eg: web scrapers or your TCP/IP stack)
It should be extremely simple
It should be allocation free
It should be able to run in extremely constrained environments
It shouldn’t have any dependencies

Notice that the scope of these goals varies quite a bit, from implementation details (eg: lock free) to broader ideas. In either case, these goals place constraints on the codebase that ultimately shape what the software looks like. Anything new that you introduce to the codebase, even a common idea, gets shaped into a unique thing as it is weaved through all the constraints and goals that govern your codebase. This is because software is ultimately shaped by a working set of ideas you maintain while developing the software.

The more detailed your goals-list is, the more complex your software gets. Thus, the inherent software complexity consists of the domain you’re working in combined with your engineering goals.

We can now answer our original question by looking at it through this lens. The reason the software you saw was much more complicated than you expected was probably because either:

You have a simplistic model of the underlying problem domain, which in turn leads to a simpler implementation (which is however not optimal)
The project has engineering goals which make the implementation of this simple idea much more complicated.

In general, often your code is not performant because its underlying model of the hardware is too simplistic (eg: same costs for all memory access operations, single execution unit, etc). Once it is rewritten to use all of the bells and whistles of modern hardware, it is usually more complicated than the simpler version. This level of optimization is usually not needed universally across the codebase, only in the hotspots. Thus, a performant codebase may have different parts which utilize the hardware to different levels (i.e their underlying hardware models vary in fidelity).

Let us now go into some examples by domain to see where simple and complex software differ.

Compilers

What’s a big difference between toy compilers and production compilers? Well, there is the codegen efficiency, but a less thought about one is error-reporting.

A toy compiler may handle some errors, but it is usually an afterthought. In production compilers, often it is an actively thought about goal that resources are allocated to. It informs a lot of decisions about how the compiler is written - instead of discarding information from the earlier layers, you need to maintain information from earlier layers of the compilers to have good errors in later stages. Production compilers have a much more sophisticated model of errors than toy compilers.

Another one is economy of file operations. Toy compilers usually just read in the entire file into memory and start compiling it, whereas production compilers are usually much smarter about this. They have entire components dedicated to file management and try to strike a balance between reading too much from the file and memory usage.

Toy compilers also don’t give much thought about compilation speed, but production quality compilers need to care about that. A compiler needs to care even more about this if it is a JIT compiler, since JITs by nature need to compile really fast, even if it it means foregoing some optimizations.

This is why you’ll often see an entire task submission system implemented within production quality compilers.

This is a good example of what I want to highlight: The complexity comes from actively thinking about an abstract piece of the process. Production compiler writers have thought a lot about error reporting, compile-time performance, file handling, etc. At a very abstract level, these are doing the same things:

Load file -> Tokenize -> Parse -> AST Simplification -> IR -> Optimization -> Assembly

However, the devil is in the details (as you might expect) - these components are much more flushed out in the production compilers.

Game Engines

Consider a game engine and compare it to a simple standalone game. The game itself would have a simple engine hidden inside of it, which does at least the following things:

A way to load assets
A game loop
Physics logic

A game engine like Unreal or Unity fleshes out these things to a great detail and has abstractions for components that plug into its system.

Asset management is something that production game engines have have lots of developer cycles dedicated to - eg: integrating with other software like Maya, flat naming mechanisms, hot-relading ec.

Physics Engines probably are the most transparent w.r.t the complexity of their underlying domain. Simple games have simple physics models. Games which model really complex worlds and have realistic / complex interactions have complex codebases (not to mention the overarching engineering goals of pushing out as many frames as possible without delays)

Matrix Multiplication

Matrix multiplication has been studied to death. Optimizing matrix multiplication is an excellent case study into how even the simplest things can become very complex as you introduce more and more optimizations.

Instead of multiplying the whole matrix, you can make use of vector instructions and multiply smaller chunks of the matrix, which then opens a whole can of worms such as the optimal chunking order of matrices and interplay of the various operations (including memory access operations) with each other.

Instead of considering one matrix multiplication in isolation, you may consider a pipeline of matrix operations and may try to parallelize different aspects of the pipeline.

Aspects like memory access, arithmetic operations, etc which were small parts of the naive version become their own subjects of study for a high performance version.

Build systems

At the most naive level, you just issue one big compile command with all of your source files. This is obviously inefficient, so you have makefiles that have some mechanism for describing dependencies and independent units. But there’s still manual work w.r.t passing flags in a compiler-agnostic fashion, taking care of different architectures, so you have fancier build tools like cmake, meson and bazel which have more involved dependency systems. They introduce a level of abstraction that interfaces with the compiler and hides the complexity of interacting with multiple compilers and being able to support multiple architectures.

Better build systems have more sophisticated models of (but not limited to):

Dependencies
Compiler flags
Different kinds of targets (executables vs libraries vs autogenerated files)

TCP/IP

Congestion Control is an excellent example of the software trying to play a part in the system outside it.

A naive implementation of an abstraction over packets may just throw packets out on the network interface and wait for packets back.

But for being performant, the simplistic model of the network card & the outer network being perfect needs to be refined. Networks can be congested and we need to slow down. They can become uncongested and we can speedup. Thus the performant software is aware of the control system aspect of the system it is a part of. It would need to measure how much it sends out and needs to be smart about when it measures and the accuracy with which it measures. Thus, as it reifies its internal model of the world, it becomes more complex, as the underlying system it is interacting with is complex.

Reliability in databases

The TigerBeetle database has an overarching engineering goal of: never. losing. data.

Many databases have a simplistic model of reliability. They assume that the operating system is perfect that that its syscalls perform correctly. However, research has shown that syscalls typically used to ascertain whether an in-memory file has been written out to disk can sometimes lie to you. Thus a software can be lulled into a false sense of security of having stored data into persistent storage.

TigerBeetle has a much more sophisticated model of data corruption and actively works to store data redundantly and attempt to detect and restore corruption. This philosophy permeates to all aspects of its code, which boils down to having redundancy and dotting your i’s and crossing your t’s. This obviously means that the database code that deals with these aspects would be more complicated than something that just assumes that fsync works.

Implementing a standard

Another transparent example of domain complexity is the implementation of standards. If the standard you’re implementing is complex, your software becomes complex.

Example: The C++ standard is huge and a lot is expected from a standards-compliant compiler, which is why it is really hard to make a fully C++ compliant compiler - Note that this complexity is purely for standards compliance; making it performant is a whole other goal that in turn contributes a lot to complexity.

Google Borg

A lot of the inspiration behind this post comes from an excellent Quora answer by Onufry Wojtaszczyk who dives into what makes Borg (Google’s cluster management system) complex and where that complexity arose from. Here are some snippets, but you should go read that answer in full.

In my experience, most huge systems have:

Started out with a large, relatively complex, but not huge kernel. Grown significantly larger, and locally more complex, due to feature growth. Been broadly redesigned several times in large efforts led by senior engineers who understood the majority of the system.

Borg, like most complex systems, was initially designed complex, but nowhere near complex as today. It began with a core idea that could be described in a page of text.

A large part of what happened later was feature growth. SSD was added as a resource. It turned out that the initial assumption - that if a spinning drive breaks, all workloads that were using this spinning drive should be considered dead - was incorrect, and codepaths to allow surviving disk loss were added. Proactive rescheduling to bin-pack the cell better was added.

Additionally, an ecosystem was built around the basic core of the master and the borglet - introspection tools, a configuration language, automation to predict the resource requirements (instead of having people enter them manually in a config), quota management systems, cross-cell scheduling, and many, many more, each one built to address a specific need.

Finally, the system was improved as we went. The initial count-everything-once-every-second resource accounting was gradually replaced with kernel-based mechanisms (cgroups). We overhauled error reporting, because our users couldn’t make sense of what they’re seeing. Read-only GUIs were added at both the Borgmaster and Borglet level, and then a separate cross-cell UI was built with improved capabilities.

All of these are all major systems or functionalities that expand Borg and make it larger and more complex - but they do not necessarily make it deeper.

Too many abstractions?

Of course, this is not to say that software should be complex. We should all strive for simpler software. As Jonathan Blow noted, a lot of software has too many abstractions which makes accomplishing even simple things more involved than they need to be. Complexity should only be introduced if it is needed. The purpose of this blog post was not to make you erect an ivory tower of abstractions, but to help appreciate how there’s some inherent level of complexity depending on the problem domain and engineering goals.