Like 200,000+ other people, I watched NASA's Mars Curiosity rover landing earlier this August. Like most of them, I was suitably impressed by how difficult the challenge was. The code behind it is awesome.
What is it like to write mission-critical software that can't fail? How do you create systems of such high fault tolerance? What processes produce code that can autonomously land a craft fourteen minutes beyond the range of human intervention?
Curiosity's code runs on a 200 megahertz processor with 256MB of DRAM, 256KB of electrically erasable programmable read-only memory (EEPROM), and 2GB of flash memory. Running on that hardware is 2.5 million lines of C code. The C language, despite being developed between 1969 and 1973 at Bell Labs, remains a great solution for embedded systems; much of my first job post-college was programming digital effects pedals for guitars using it. It's a mature language with robust tooling, known best practices, and a proven track record on other high-profile projects (most notably the previous rovers, Spirit and Opportunity).
Despite the codebase running to millions of lines, there were only thirty people on the actual programming team. Around ten more made up the test team. Tests were written in Python with an emphasis on log analysis (trying to catch everything in real time was too likely to miss error states).
Ensuring that a $2 billion project succeeds places quite an emphasis on testing. These Python scripts had to be written to specifications. But, as a slideshow on the project's development points out, the problem with specifications is that they're:
- time consuming to write
- hard to auto-generate
- difficult to read which hinders
NASA also has a number of guidelines to ensure the highest levels of code quality. Recursion is shunned, for example, because without it the call graph is acyclic and stack usage can be bounded ahead of time; with it, nothing guarantees the stack won't overflow. Loops must have a fixed upper bound so that static analyzers can prove they terminate. Nearly all memory is statically allocated, which avoids the overhead and unpredictability of dynamic allocation (C has no garbage collector, but heap fragmentation, leaks, and failed allocations are all possible sources of instability). And isolation of systems is paramount: through memory protection and single ownership of data, it is that much harder for subsystems to interfere with each other. This aligns nicely with the principles of high cohesion and low coupling mentioned often in OOP (Object-Oriented Programming) circles.
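To make those guidelines a little more concrete, here is a minimal sketch in C of what statically allocated, single-owner data with bounded loops might look like. Every name in it (`sample_buffer_t`, `g_buffer`, `buffer_push`, `buffer_sum`, `MAX_SAMPLES`) is hypothetical and chosen for illustration; none of it is taken from the rover's flight software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is an illustration only. */
#define MAX_SAMPLES 8u /* capacity is fixed at compile time */

typedef struct {
    int32_t values[MAX_SAMPLES];
    size_t  count;
} sample_buffer_t;

/* Statically allocated and file-local: one owner, no malloc/free. */
static sample_buffer_t g_buffer;

/* Appends a value. Returns 0 on success, -1 if the buffer is full. */
static int buffer_push(sample_buffer_t *buf, int32_t value)
{
    assert(buf != NULL);
    if (buf->count >= MAX_SAMPLES) {
        return -1; /* reject rather than grow: memory use never changes */
    }
    buf->values[buf->count] = value;
    buf->count++;
    return 0;
}

/* Sums the stored samples. The loop can never run more than
 * MAX_SAMPLES times, so an analyzer can see that it terminates. */
static int64_t buffer_sum(const sample_buffer_t *buf)
{
    int64_t total = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < MAX_SAMPLES && i < buf->count; i++) {
        total += buf->values[i];
    }
    return total;
}
```

Pushing the values 1 through 10 into `g_buffer` would accept the first eight and reject the last two with `-1`, after which `buffer_sum` returns 36. The point of the design is that the worst case is visible in the source: memory use is fixed at compile time, and no loop or call chain can run longer than a known bound.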
How cohesive are we talking for each function? There is a sixty-line limit on functions. Not only are smaller pieces of code more comprehensible, but there are fewer places for bugs to hide. For the complete overview, their guidelines are available online to review.