Go Space Monkey

This article was originally posted on Space Monkey’s blog.

Optimal performance has always been a top goal of Space Monkey’s product offering, and though we are making steady progress on this journey, we haven’t arrived yet. So we wanted to talk a bit about that and delve deeper into what’s been going on behind the scenes for the last few months. This post will definitely be technical in detail.

Correctness first, optimize later!

The core of our software began as distributed systems research, written in Python. Space Monkey benefited greatly from the Python programming language’s expressiveness, rich standard library, and extensibility. We also benefited greatly from the Twisted event-driven library. Both Python and Twisted are wonderful systems and we love them.

One common refrain in the software development community is that programmer time is more valuable than processor time. This is sage and valuable advice. It is easy to solve many problems by throwing more hardware at them, provided that your problems are not algorithmic complexity, and sometimes even then. With Python in particular, it is very easy to optimize hot paths by rewriting them in C. Even though we were shipping a performance-critical distributed system on embedded hardware using reduced resource ARM chipsets, we knew that our system would be heavily IO-bound, and any CPU-intensive parts could easily be rewritten in C.

Our back of the napkin calculations said this approach should have been sufficient. Our ethernet device tops out over 20MB/s, and our crypto chip and CPU, used together solely for data transfer, can do SSL at 8 MB/s. That said, our initial versions of the system routinely only achieved transfer rates near 1MB/s.

So, what was happening? Our throughput has been thoroughly CPU constrained. The amount of work our little ARM CPU was doing to make sure your encrypted data is safe on the Space Monkey network (Reed Solomon! DHT maintenance! Oh my!), in addition to SSL for file transfers and just general Python runtime and Twisted overhead was enough that the CPU was pegged and couldn’t process things any quicker.

Time to optimize, right? Starting with a 90k line codebase of Python and C, fast forward to now and we’ve spent months finding hotspots, optimizing, configuring, and writing C modules, rinse, lather, and repeat. Some highlights:

We enabled our on-board crypto chip for hardware accelerated cryptography in all of our libraries and code. We also found and fixed bugs in said drivers.
We patched the kernel to do DMA directly to our hardware accelerated crypto chip.
We moved file encryption to the client, where it should have been to begin with.
We enabled cgroups on our Python process to tightly control memory usage and swap contention without requiring a different Python memory allocator scheme.
After continuing to hit walls with the standard Python Protocol Buffer library, we wrote our own that was 5x faster than the barely-documented C++ Python Protocol Buffer compiled module support.
We rewrote Twisted Deferred handling to be 30% faster.
We optimized the Twisted Reactor event loop to not only do less constant work, but to prefer incoming I/O.
We wrote our own Reed Solomon library that was 10% faster.
We reduced user-facing head-of-line blocking with Twisted by rewriting our codebase to use multiple reactors.
We made countless other small optimizations and tweaks.

After all of this optimization, we got up to 1.2 MB/s.

We spent long nights poring over profiling readouts and traces, wrote our own monitoring system, and wrote our own benchmarking tools. The fundamental holdup seemed to be that the CPU was simply doing too many non-optional things. Our main event library was doing too much bookkeeping. When your I/O loop framework is your hot path and all your code fundamentally relies on that framework, rewriting in C is tantamount to starting over.

Twisted and Python are great on adequate hardware, but our little ARM devices were pooped.

We hoped that the PyPy JIT Python interpreter might save us. Unfortunately, our ARM architecture does not support the floating point instruction set PyPy requires.

Hail Mary

Python performance is frankly not as good as compiled languages. In practice, this is hardly an issue – until it is. Python no longer holds the powerful position it once did at Google. Dropbox is writing a new Python runtime to try to deal with issues they’ve found. It was infeasible to build our own Python runtime and PyPy is not ready for primetime yet on our hardware. So we started looking at possible compiled languages to switch to. The Space Monkey development team has spent many years at previous employers working in C and C++ (and we actually maintain a large C++ codebase for desktop clients internally at Space Monkey), so those were seriously considered.

But we’ve also relied on Google’s new Go language for many of our supporting cloud services since Space Monkey’s inception. Our NAT-failover relay system has been written in Go since before Go1. With already 40k lines of code and experience in Go, we knew Go was semantically very similar to Python.

So in the bottom of the ninth, we decided to transliterate our 90k lines of Python directly to Go, line by line.

It took us about 4 weeks.

It was a heroic effort by the whole team. With a very clear set of rules about transliteration up front, we very carefully changed the code flow from Python to Go. With careful line-by-line transliteration, we avoided many of the pitfalls teams typically make when they decide to rewrite. We then did pair-programming audits of each and every line. We ported our integration tests. We ran our system tests. The tests passed.

Our very rough initial draft achieved speeds of 4MB/s with only 16% CPU utilization. We decided that an optimization in the hand is worth two in the bush, and have spent the last few weeks running the code through its paces, stabilizing the new codebase, searching for and eliminating every bug we can find.

This new code began rolling out this week. It will likely take a few more weeks to reach all of our current customers, but once it does, you should see drastic performance improvements.

Edit: A few more things I probably should have mentioned that people have seemed curious about.

We had a reduction of memory usage by about 30MB on average. Many of our Python processes (constrained by cgroups to not use much swap) used about 80-95MB of resident memory. Our rewritten Go processes instead use about 50-65MB of resident memory with the same cgroups configuration.
The 90K number includes our tests, but without tests, the Python codebase is 36,784 lines of code. Those same lines of code became 41,717 lines of Go code. So, not that much of an increase. When you consider that’s just 4,933 more lines, it’s not crazy to assume most of those are closing braces.

The future

Obviously we aren’t done. 4MB/s is a 400% increase in speed on most workloads, but the team isn’t satisfied to rest here. We’ll be working tirelessly to bring that number up even more.

Our CPU issues in the past have delayed the release of new on-device features such as SMB and DLNA support. The demos you saw in Kickstarter Update 16 were enabled by this new Go codebase. Since we now have CPU room to spare, SMB and DLNA will be out shortly.

So expect great things soon!

Thoughts on Go

As the adage says, programmer time is much more valuable than processor time. So you should use Go.

Go is a high-level language. It is easy to write in just like Python.
It has a rich standard library just like Python.
Go’s unit testing framework is phenomenal.
Unlike Python, Go is statically typed, which makes refactoring large codebases unbelievably easier.
Like Python, Go supports a form of duck-typing, through interfaces. Interfaces are the bees knees.
Like Python, Go is garbage collected.
Unlike Python, Go frees you from worrying about distinctions between synchronous and asynchronous calling conventions or threadpool switching.
Like Twisted, Go is evented for I/O, but unlike Twisted, Go stack traces are much more manageable and Go’s profiling tools provide much greater clarity about execution order.
Unlike Python, Go is fast, compiled, and leaves the runtime execution to only the work that needs to be done.

It’s a shame Go didn’t exist when Space Monkey started. We’re very lucky to have it now.

Open source?

In transliterating a large Python codebase to Go, we ended up porting or writing some useful things we had already written in Python or Python already had. We’ve also written some useful tools for understanding and debugging Go.

We’ll be open sourcing a handful of libraries and utilities we’ve written to give back to the Go community in the coming weeks, but while we work on getting that ready to go, we wanted to let our users know what’s going on. So stay tuned!

Go’s standard library is very rich and useful and it was rare that we had to stray outside of it, but we wanted to give a special shout out to Walter Schulze’s excellent gogoprotobuf extended Protocol Buffer library.

Update: We open sourced some stuff!

Want to work on Go?