So you want to optimize your code, eh? Who am I to blame you? I certainly want you to optimize your code!
I spent a few years as part of the Firefox Performance Team, on the frontline of, well, performance. I still bear some of the scars. So, as the grizzled perf-veteran that I have decided to be for the day, let me invite you to sit down for a while and share a little hard-earned experience on code optimization.
Make sure your code is ready for optimizations
It all starts with a story. A long time ago, in a galaxy far far away, toiled a developer who intended to make Firefox faster. One of the things that slowed down Firefox was that the disk was accessed on the main thread. If the disk was too busy, or waking up from sleep, this could cause the Firefox user interface to freeze for up to three seconds.
That wasn’t good. Our intrepid developer was in charge of making sure that this didn’t happen anymore. This meant moving all disk accesses to another thread, which was complicated by the fact that, for historical reasons, all the APIs that needed to access the disk were synchronous.
Anyway, after countless weeks carefully moving things across threads, making sure that unit tests passed (this required expanding the unit test framework to allow async tests), making sure that the integration tests passed (same story), making sure that the performance tests passed (no changes needed, lucky him) and testing manually, our developer finally merged the task and informed the rest of the team that he was done with this one file access.
It was then that Firefox started breaking apart in weird and unexpected places. Crash Scene Investigations reported a surge in crash reports from Firefox Nightly. Reverting and debugging ensued. In the end, the culprit was identified. It was the dreaded nested event loop, an API (called elsewhere in the code) that can do weird things with causality. After much more debugging and blood sacrifices, and a complete refactoring of the shutdown code of the browser, said nested event loop was tamed, the code merged and performance improved.
The anonymous developer happily moved on to the next file access.
Despite the countless weeks spent working on this optimization, this story is actually one in which things went reasonably well:
- There were plenty of tests (even if these tests didn’t catch the issue);
- There were additional safeguards, with Firefox Nightly, Crash Reports and volunteer testers (which did catch the issue);
- There were a process and a mechanism that allowed our anonymous hero to back out the culprit without compromising the product;
- The code, while quite large, was well-tested, well-documented and clear enough that performing further refactorings was possible in the first place;
- The many teams involved were all aware of the difficulty of the task, of its importance, and were ready to spend a little time brainstorming or reviewing solutions, even if they required some sophisticated rearchitecturing.
Critically, all these points were in place before the optimization began. If they had not been, the story would have ended considerably worse.
Lesson 1 Make sure that your code is ready for optimizations.
You will break things during optimization. Some of the things you break will be your fault. Some were already broken, and you’ll just have made them visible.
- If you can’t detect breakages, you’re killing your product.
- If you can’t revert your optimizations, you’re killing your product.
- If your processes or tools won’t let you fix breakages that arise, you’re setting yourself up for failure.
There are many ways to make sure that your code is ready for optimization. Documentation may not be sufficient. Static types may not be sufficient. Unit tests may not be sufficient. Integration tests may not be sufficient. Manual tests may not be sufficient. A crowd of testers may not be sufficient. Crash reporting may not be sufficient. But every one of these things that you’re missing is a hole in your defenses.
Lesson 2 Defend in depth.
And yes, that lesson is also valid for safety and for security.
Determine what you’re optimizing for
This story takes place earlier in the history of Firefox. A new contender had recently joined the arena, an upstart called Chrome. The Firefox teams were cringing a little bit because the name Chrome belonged to them (the chrome being the historical name of the Firefox front-end), more so because users claimed that Chrome was much faster than Firefox and even more so because the benchmarks disagreed.
And yet, pretty much every single user reported the opposite.
It is then that they realized that they had spent the past year optimizing for the wrong metrics. And that they started moving file I/O off the main thread.
This is a case in which things could have gone much better. An entire year lost. Still, things could have been worse. The Firefox team knew that something was off, even if they couldn’t quite figure out what.
In this case, Mozilla had optimized for throughput at a time when it should have focused on responsiveness. But for your application, it might be something different entirely. Here are a few possibilities:
- If you are working on mobile or IoT, you may need to optimize for battery and possibly memory.
- If you are working on a video game, you will need to optimize for video and audio rendering smoothness and for control responsiveness.
- If you are working on a frontend, you will need to optimize for time to first rendering, time to first interactive and responsiveness.
- If you are working on a scientific tool, you will need to optimize for total execution time.
- If you are working on a backend, you will need to optimize for total execution time, memory, disk and CPU.
- If you are working at datacenter- or planetary-scale, you will need to optimize for energy and possibly hardware wear.
- If you are working on a streaming solution, you will need to optimize for bandwidth.
In many cases, you will have more than one target. In many cases, too, targets may be contradictory. For instance, if you’re optimizing for total execution time and/or responsiveness, past a certain point, this is going to cost you battery, energy, CPU and possibly memory. Possibly worse, one of the tools used to achieve responsiveness is making things async, which is going to end up increasing your total execution time.
Lesson 3 Know what you’re optimizing for.
- Optimizing for the wrong metric will waste your time.
- Optimizing for the wrong metric may push the right metric in the wrong direction.
- Optimizing for the wrong metric will very likely make your code less ready for further optimizations.
Oh, by the way, how do you know what to optimize for in the first place? Maybe your code is already optimized enough? Well, that’s where you need your product lead to chime in, or directly your users.
Lesson 4 Optimization is a feature.
Corollary: it may be backlogged until it becomes a priority.
Determine what you’re willing to trade off
Our story continues years later. Efforts to make Firefox faster had succeeded. Firefox felt smoother, more responsive, users were largely happy. But the developers knew that this couldn’t last. Any add-on could slow down Firefox, freezing its user interface for arbitrary durations. Any webpage could do the same. The solution was simple: move to a multi-process architecture. In fact, by then, a multi-process prototype had been available for years. This architecture was better not only for performance but also for security and reliability.
But there was a snag: the XUL model for developing add-ons was not, and demonstrably could not be made, compatible with the multi-process architecture. So moving to the multi-process architecture would mean dropping pretty much all existing add-ons and quite possibly angering the community that relied upon these add-ons. Also, multiple processes meant a larger memory footprint.
The Firefox team dithered and delayed for several years, hoping to find a miracle solution. But in the end, sacrifices were made (you may wish to take a look at the comment section, too, as well as comments on HN and Reddit to get an idea of the reactions to that choice).
Sometimes, optimizing will require sacrifices. Strike that. Optimizing always requires sacrifices.
- Caching results may improve your speed, but it will cost you memory, and possibly consistency.
- Cutting features may let you get rid of inefficiencies, but it will cost you, well, features.
- Approximating results may be how your game reaches 60fps, but players may be unhappy about weird bounding boxes.
- If you are splitting your backend into microservices, you are sacrificing CPU efficiency, energy, bandwidth to gain scalability.
- If your game needs a recent GPU to achieve decent performance, you are sacrificing some of your userbase, in addition to natural resources, to gain smoothness.
- If you are called Netflix, you are probably willing to trade staggering CPU time for a small percentage of bandwidth gains.
- If, for the sake of performance, you are using C or C++, or bringing in native modules in your higher-level language, or bringing in external components such as Redis or Kafka, you are accepting an extended security perimeter and lesser built-in safety.
- In every case, you are trading away engineering time for optimization.
- In most cases, you also make your code more complicated, trading readiness for optimization.
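To make the first trade-off in this list concrete, here is a minimal sketch of caching in Python. The names `PRICES`, `slow_lookup` and `cached_lookup` are hypothetical stand-ins, not taken from any real codebase. The cache buys speed, `maxsize` caps the memory it may spend, and the stale read shows the consistency you give up until you invalidate explicitly.

```python
from functools import lru_cache

# Hypothetical stand-in for a slow data source (a database, a remote API...).
PRICES = {"apple": 1.00}
LOOKUPS = 0  # counts how often the slow path actually runs


def slow_lookup(item: str) -> float:
    global LOOKUPS
    LOOKUPS += 1
    return PRICES[item]


# maxsize bounds the memory we are willing to pay for the speedup.
@lru_cache(maxsize=1024)
def cached_lookup(item: str) -> float:
    return slow_lookup(item)


first = cached_lookup("apple")  # slow path; the result is stored
PRICES["apple"] = 2.00          # the underlying data changes...
stale = cached_lookup("apple")  # ...but the cache still returns the old value

cached_lookup.cache_clear()     # explicit invalidation restores consistency
fresh = cached_lookup("apple")  # slow path again, now with the new value
```

How often you can afford to serve a stale value is a product decision, which is exactly why this is a cost and not a free win.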
Lesson 5 Optimizations have a cost.
If you are not willing to pay this cost, do not perform the optimization.
And since optimizations have a cost, you may run out of budget while optimizing.
Lesson 6 Scope and checkpoint your optimizations.
- If your optimization process doesn’t have checkpoints, it’s much harder to measure your improvements.
- If your optimization process doesn’t have checkpoints, it’s much harder to give your QA / alpha-testers / post-CI testing process time to confirm that your optimizations have not broken your code.
- If your optimization process doesn’t have checkpoints, you risk the entire process being dropped without results when it ends up being longer and more complicated than expected.
Fast forward a few more years. Firefox had lost many add-ons, but was now a fast and reliable browser, once again. It felt good. Developers, who were all Firefox users, finally could brag that their browser felt smoother than it ever had. Benchmarks were green. Also, the switch to the new add-ons model and to the multi-process architecture had enabled considerable cleanups, which made the codebase much fitter for further optimizations.
But still, Performance Telemetry insisted that Firefox was slow for a non-negligible percentage of users. No developers could ever reproduce these slowdowns, so surely, it had to be some kind of mistake? Some exotic configuration? Perhaps unreliable GPU drivers? Some buggy anti-virus that drained all CPU? Surely this could be ignored?
Until one day, one developer-turned-manager decided to try something different. He ordered a bunch of low-end laptops, the ones that you probably never use if you read (or write) this blog. Not even the kind of laptop that you can find for cheap in supermarkets, but the kind of laptop that you could find last year in a supermarket. He then proceeded to distribute these laptops to performance developers.
Everybody had to agree that there was still some work to do on the performance front.
Once again, this is a case in which things worked fairly well, even if it took some time to get there:
- Developers were dogfooding their product.
- There were a fair number of benchmarks (both micro-benchmarks and end-to-end benchmarks).
- There were additional safeguards, such as alpha testers.
- Firefox had Performance Telemetry, which measured real-world performance and trends.
- A team was looking at the results of Performance Telemetry and taking notes on surprising results.
- Someone was really motivated to dig into the notes.
Very few organizations have all these steps. And while the first five worked and caught numerous issues, none of them noticed this particular issue. Once again, all these steps were in place before the optimization work started.
Now, let’s unpack this. Your primary tools for measuring improvements are benchmarks (and, at a later stage, profiling data). They are incredibly useful, but they’re also lies. Your JIT, your optimizing compiler, your CPU cache, your disk cache, your OS, your router, your semi-hardware battery monitor are all conspiring to make every number you get from a benchmark unreliable. Or, as the saying goes, “The second algorithm is always faster.”
Lesson 7 Use benchmarks but don’t trust them.
- The numbers given by a benchmark are incredibly noisy.
- The numbers given by a benchmark are useless if your machine is doing anything else.
- The evolution of numbers given by a benchmark is useless if you have upgraded your machine.
- The evolution of numbers given by a benchmark is useless if you have updated your OS.
Corollary: You need a dedicated machine.
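One way to live with that noise is to never look at a single run. The following is a minimal sketch, not a replacement for a real benchmarking framework: `bench`, `summarize` and `candidate` are hypothetical names, and the run counts are arbitrary assumptions. It discards a few warm-up runs, then reports the median together with the spread, so that you can see how unreliable the numbers are.

```python
import statistics
import time
from typing import Callable, Dict, List


def bench(fn: Callable[[], object], runs: int = 30, warmup: int = 5) -> List[float]:
    """Return one wall-clock duration (in seconds) per measured run."""
    for _ in range(warmup):
        fn()  # let caches (and any JIT) settle; these runs are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Summarize noisy samples instead of trusting any single number."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def candidate() -> int:
    # Hypothetical code under test.
    return sum(i * i for i in range(10_000))


report = summarize(bench(candidate))
print(report)
```

If the gap between `min` and `max` is larger than the improvement you are hoping to measure, the benchmark cannot tell you whether your change helped.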
Of course, as we’ve seen, benchmark numbers are not sufficient. You’re going to need real-world numbers. That’s what Firefox is doing (yes, we learnt that from Chrome). This is the mechanism known as Performance Telemetry, or simply Telemetry, and it’s about phoning home with (anonymous) performance information. It is of course easier if you own the device phoning home (i.e. if you’re writing a backend application), but it can be deployed to almost any device with connectivity. Just be certain that Telemetry doesn’t interfere with your optimizations (on a low-powered device, sending data can be costly) and doesn’t leak private information. Also, please think carefully about the privacy implications of Telemetry and whether it can/should be opt-in or opt-out.
Note that Telemetry data is even more noisy than benchmark data. If you have many users, they will be running your code on a variety of devices, network or OS configurations, concurrently with any number of other applications, or your numbers may be affected by a user putting their laptop to sleep. Experience indicates that Telemetry is even affected by religious festivals.
Lesson 8 Use real-world data but don’t trust it either.
- If you’re not measuring real-world performance, you have no idea whether you have improved anything.
- If you have not started measuring real-world performance before you start optimizing, you have no idea whether you have improved anything.
- If it’s possible, don’t forget to include your cost in your measurements.
- Real-world performance is incredibly noisy.
So if benchmark data is noisy and real-world performance data is even more noisy, how can you trust these numbers? You can’t. However, if 5% of your users start having performance issues regularly, there are really good chances that performance has regressed. It is time to investigate.
Lesson 9 Extract reliable data from unreliable data.
- Graph the evolution of your median performance. It will show you how you’re doing in the typical case.
- For each metric, graph the evolution of the percentage of samples with unacceptable performance. It will show you how you’re doing in the worst case.
This applies both to benchmarks and to real-world data.
Corollary: You need to have decided what “unacceptable performance” means. You can have more than one threshold (e.g. barely acceptable/unacceptable/unusable).
Chances are that you will always have samples with unacceptable performance, because of the noise.
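Both numbers are cheap to compute. Here is a minimal sketch, assuming that samples are durations in milliseconds and that the 500 ms threshold is something you have picked for your own product; the function names and the data are made up for illustration.

```python
import statistics
from typing import Sequence


def median_ms(samples: Sequence[float]) -> float:
    """Typical case: half of the samples are at or below this value."""
    return statistics.median(samples)


def unacceptable_fraction(samples: Sequence[float], threshold_ms: float) -> float:
    """Bad case: share of samples slower than the chosen threshold."""
    bad = sum(1 for s in samples if s > threshold_ms)
    return bad / len(samples)


# Hypothetical startup times in milliseconds: mostly fine, two terrible.
samples = [120, 130, 125, 118, 900, 122, 127, 131, 119, 2400]

typical = median_ms(samples)                 # 126.0
bad_share = unacceptable_fraction(samples, 500)  # 0.2
print(typical, bad_share)
```

Note how the median barely notices the two terrible samples, which is why you need the second number: a mean would have been dragged up by them, and neither the mean nor the median tells you how many users had a bad day.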
And now that we have numbers, we can safely ignore them, right?
Lesson 10 Raise alarms.
- Numbers are useless if you don’t look at them.
- Raise the alarm if your median performance dips.
- Raise the alarm if your number of samples with unacceptable performance rises above some unacceptability threshold.
- Just as performance is a feature, a performance alert is a bug.
Corollary: You need to have decided what “unacceptability threshold” means. Given that measures are noisy, 0 is probably not an acceptable answer.
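An alarm can be as simple as comparing today’s numbers against a baseline. The sketch below is an assumption about how you might wire this up, not a description of how Firefox does it: `Snapshot`, `alerts` and both tolerances (10% on the median, 5% of samples) are hypothetical values that you would need to choose for your own product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    median_ms: float
    unacceptable_fraction: float  # share of samples above your threshold


def alerts(
    baseline: Snapshot,
    current: Snapshot,
    max_median_regression: float = 0.10,  # tolerate 10% of noise on the median
    max_unacceptable: float = 0.05,       # tolerate 5% of bad samples, not 0
) -> List[str]:
    """Return one message per alarm; an empty list means all clear."""
    found = []
    if current.median_ms > baseline.median_ms * (1 + max_median_regression):
        found.append(
            f"median regressed: {baseline.median_ms} ms -> {current.median_ms} ms"
        )
    if current.unacceptable_fraction > max_unacceptable:
        found.append(
            f"{current.unacceptable_fraction:.0%} of samples are unacceptable"
        )
    return found


quiet = alerts(Snapshot(100, 0.02), Snapshot(105, 0.03))
noisy = alerts(Snapshot(100, 0.02), Snapshot(120, 0.08))
print(quiet, noisy)
```

Whatever produces these messages should file them where bugs go, since a performance alert is a bug.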
Tools of the trade
The loop is almost complete. We’re close to where we began. At the time, there was no good way to determine where time went. Certainly, Firefox took too long to save its state, or to launch, or to scroll down, but why? Was it because of too much CPU used? Was the code using the disk too much? Was there a complexity issue in the garbage-collector?
Certainly, Telemetry could answer some of these questions. But adding Telemetry probes to each line of the code was out of the question. So developers needed a tool that could tell them, on their own computer, where the time went.
Some operating systems come with mechanisms that let code ask these questions to the OS and extract precise performance data. Experiments were made to take advantage of these mechanisms, but they proved disappointing, in part because only the computer’s administrator had any right to perform such measurements. Still, Firefox developers made use of these tools when they could.
Finally, Firefox developers, add-on developers and, soon, web developers had an almost-one-click access to performance information that they could use to make Firefox or their websites faster.
So, what went well and what went wrong?
- Through Performance Telemetry, developers could get an idea of which part of the code encountered performance issues. This mechanism was not sufficiently precise to guide optimizations through the entire process, but it did its job quite well.
- There were native tools, including native profilers and native probes, and Firefox developers made use of them. Without these tools, the considerable improvements to SpiderMonkey or other modules would not have been possible in the first place.
- In the end, tools adapted to the technology were even better. But building these tools had a cost.
- As we discussed previously, sometimes, the costs pile up and some optimization work (or in this case, optimization tooling) is dropped before it can become useful.
Your situation is, most likely, different. For one thing, it is fairly unlikely that you’ll need to come up with your own profiler.
However, one thing remains: the tools we have discussed in previous sections will let you monitor optimizations and the need for them, but they are not sufficient to investigate inefficiencies in depth. For that, you’ll need some kind of profiler, whether it’s a performance profiler, a memory profiler, a network profiler, etc.
Lesson 11 Profile, profile, profile.
- Your profiler shows you the biggest inefficiencies in your code/system.
- Your profiler is usually not as stable as a benchmark.
- Your profiler is definitely not as realistic as real-world data.
- Nevertheless, it’s your best friend.
A profiler will put you on the right track. A profiler will let you approach the problem from the outside, as an investigation. How simple the investigation will be depends on how many layers of abstraction you need to peel before you reach the culprit.
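If you have never used one, here is what a first investigation can look like with Python’s built-in `cProfile` and `pstats` modules. The workload (`slow_part`, `fast_part`, `workload`) is a made-up example; in real life you would profile your own entry point.

```python
import cProfile
import io
import pstats


def slow_part() -> int:
    # Hypothetical hot spot.
    return sum(i * i for i in range(200_000))


def fast_part() -> int:
    return sum(range(1_000))


def workload() -> int:
    return slow_part() + fast_part()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and keep the top entries: the culprit should
# appear near the top, along with whatever it calls.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

The timings in the report will differ from run to run and from machine to machine, so read it for the ranking, not for the absolute numbers.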
Lesson 12 Explicit is better than implicit.
- Your GC will make performance decisions without your knowledge.
- Your ORM will make performance decisions without your knowledge.
- Your API SDK will make performance decisions without your knowledge.
- Any operation that performs magic on your behalf makes performance decisions without your knowledge.
- Your programming language makes performance decisions without your knowledge.
Corollary: You will need to find the toggles for these performance decisions, if they exist.
Corollary: If the toggles do not exist, you’re in for turbulent times.
Corollary: For performance, prefer an ecosystem that favors explicit choices over implicit choices.
Corollary: The programming language + framework you used for prototyping may not be adapted once you have reached the need for performance.
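As an example of finding such a toggle, here is a minimal sketch that assumes CPython, whose `gc` module exposes the cyclic garbage collector. The helper `gc_paused` is a hypothetical name of mine, not part of the standard library. It suspends the collector around a latency-sensitive section and restores the previous state afterwards.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Suspend the cyclic garbage collector for a latency-sensitive section."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:  # only re-enable what we disabled ourselves
            gc.enable()


print(gc.get_threshold())  # the collector's current tuning knobs

before = gc.isenabled()
with gc_paused():
    inside = gc.isenabled()
    # Hypothetical latency-sensitive work goes here.
    frame = [[i] for i in range(1_000)]
after = gc.isenabled()
```

This only pauses the collection of reference cycles: reference counting still frees most objects immediately, and any cycles created in the meantime wait for the next collection. In other words, you have made one implicit decision explicit, and it now comes with a memory cost that you need to measure.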
One of the pitfalls of using ecosystems such as Python’s or Node’s is that they can easily get you some of the way towards your performance objectives, but that after the easy wins, the road suddenly becomes nightmarish.
If you reach the stage at which you need to optimize, chances are that you will do it much more efficiently in an ecosystem that is designed with final products in mind, even if that ecosystem is less efficient for coming up with prototypes.
One last word
Let us finish where we started, with our valiant developer fighting to get file I/O out of any performance-critical section.
If you recall, our developer was thwarted for a time by a nested event loop. That nested event loop was a leftover from previous work on improving the performance of another feature. Fixing the ensuing explosion required much more work than either moving the file I/O to a background thread or improving the performance of that other feature in the first place.
As you work towards optimization, you will often make your code more complicated and possibly more fragile. Whether you add fastpaths, move algorithms across threads or processes, rework data structures to take less memory, add or remove laziness, deploy external caches, introduce custom allocators… your logic will be more complicated to follow and may harbor subtle bugs.
That’s an expected part of the cost. But there are things that you can do to keep it contained.
Lesson 13 Contain your optimizations.
- If you don’t document your performance choices, expect people to undo them by accident.
- If you don’t defend in depth from the potential consequences of your performance choices, you are laying mines for the next developers who will work on this code.
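Both points can be sketched in a few lines. The example below is hypothetical (`RangeSummer` and `range_sum_reference` are names I made up): a precomputed table replaces a slow but obviously correct function, the docstring records why the table exists and what it assumes, and a self-check compares the fast path against the reference so that a later change cannot silently break it.

```python
from itertools import accumulate
from typing import List, Sequence


def range_sum_reference(values: Sequence[int], lo: int, hi: int) -> int:
    """Slow but obviously correct: the behaviour the fast path must preserve."""
    return sum(values[lo:hi])


class RangeSummer:
    """Answers range-sum queries in O(1) after an O(n) setup.

    PERF: the prefix table is deliberate. Queries were assumed to be the
    hot path; do not replace it with `sum(values[lo:hi])` without measuring.
    INVARIANT: the table is only valid for the values seen at construction,
    and queries must satisfy 0 <= lo <= hi <= len(values).
    """

    def __init__(self, values: Sequence[int]) -> None:
        # Private copy: later edits by the caller cannot desync the table.
        self._values = tuple(values)
        self._prefix: List[int] = [0, *accumulate(self._values)]

    def range_sum(self, lo: int, hi: int) -> int:
        return self._prefix[hi] - self._prefix[lo]

    def self_check(self) -> None:
        """Defense in depth: compare the fast path against the reference."""
        n = len(self._values)
        for lo in range(n + 1):
            for hi in range(lo, n + 1):
                expected = range_sum_reference(self._values, lo, hi)
                assert self.range_sum(lo, hi) == expected, (lo, hi)


data = [3, 1, 4, 1, 5]
summer = RangeSummer(data)
answer = summer.range_sum(1, 4)  # 1 + 4 + 1
data[1] = 100                    # the caller mutates its own list...
still = summer.range_sum(1, 4)   # ...and the summer is unaffected
summer.self_check()
```

The self-check is quadratic, so it belongs in your test suite or in debug builds, not in the hot path it is protecting.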
I hope that you have enjoyed reading this entry. If you have reached this point, you have now seen some of the tools and processes you’ll need to optimize efficiently.
Congratulations. It’s going to be a bumpy ride.
It is my hope that these lessons can apply to most of the optimization problems you may need to deal with.
There is certainly more to say about performance. In fact, there is enough content to fill entire bookshelves. But we’re reaching the end of this blogpost.
If I find time to write a followup, I might add a few ideas on how to deal with more specific issues, such as I/O, migrating code from sync to async, etc.
Have fun optimizing!