Making Code Faster (Topic)

A Few Cautionary Notes

  • Tuning your code is a very easy way to waste lots of time. With that in mind, think carefully about how much energy it actually makes sense to invest in speeding up your code. If you have something that takes an hour to run, it may be annoying, but at the end of the day it may not be worth spending 2 hours figuring out how to speed up that code if you only need to run it a couple times to clean your data.
  • I've included links to guides on improving R and Python performance below, and I strongly recommend looking at them! However, many of the tricks computer scientists suggest for improving code speed are hard for social scientists to implement because they require really understanding how your computer works at a low level. With that in mind, this section emphasizes the advice I think is likely to be most accessible to social scientists.
  • Finally, this section is primarily written with users of R and Python in mind.

Before Anything Else: Find Your Bottlenecks!

If you take nothing else away from this page, please read and remember this section!

There's no reason to tune a line of code that is only responsible for 1/100 of your running time, so before you invest in speeding up your code, figure out what's slowing it down -- a process known as "profiling" your code. Thankfully, because this is so important, there are lots of tools (called profilers) for measuring exactly how long your computer is spending doing each step in a block of code.
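
For example, in Python the standard-library cProfile module will tell you how much time each function call is taking (R users can do the same thing with Rprof() or the profvis package). Here's a minimal sketch -- the load_data() and expensive_step() functions are just made-up placeholders for your own code:

```python
import cProfile
import pstats


def load_data():
    # Placeholder for a cheap step (e.g., reading in a smallish file)
    return list(range(100_000))


def expensive_step(data):
    # Placeholder for the step that actually dominates your running time
    return [x ** 2 for x in data for _ in range(5)]


def main():
    data = load_data()
    expensive_step(data)


# Profile main() and print the five steps with the most cumulative time
cProfile.run("main()", "profile_output")
pstats.Stats("profile_output").sort_stats("cumulative").print_stats(5)
```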

Methods for Speeding Up Code

AFTER you've identified which parts of your code are running really slowly by profiling them (as discussed above), the next question is how to make them run faster. The following fixes are listed in the order in which I would suggest pursuing them -- the first ones have high returns and are relatively easy to implement, while the later ones are harder and should only be pursued if you've tried the preceding options!

1. Check your memory use

When your computer is manipulating data (defining variables, running regressions, etc.), what it's actually doing is grabbing a little bit of data from storage, moving it to the processor, doing math with it, and then putting it back in storage.

To simplify somewhat, we can think of your computer as having two forms of storage -- RAM (sometimes called main memory) and your hard drive. But these two forms of memory are very, very different. Your computer can grab data from RAM roughly 100,000 times faster than it can grab data from the hard drive. Because of this, your computer is much happier (and performs faster) when it's able to keep all the data you're working with in RAM. Indeed, if you're moving data back and forth to your hard drive, that's almost certainly your biggest bottleneck -- on modern computers, doing the actual computations is almost instantaneous compared to moving data back and forth to the hard drive.

You can learn more about how to know whether your computer is wasting time going back and forth to the hard drive here.

Important: Just because your dataset seems small enough to fit into RAM doesn't mean this isn't relevant for you! Programs often make copies of your data when they manipulate it, so even if your dataset is 2GB and you have 8GB of RAM, the program you're using can very easily end up using all 8GB of RAM!
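
If you're working with pandas in Python, one quick way to get a rough sense of how much RAM your data is actually occupying is the memory_usage() method. The DataFrame below is just a toy stand-in for your real data:

```python
import numpy as np
import pandas as pd

# A toy DataFrame standing in for your real data
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
    "group": np.random.choice(["a", "b", "c"], size=1_000_000),
})

# How much RAM does the DataFrame itself occupy? (deep=True counts the
# actual strings in the "group" column, not just the pointers to them)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

# Even a simple operation like sorting returns a full copy of the data,
# so peak memory use can easily be a multiple of the DataFrame's size
df_sorted = df.sort_values("value")
```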

2. Using other people's (compiled) functions

As a general rule, code you write yourself in R or Python is slow. But don't worry, it's not your fault: R and Python are fundamentally slow languages, so anything written directly in them will be slow.

But interestingly, the commands other people have written and made available in R or Python libraries are usually actually written in faster ("compiled") languages, like C++. As a result, whenever you have the choice between writing a function yourself or using a function from an established library, you're almost always better off using the one someone else wrote.

Now, using other people's functions is not foolproof -- some people write their libraries in pure R or Python (not a compiled language like C++), so those libraries may run as slowly as your own code. If you can, check the documentation for whatever library you want to use to see whether it was written in C / C++ or not!
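
As a rough illustration (in Python, though the same logic applies in R), here's a sketch comparing a sum written by hand to NumPy's compiled np.sum(). The exact numbers will depend on your machine, but the compiled version is typically orders of magnitude faster:

```python
import time

import numpy as np

data = np.random.rand(5_000_000)


def my_sum(values):
    # A sum written by hand in pure Python: every step runs in the interpreter
    total = 0.0
    for v in values:
        total += v
    return total


start = time.perf_counter()
my_sum(data)
print("hand-written loop:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
np.sum(data)  # np.sum is implemented in compiled C code
print("np.sum:           ", time.perf_counter() - start, "seconds")
```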

3. Vectorization

In most programming languages, if you want to apply a function to each item in a matrix or vector, you would just create a loop that extracts each element of the vector, modifies it, and then moves to the next item. This will work in R or Python, but it is very slow. What you want to do instead is "vectorize" your function, which means you use a specific command to tell R or Python that you're trying to apply your function to each item in the vector. When R/Python knows this is what you want to do, it has ways of making that function execute much more quickly behind the scenes.

Both languages have a number of tools for this:

  • In R, these come as a family of tools with names related to "apply()" (for example, apply(), lapply(), and sapply()); there's also a library called "plyr". But what matters most is not which one you use, but rather that you use one of them and never write loops over your vectors or matrices!
  • In Python (specifically in pandas), the relevant tools are the "apply()", "agg()", "map()", and "transform()" methods -- see the sketch just after this list.
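
To make this concrete, here is a minimal Python/pandas sketch (the income column is made up for the example); the R apply() family follows the same logic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": np.random.rand(100_000) * 100_000})

# Slow: looping over the rows one at a time in the interpreter
logs = []
for _, row in df.iterrows():
    logs.append(np.log(row["income"]))
df["log_income_loop"] = logs

# Faster: .apply() tells pandas to apply the function for you
df["log_income_apply"] = df["income"].apply(np.log)

# Fastest: a fully vectorized call that transforms the whole column at once
df["log_income"] = np.log(df["income"])
```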


4. Parallelization -- probably not your best option

Parallelization (which has its own Parallelization page here) is often the first thing social scientists turn to when trying to speed up their code. Whether this is the right decision depends a lot on your situation, but here are a few facts:

  • Parallelization is "sub-linear", meaning that if you parallelize across two cores, you can expect a ~1.8x speedup at best. (More generally, for N cores, expect speedups of ~0.8*Nx.)
  • Improving how your code is written can often yield much higher returns than parallelization, on the order of 5x, 10x, or 100x speed improvements.
  • Using code that was written in C++ by someone else can yield 10x or 100x returns.

So, parallelization certainly isn't the best way to speed up your code. But if you read this and find yourself thinking "um, I still don't really understand what makes code fast versus slow" or "I've already tried all that and it's still too slow!", it's a good option.
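
If you do decide to parallelize, the dedicated Parallelization page is the place to look; just as a taste, here's a minimal Python sketch using the standard-library concurrent.futures module, where simulate() is a made-up stand-in for whatever slow, independent task you're repeating:

```python
import concurrent.futures

import numpy as np


def simulate(seed):
    # Stand-in for one independent, CPU-heavy task (e.g., one bootstrap draw)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(1_000_000).mean()


if __name__ == "__main__":
    # Spread 100 simulations across 4 worker processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(simulate, range(100)))
    print(len(results), "simulations finished")
```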

5. Compile your own code

So you're sure you're only using RAM, you've vectorized your functions, you can't find a library with the function you need, and when you write it yourself it's still too slow -- even if you parallelize it? The last option is to compile your own code. That's a little too involved for this page, so if you get here, check out this page for more details.
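
Just to give a flavor of what this looks like, one relatively accessible option in Python is the numba library, which compiles an ordinary Python function to machine code the first time you call it. This is only a hedged sketch of the idea, not a substitute for the page linked above:

```python
import numpy as np
from numba import njit


@njit
def pairwise_distance_sum(points):
    # Plain Python loops, but numba compiles them to fast machine code
    total = 0.0
    n = points.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(points.shape[1]):
                d += (points[i, k] - points[j, k]) ** 2
            total += d ** 0.5
    return total


points = np.random.rand(500, 2)
print(pairwise_distance_sum(points))  # the first call triggers compilation
```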

Learn about Data Structures: If you REALLY want to speed up your code

If this were written for computer scientists, this section would be first. Different data structures -- lists, matrices, dataframes, dictionaries, etc. -- are all very different in terms of how they work at the lowest levels of your computer. As a result, they have very different performance characteristics, and nothing will have more of an effect on the speed of your code than your choice of data structures.

With that said, because most social scientists haven't been taught much about the inner workings of computers, it is understandably hard for them to know which structure is best in each situation. If you really want to get good performance though, it's something to look into -- check out a starter page here.
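
As one concrete example (in Python): checking whether a value appears in a list means scanning the list element by element, while the same check against a set is a single hash lookup. The numbers below are illustrative, but the gap is enormous:

```python
import time

ids_list = list(range(10_000_000))
ids_set = set(ids_list)

start = time.perf_counter()
print(9_999_999 in ids_list)  # scans the whole list, element by element
print("list membership:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
print(9_999_999 in ids_set)  # a single hash lookup
print("set membership: ", time.perf_counter() - start, "seconds")
```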

Guides Written By Wiser Minds than Me

Speeding up R Code

Speeding up Python Code (note: written for Python, not Pandas. Pandas has some extra tricks)

Improving Pandas (Python) Speed