Writing Fast Python

Shel
8 min read · Dec 8, 2020
Photo by Marc-Olivier Jodoin on Unsplash

Introduction

Python is a very popular programming language, especially in data science, partially because it is easy to write and read. However, that comes with a cost: Python can be slow. I have been using Python for a few years now, but I find myself increasingly interested in compiled languages, which can be really fast, at the cost of more development time and possible distraction from the data-analysis task itself. Language design and features are usually trade-offs: development efficiency or runtime speed, power or safety, etc.; choose one. There are some languages aimed mainly at data analysis that are really fast as well as quite easy to use, for example Julia, but that is still largely a growing community, and the machine learning and AI world is still dominated by Python and C/C++. I'd like to share my experience writing faster Python, mainly for data analysis tasks. The advice falls into two parts: general guidelines and language-specific suggestions.

General Guidelines

Before diving into fine-grained details, I'd like to talk about some general guidelines that are not limited to Python or data analysis.

Avoid premature optimizations

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming. — Donald Knuth, The Art of Computer Programming

Even experienced programmers can guess the critical part wrong. I have to say that if you are using Python for data analysis, you probably have a better intuition about where the performance problems are, but still, I do not suggest investing too much time in optimization at the early stage of a project. The final goal is understanding and modeling the data, and fast iteration matters in a trial-and-error process. It is definitely not worth spending a full day to make a 5-minute function run 10x faster. It's fine to apply some simple general tricks, which I'll talk about later, but don't get carried away.

Make sure you are using the best big-O algorithms

Usually, the most effective speedup does not come from language-specific knowledge. Instead, using the optimal algorithm can easily be 10x or more faster than a badly implemented one, especially if you are dealing with data at scale. If you are not doing cutting-edge research, chances are that for common data analysis jobs there are already optimized packages available, considering how popular Python is in data analysis. But it is good practice to think about your algorithm and its implementation before any further efforts.

Understand the performance problem

Making a 5-minute function run in 10 seconds is terrific, but it may be more helpful to make a 5-second function run in 4 seconds if that function is called many more times. The point is, optimization takes your precious time, so you need to make sure the cost is worth it. Find the part that saves total system runtime, not a function that merely feels slow. Benchmark before you optimize.

Specific Advice

Know your language and tools

Python's grammar is easy to learn, read, and understand. Many say that you can learn Python in hours, and that is true, if you mean the grammar. But the grammar is not all of a language, just as not everyone who speaks a language can be a great writer. Each language has its own specific features, and your proficiency in another language may not transfer well to the one you just learned. For example, if you come from compiled languages, writing for loops is generally not a problem. But with Python, you should probably try to avoid writing for loops.

Let's take a simple example task: calculate the longest side (the hypotenuse) of a right triangle, given the Pythagorean theorem: c = √(a² + b²).

You have a dataframe with two columns and 10k rows. If you come from C/C++, it may seem straightforward to write some kind of loop like this:
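The original code block is not reproduced here; the following is a minimal sketch of the kind of row-by-row loop meant, assuming a dataframe with columns named `a` and `b`:

```python
import numpy as np
import pandas as pd

# hypothetical data: 10k rows holding the two shorter sides a and b
df = pd.DataFrame(np.random.rand(10_000, 2), columns=["a", "b"])

# in a notebook cell, prefix this loop with %%timeit to measure it
c = []
for idx, row in df.iterrows():          # iterate row by row, C-style
    a, b = row["a"], row["b"]           # index into the row Series
    c.append((a ** 2 + b ** 2) ** 0.5)  # the actual arithmetic
```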

`%%timeit` is a magic command available in the IPython environment that estimates the run time of the current cell (block). 10k calculations like this should be instant on modern hardware. If you are not new to Python and pandas, you can spot the problem at first glance, and you can probably even write it another way. Let me explain with more helpful tools, but you should already know that writing naïve for loops is probably not a good idea.

Benchmark tools

Once you have performance problems, the next step is to identify where the problem is. I use a handy package called line_profiler, which can be used from the command line or a Jupyter notebook. Say I have the following function in a notebook, and it takes an unexpectedly long time:
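The article's original function is not shown here; as a stand-in, assume something like the following, where bar1, bar2, and bar3 are placeholder helpers doing the actual work:

```python
import time

def bar1():
    time.sleep(0.5)   # pretend this is expensive work

def bar2():
    time.sleep(0.4)

def bar3():
    time.sleep(0.1)

def foo():
    bar1()
    bar2()
    bar3()
```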

To identify the problem, we need to load the line_profiler extension and use it like:
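In a notebook, that looks roughly like this (`%lprun -f <function> <statement>` profiles the named function line by line while executing the statement):

```python
%load_ext line_profiler
%lprun -f foo foo()
```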

and the result looks like this:
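The original screenshot is not reproduced here; with the stand-in function above, the output has roughly this shape (the numbers are illustrative only):

```
Timer unit: 1e-06 s

Total time: 1.0 s
Function: foo at line 12

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def foo():
    13         1     500000.0 500000.0     50.0      bar1()
    14         1     400000.0 400000.0     40.0      bar2()
    15         1     100000.0 100000.0     10.0      bar3()
```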

From this result, we can see that bar1 and bar2 take more than 80% of the time. The code here is overly simple, but you get the idea. Having the runtime of each line gives a simple and direct clue for digging into the performance problem, and in my experience, the code that takes too long is usually not what you expected. So again, benchmark before you optimize.

How could we use it to find the problems in the code we wrote earlier? Just wrap the code into a function and dig into it using %lprun, and we get a result like this:
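A sketch of that, reusing the hypothetical dataframe from above (the function name is mine; the profiler output screenshot is not reproduced, but its findings are summarized below):

```python
def naive_hypot(df):                        # line 1
    c = []                                  # line 2
    for idx, row in df.iterrows():          # line 3
        a, b = row["a"], row["b"]           # line 4
        c.append((a ** 2 + b ** 2) ** 0.5)  # line 5
    return c

# then, in the notebook:
%lprun -f naive_hypot naive_hypot(df)
```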

Immediately we can find some problems:

  1. iterrows takes the majority of the time
  2. line 4 takes way too much time compared to line 5, which means indexing into a row is probably a problem too

From the pandas reference, we know that iterrows creates an (idx, Series) pair for each row, and creating class instances definitely has a lot of runtime cost. Let's fix it with itertuples instead:
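Something along these lines (again a sketch; itertuples yields lightweight namedtuples instead of Series, so fields are accessed as attributes):

```python
def hypot_itertuples(df):
    c = []
    for row in df.itertuples():                   # namedtuples, much cheaper than Series
        c.append((row.a ** 2 + row.b ** 2) ** 0.5)
    return c
```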

By changing a simple method call we get a 50x speedup! The performance is still not satisfying, but now we have handy tools to diagnose the problem, and we have seen the importance of knowing the language and the packages in use. In fact, in Python this kind of problem is generally solved with vectorization.

Vectorization

For data analysis in Python, vectorization is a nearly free technique that can be really beneficial. People from compiled-language backgrounds may not pay much attention to writing naïve loops, but if you are using Python with numpy/pandas, etc., try not to write for loops, even if it takes a little longer to figure out how. For the example we used earlier, we just write:
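With the same hypothetical dataframe, the whole computation collapses into a single vectorized expression:

```python
import numpy as np

c = np.sqrt(df["a"] ** 2 + df["b"] ** 2)   # operates on whole columns at once
# equivalently: c = np.hypot(df["a"], df["b"])
```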

Try to utilize vectorization even if it does not look like a fit at first; the speedup is very attractive, and this is the way to write data analysis code in Python.

Take advantage of the ecosystem

With python being a very popular language, the ecosystem is booming.

The language itself is only part of the knowledge; Python is famous for the numerous packages available for all kinds of tasks, for example numpy and pandas for data analysis. While these packages are handy, they were not designed for your specific task, and they can be powerful but also complex. Still, you may find it worthwhile to spend more time on a package and learn what it can do, because these packages usually have quite polished and sometimes sophisticated implementations of common problems, using advanced techniques like FFI or the GPU, for example cuDF, a GPU dataframe library.

JIT and compiled extensions

Interpreted languages are slow; compiled languages are more difficult to learn and write. But there is a way to get a compiled language's speed without learning a totally different language: JIT (just-in-time) compilation. I use numba for this purpose.

numba lets you write pure Python, but your functions are automatically compiled to machine code when they get called (or you can explicitly compile them ahead of time). The advantage of numba is that you do not need to learn a special syntax; you just write pure Python code and it works. Of course, this level of automation comes with a cost: not all code can be compiled to machine code, and you have to live with some limitations if you do not put in extra work. However, I find its performance very attractive, and once you know the rules you'll find that your hands are not really tied.
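A minimal sketch of what that looks like (the function and data here are hypothetical; @njit is numba's no-Python-mode decorator):

```python
import numpy as np
from numba import njit

@njit
def hypot_sum(a, b):
    # explicit loops are fine here: numba compiles them to machine code
    total = 0.0
    for i in range(a.shape[0]):
        total += (a[i] ** 2 + b[i] ** 2) ** 0.5
    return total

a = np.random.rand(10_000)
b = np.random.rand(10_000)
hypot_sum(a, b)   # first call triggers compilation; later calls run at native speed
```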

There is another package, Cython, that serves a similar purpose and is used inside packages like numpy and pandas. The advantage of Cython is that it requires you to compile ahead of time, so it will not introduce unpredictable latency in your running environment. However, you need to learn syntax that differs from Python, which is not really hard but does mean additional work. I prefer numba for this reason, and if it does not satisfy my needs, I go straight to a purely compiled language, as described below.

Go to compiled languages

Python can call C-ABI compatible libraries. There is an unavoidable runtime cost to calling these functions, but with careful design you can limit the number of calls and push the heavy work into the extension. If you have a background in languages like C, C++, or Rust, you'll find this very handy.

Personally, I'm interested in Rust. There is a crate (a crate is like a package on PyPI) called pyo3 that I find very handy. Most of the time I write an algorithm collection as a Rust library, with some extra work to expose selected APIs to Python, which are just Rust functions with some annotations. Because Rust is still fairly new, and so is the pyo3 package, I would not suggest using this solution in production. I'm quite concerned with some design problems, like #1056 unexpected memory behavior. The package is growing and being developed actively, but without really refined foundations it is far from production-ready.

For now, I think the go-to choice is still C/C++, with total control over runtime cost and rich APIs available.

And beyond

Things get wild when you go down to low-level languages; with a language like C you can apply basically any trick you want, including SIMD, cache-friendly loads, parallelism, and more. If you are interested in these low-level optimization techniques, I found a fantastic lecture note that takes a 99-second C++ function and optimizes it all the way down to 0.99 seconds (a 100x speedup) in 7 steps, introducing skills like parallel programming, SIMD, register reuse, etc. Trust me, it's fantastic!

Conclusion

Python is a very popular and productive language, but it can be slow if you do not pay attention. Here are some guidelines I find useful for writing fast Python.

  1. Benchmark before you optimize; make sure the extra effort pays off.
  2. Use good algorithms, and make sure they are good in big-O terms.
  3. Know your tools: language features, package functionality, and the ecosystem, not just the grammar.

Some specific tools and tricks:

  1. line_profiler is a handy package for identifying performance problems.
  2. Try vectorization, and then try again harder.
  3. Find well-optimized packages for your problem.
  4. Try numba or Cython.
  5. Implement the hot path in a low-level language and call it from Python.
  6. Play with low-level languages and reach whatever performance target you need.
