By Josh Hemann
I got my start in programming with C++ (not counting hacking DOS video games like Duke Nukem in a hex editor). One summer in college I had an undergraduate grant with a professor who had me write a neural network algorithm using backpropagation. The next summer this same professor had me recode it all in Java, as it was the hot new language. So from the start, my programming needs have been in the context of analytics, often exploratory in nature, which is why I quickly gravitated from compiled languages like C++ and Java (statically compiled) towards dynamic languages like MATLAB, R and Python (typically interpreted, or compiled only at run time).
The trade-off in using dynamic languages over compiled ones is about speed. Dynamic languages better enable the exploratory programming needed when tackling new analytical problems, which leads to faster development of solutions. The cost of this flexibility is that compilers do not have enough information about the code and data to optimize for run-time execution speed. In many settings, spending more computer time to save human time is a good trade-off, but of course sometimes execution speed is critical, even in exploratory work (e.g., it is harder to iteratively refine a modeling approach when code takes many hours to run). Wouldn't it be great if there were a way to have the flexibility and expressiveness afforded by dynamic languages but with more of the execution performance of compiled languages?
This "wouldn't it be great if" wish has been around for a while, but three events centering on the LLVM compiler over the past year bring it a lot closer to reality:
- NVIDIA moving from an in-house compiler for their GPUs to LLVM
- The emergence of and excitement around the Julia language for technical computing
- The emergence of and excitement around the numba project for compiling Python code via LLVM
While these seem like an unrelated mix of events, the common thread is that important tools in the modern technical computing stack are moving towards using LLVM. So, I figured it was about time I became more familiar with the topic, and the counterpart to this post shows some of my recent forays (I also link to all of the code and impressive speed results at the end).
Why LLVM matters
Writing this post certainly pushed me past my comfort zone. I have spent most of my focus over the years on applying analytics to business problems and not on lower-level computing issues like compilers and memory management. But, as in many areas of computing these days, more data, more complex questions, and expectations of real-time results mean that issues of the past have come back to the fore. For example, mobile developers have to deal with limited screen space and be hyper-focused on keeping memory consumption down, just as people doing any kind of computing had to in the 1980s. It basically means we have to heed the advice of Peter Norvig to remember that there is a "computer" in "computer science". To do even exploratory analytical work nowadays means having to develop and maintain some level of maturity around technical computing. And then of course, there is the rest of being a data scientist, like keeping up with the evolutions in analytic methods, software implementations of said methods, domain knowledge for usefully applying said methods, techniques for visualizing results, and best practices for conveying these methods and their results to technical and non-technical audiences. Sigh...
As overwhelming as this collection of issues feels sometimes, it motivates my excitement about projects like LLVM and numba: they provide a single, consistent abstraction layer on which I can maintain reasonably high-performance code in languages that are very flexible and efficient for me to develop in. I can target execution against a single core on a CPU, multiple cores, or even completely different hardware like NVIDIA's GPUs that have thousands of cores, all through LLVM and numba. If I were to continue developing this matrix factorization algorithm for a recommendation system, I would now have a way of iteratively testing approaches on much larger, more realistic data sets rather than waiting nearly 3 minutes every time I tested against a toy data set. This means I would be more likely to actually explore and evaluate solutions, and that is what is most exciting.
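To make the idea concrete, here is a minimal sketch of the kind of loop-heavy code that tends to benefit from compilation through LLVM. It is not the code from the companion post: the function name `factorize`, the helper `observed_rmse`, the tiny ratings matrix, and the hyperparameters are all illustrative assumptions. It also assumes a recent numba release that provides the `njit` decorator, which differs from the decorators available when this post was written. If numba is not installed, the sketch falls back to plain (slow) Python so it still runs.

```python
import numpy as np

try:
    from numba import njit  # assumption: a recent numba release is installed
except ImportError:
    # Fallback so the sketch still runs, uncompiled, without numba.
    def njit(func):
        return func


@njit
def factorize(R, P, Q, steps, lr, reg):
    """Plain-loop SGD matrix factorization: approximate R with P @ Q.T."""
    n_users, n_items = R.shape
    n_factors = P.shape[1]
    for _ in range(steps):
        for u in range(n_users):
            for i in range(n_items):
                if R[u, i] > 0:  # zeros are treated as missing ratings
                    pred = 0.0
                    for k in range(n_factors):
                        pred += P[u, k] * Q[i, k]
                    err = R[u, i] - pred
                    for k in range(n_factors):
                        p_uk = P[u, k]  # keep the old value for Q's update
                        P[u, k] += lr * (err * Q[i, k] - reg * p_uk)
                        Q[i, k] += lr * (err * p_uk - reg * Q[i, k])
    return P, Q


def observed_rmse(R, P, Q):
    """Root-mean-square error over the observed (non-zero) ratings only."""
    mask = R > 0
    return float(np.sqrt(np.mean((R - P @ Q.T)[mask] ** 2)))


# Toy ratings matrix (made up for illustration): rows are users, columns items.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])

rng = np.random.default_rng(0)
P0 = rng.random((4, 2))  # user factors
Q0 = rng.random((4, 2))  # item factors

rmse_before = observed_rmse(R, P0, Q0)
P, Q = factorize(R, P0.copy(), Q0.copy(), 2000, 0.01, 0.02)
rmse_after = observed_rmse(R, P, Q)
print(rmse_before, rmse_after)
```

The nested loops are written out explicitly on purpose: this is the style that is slow in interpreted Python and that a compiler like numba can translate to machine code. How much faster it runs depends on the data size and hardware, so any speedup should be measured rather than assumed.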
Acknowledgements and More Reading
- The community of folks on the numba users discussion group on Google Groups was very helpful in working through this example code and assessing behavior. Thanks also go to the many people who develop projects like LLVM, llvm-py, and numba in the first place
- An example of me using LLVM for matrix factorization used in a recommendation engine
- A GitHub Gist containing my code, as well as the complete post as an IPython Notebook (which can be viewed here)
- Travis Oliphant gave a talk this year at PyCon covering the vision for numba and LLVM (slides)
- A great tutorial on LLVM