Choosing a Language for Bioinformatics

May 5, 2019


But why?

Bioinformatics is an interesting field when it comes to language choices. On the one hand, you are writing a lot of cheap scripts to get your data from point A to point B in a Unix-like environment. On the other hand, you are likely sending a lot of data through those cheap scripts.

The time-honored tradition has been to write your cheap scripts in a scripting language, usually Perl or Python, and then identify the places that are doing the heavy lifting, or tools that are ubiquitous / used repeatedly, and rewrite them in something faster, usually C or C++. This has worked well, and likely will continue to work well into the future. But sometimes…sometimes you really don’t want to write C/C++. It’s really fun for the first few hours…maybe even the first few days, but then reality sets in and the enormity of what you’ve undertaken weighs heavily upon you as you repeatedly find bugs in hand-rolled functions that would have been part of the standard lib in your daily driver language. You gnash your teeth. You decry the gods of programming. You despair.

But wait! We are in the 21st century, we have options! My personal criteria have evolved over time, but currently stand as:

  1. Scriptability - How easy is it to write a quick and dirty script to act as a filter?
  2. Scalability - Can it grow from a script into a performance critical piece of software?
  3. IO - Is there something inherently slow about how the language deals with IO? How easy is it to perform IO tasks?
  4. String Ops - Are the builtins for doing string manipulations efficient, or do they just look pretty? How feature rich are the string ops?
  5. Lightweight Modeling - Can we cheaply model our problem via types, classes, or builtin data structures?
  6. Array and String allocations - Does the compiler / interpreter optimize allocations?
  7. (Not yet done) C binding - How easy is it to use a C library?
  8. (Not yet done) C interop - Is there a large cost to interacting with C libraries?

These features are not things that normally fall into other people’s benchmark categories.

The programs used in most of these benchmarks come down to numeric problems at their core. Benchmarks that revolve around IO and string ops are always going to be dubious to some degree. What I have tried to do is write a short script in each language that demonstrates a common task that I perform every day: parsing a TSV.

Pragmatic but Dubious Benchmarks

Repo

I chose 4 languages based on the benchmarks listed above, as well as personal bias:

In each language I wrote the same script, adjusting it to fit what would be ‘idiomatic’ for the language, but not optimizing outside of that. The Python script is as follows:

As you can see, this follows the pattern of read a line, process the line, save the result, then do some calculations on the result. I am covering all of my above criteria, minus the C interop bits, and to some degree the question of scaling. You can check out how the input dataset was generated and the programs run/compiled here.
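For readers without the repo open, here is a minimal sketch of that read / process / save / calculate pattern in Python. It is not the benchmark script itself: the two-column layout (a name and a numeric value), the `Record` type, and the function names are all assumptions made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    """Lightweight model of one TSV row (hypothetical two-column layout)."""
    name: str
    value: float


def parse_tsv(lines: Iterable[str]) -> List[Record]:
    """Read each line, process it, and save the result in a list."""
    records = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and '#'-prefixed header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        # A small string op plus a numeric conversion per row.
        records.append(Record(name=row[0].upper(), value=float(row[1])))
    return records


def summarize(records: List[Record]) -> Dict[str, float]:
    """Do some calculations on the saved results."""
    values = [r.value for r in records]
    mean = sum(values) / len(values) if values else 0.0
    return {"count": len(values), "mean": mean}


if __name__ == "__main__":
    # Stand-in for a real file; swap in open("data.tsv") for actual input.
    sample = io.StringIO("#name\tvalue\ngeneA\t1.5\ngeneB\t2.5\n")
    print(summarize(parse_tsv(sample)))
```

Even a toy like this touches most of the criteria: line-oriented IO, string splitting and manipulation, a cheap record type, and a list allocation per run.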

Without further ado, here are the results:

*In ugly form so that they update.

Nim is blazingly fast! I’d toyed with Nim in the past, but never took it too seriously. Nim compiles to C, and has the potential of being as fast as C depending on how low down you want to go. The beauty of it is that on the surface it is as high level as Python. I’m not going to give a whole intro to Nim because others have already done that better than I could.

What I love the most about Nim, outside of its speed, is the pragmatism of its community and core devs. As I’ve been digging through the Nim forum, Reddit, and their GitHub issues, I’ve been impressed by the emphasis on keeping the language focused on enabling people to solve problems.

Overall I’ve been won over and will be using Nim as my daily driver for a while to learn it better. Specifically, I want to toy with its interop with C/C++ libs, which looks very straightforward, and to see how low the language can go when optimizing for speed. There already seems to be some momentum for Nim in the land of bioinformatics.

But wait…

But where are Rust and Go?

Your implementation of [language] is terrible / you don’t have my favorite language.

Correct Me!