Open source software for data scientists


If your organisation doesn’t buy it for you, analytical software can be very expensive. One way to get around this is to use open source software. This post provides an overview of the benefits and shortcomings of the two main open source data analysis programmes – R and Python. It also suggests another contender – Julia – which, though still in development, looks very promising.

Anyone who has been a student will be well aware of the benefits of being a paid-up member of an academic institution:

  • office space,
  • access to journal articles, and
  • cheap software.

Now that I have flown my academic nest, I find myself in the position where I must find alternatives. Co-working spaces have substituted for university offices, academic friends and authors help out with pay-walled papers (and there is now Sci-Hub for those inclined to break the rules), and there are various options for open source alternatives to data analysis software.

Whilst a PhD student, I tended to use the most powerful stats software that was widely available, namely Stata and Mplus. Unfortunately, now that I am unaffiliated with a uni, these are very expensive. For example, the general purpose version of Stata (currently, Stata SE v14) retails for £1,060 for a single business user licence; the full version of Mplus (Mplus w/combination v7.4) retails at $1,095.

These costs are not only prohibitive for me personally, but are also a problem for my clients. Preferably, I want to share my work with my clients in full, so that they can see what I have done and re-run or edit analyses if they wish. Together, these considerations have pushed me to examine what open source options are available. To be most useful, open source data analysis software should be:

  • widely used,
  • powerful and fast,
  • able to undertake many different types of analysis, and
  • easy to learn, with a good support network.

Currently, the main contenders seem to be R and Python. R (along with its commonly used IDE RStudio) is the original open source stats programme. It has been around for a very long time. As a result, it’s very reliable and has a huge user base. This has led to a large number of packages being developed for it, enabling all sorts of analyses. There are loads of online tutorials and books available to help you learn, and a large forum community to help answer tricky questions. However, R is frequently criticised for being hard to get into: it is known for its steep learning curve and not-so-helpful ‘help’ files. Also, its language is not written in the same way as a typical programming language, so people with programming experience tend to find the syntax a bit odd.

Python contrasts particularly strongly with R here – it is a fully fledged high-level programming language. It follows the general rules and syntax that are common in programming languages. It is also very stable and has a huge user base. There are many tutorials and places to get help and support. Crucially, it is known to be easy to pick up. However, being a general-purpose programming language, it was not designed primarily for statistical and data analysis tasks. For general analysis, this isn’t a problem, as there is a wide variety of custom packages available in Python (see StatsModels, for example). However, some commonly used advanced statistical approaches are not (yet) well-supported in Python – things like multiple imputation, multilevel modelling and structural equation modelling.

There are, of course, lots of other programmes out there that could also be mentioned. One which I think is worth including here is a relatively new kid on the block: Julia, which is causing quite a stir amongst scientific analysts. Julia is a high-level programming language, but also includes the ability to programme at a low level (i.e. closer to the language of the computer). This allows the software to potentially run processes much faster than R and Python. It also supports parallel processing (splitting work across several processors, which is loosely the idea behind Hadoop), and it compiles code the first time it is run so that later runs go at full speed. These features mean that it holds the promise of far superior performance to R and Python, particularly on intensive processing tasks like machine learning. Currently, it is still in development (not even at version 1.0 yet). As a result, the user base is relatively small and there are not many learning materials or packages. However, Julia is meant to be pretty easy to pick up, and if a sizeable community develops, it could take a big chunk out of the R and Python user base.

Ultimately, I think any data scientist looking to use open source software would do best to have a working knowledge of both R and Python – they are good for different things. R is great for one-off statistical research projects, whereas Python is great for analytical tasks which need to process data on an ongoing basis (such as with real-time IT systems). At the moment, Julia is a curiosity, but one which I am keen to keep an eye on and play with. If I have a seriously resource-intensive analytical task that needs more performance, I may give it a go in Julia.