DataDive toolbelt
By Ian
What tools do you need to bring to a DataDive? The next DataKind UK DataDive is taking place in two weeks' time in London. I took part in one of the previous DataDives and I would highly recommend the experience for anyone with data science or analytical skills who wants to help charities use their data.
The DataDives take place over the course of a weekend and in that time you have to decide on a charity to work with, understand their data and goals, perform your analysis and present your results in a usable form. That’s a lot to get through in just over two days so it’s very important to be able to get up and running quickly with the analysis. I thought it might be useful to list the software and tools that I will be packing in my DataDive toolbelt this time around.
Caveat: All of these are personal preferences and there are many other choices I could have made. I have a Mac, so these choices are also somewhat OSX-specific. Feel free to list the contents of your own toolbelt in the comments!
Base system
- Python If you are a data scientist you probably have a favourite in the R vs Python debate. My preference happens to be Python and the PyData stack and the packages below reflect that. If you are more of an R devotee, there are direct equivalents for most of these. I would recommend using the Anaconda Python distribution, especially on Windows or OSX.
- iTerm2 The range of options and customisation possible in this OSX terminal app is very impressive. Another really handy feature is the clipboard history.
- Git & GitHub Version control is important, and sharing the results of your work is easier if everyone uses a central repository. For this DataDive the organisers have set up a GitHub organisation, so polish up those Git skills if you don't use it often.
- Pandas Building on the lower level numerical capabilities of NumPy and SciPy, the most effective data analysis package for Python is Pandas, created by Wes McKinney. It provides many extremely useful ways to ingest, transform and output datasets and is getting better and faster day by day.
- CSVkit This extremely useful set of command line tools helps you to easily get a handle on the contents of CSV files and slice, dice and aggregate them in numerous ways. Really handy as the first step in a data cleaning process.
- IPython Notebook For easy recording of your analysis steps and results using Python and other languages, the clear choice is an IPython Notebook. The most recent update has added interactive widgets, which enable much better exploration of data and a simple way to create interactive results.
- scikit-learn Machine learning toolkit of choice in Python at the moment due to its breadth of algorithms and extremely elegant API. Pipelining means you can .fit() and .transform() your way through a multi-stage machine-learning process with ease.
- Matplotlib In this age of interactive JavaScript visualisations Matplotlib still has a place as the main tool for exploratory data visualisation. There has been lots of work recently to provide more visually appealing graphs using the Matplotlib engine, including seaborn and prettyplotlib. If you need D3 graphics but only know Matplotlib, check out Jake VanderPlas' mpld3 package.
- yhat's ggplot I wanted to especially mention this port of R's wildly popular ggplot2 graphics package to the Python ecosystem. One of the main reasons given for R fans not trying Python is lack of something like ggplot2, so yhat decided to make a very faithful port using Matplotlib as the backend. The syntax isn't very Pythonic but the results can fool even some veteran R users.
- Flask There are many web frameworks out there, but Flask is a simple Python framework that allows you to get up and running quickly.
- Cloud Foundry If you create a webapp as part of your analysis it would be great to have it publicly available (if that's possible). I use the Cloud Foundry platform for this and one publicly hosted CF instance is Pivotal Web Services [Disclaimer: I work for Pivotal who run this service]. Making your webapp available is as simple as "cf push". Other options include Heroku and DigitalOcean.
- D3 These days people expect a lot more from visualisations than simply a static line graph. Mike Bostock's D3 (Data-Driven Documents) has been at the centre of the interactive visualisation movement on the web. It's somewhat difficult to get started with, but there are a lot of packages that help with this, including NVD3 & the previously mentioned mpld3.
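
To give a flavour of how Pandas and scikit-learn fit together, here is a minimal sketch of the pipelining idea mentioned above. The dataset, the column names (visits, donations, regular_donor) and the choice of scaler and model are all invented for illustration; they are assumptions of mine, not something taken from a real DataDive.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data, purely illustrative: two numeric features and a binary label.
df = pd.DataFrame({
    "visits": [1, 2, 3, 4, 10, 11, 12, 13],
    "donations": [0.0, 5.0, 2.0, 1.0, 50.0, 40.0, 60.0, 55.0],
    "regular_donor": [0, 0, 0, 0, 1, 1, 1, 1],
})

X = df[["visits", "donations"]]
y = df["regular_donor"]

# Each step is a (name, estimator) pair. Calling fit() on the pipeline
# fits and applies the scaler, then fits the classifier on the scaled data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# predict() pushes new data through the same scaling step before classifying.
predictions = pipeline.predict(X)
accuracy = pipeline.score(X, y)
```

In a real analysis you would load the charity's data with something like pd.read_csv() and evaluate on held-out rows rather than on the training data, but the shape of the code stays much the same.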