Data Science & Extreme Programming Explained
By Ian
A new event in the Pivotal London office this week was our first lunchtime book club meeting. The first book suggested for discussion was Kent Beck and Cynthia Andres' Extreme Programming (XP) Explained (2nd Edition).
This is a classic of the agile programming community and Kent Beck’s shorter first edition (1999) can lay claim to being one of the first books about agile programming practices. In this long post I’m going to start discussing the message of the book, and how I think some of the ideas apply in the world of data science. [Update: The next parts of this post about principles and practices are now up.]
For the developers in Pivotal Labs, the agile development team at Pivotal, the content of this book should be very familiar. For the last 20+ years Pivotal Labs have been putting into practice many of the ideas described by Beck and in that time have had remarkable success helping development teams in organisations large and small to change their processes.
I’m in the data science team at Pivotal and we’ve been trying to learn from the experience of Pivotal Labs to help us improve how we deliver engagements with clients. Discussing this with the wider data science community I’ve been told that agile methodology is variously essential for data science, compatible with standard practices or even too rigid for data science projects. So I came to the book with a purpose: to assess the ideas provided and see which I would find beneficial in my day-to-day data science work.
In this post I want to concentrate on the first few chapters of XP Explained which describe the philosophy of this approach and lay the groundwork for the more specific technical suggestions later on.
The ideas in XP Explained are laid out in three levels: Values, Principles and Practices, where each of the levels is more practical than the last. Beck and Andres clearly explain why this is the case: by agreeing core foundational values, any decisions about practices, e.g. whether to hold standup meetings and for how long, can be made in a common agreed context.
The Values laid out are: Communication, Simplicity, Feedback, Courage and Respect.
The authors mention how other values like security or predictability could have been chosen, but I think picking a short list of the five most important values allows the list to be easily memorised and cements the message. With these five values in mind, other decisions and processes are given context and can be questioned when they don’t seem to embody these values.
As a data scientist a lot of these values are already central to my way of thinking about my role. Whether my customer is internal or external I think these values provide a great basis for making a project a success. Here is my take on how these values translate to a data science context.
Communication
Communication is perhaps even more important for data scientists than for software developers because we tend to engage with a wider range of audiences. A data scientist can often find themselves going from a deep technical discussion with a developer or DBA to giving a high-level overview of a project to the business owner. Data scientists who aren’t comfortable openly discussing their progress, methods and results are not providing as much business value as they could.
Simplicity
If you’ve ever read a machine learning textbook, simplicity may not seem like the standard mode for data science results. Often, however, we sacrifice some accuracy or performance in pursuit of a model that will be easily explainable and therefore more widely trusted and relied on in an organisation.
The data scientist who always creates the most complex and un-interpretable model, a black box for the customer, may ensure continuing service contracts, but is failing their customer in the process. Beck and Andres address the common complaint about too much simplicity by phrasing this value as the following question: “What is the simplest thing that could possibly work?” Clearly your model needs to work in order to be successful, but there is no need for unnecessary complexity which reduces understanding and maintainability.
Feedback
In a technical sense, feedback lies at the heart of most machine learning algorithms, but for personal and business relationships it is even more crucial. Ensuring honest and timely feedback requires the next value, courage, but also requires a willingness on the listener’s part to make changes. If you think your techniques and processes “have always worked”, you may miss the underlying change that is rendering them obsolete.
In a consulting environment there is a tendency to go off and do work for a while before presenting what you think are great results at the end. The client may of course have a different view. Feedback at this point is pretty useless, at least for this engagement, where time has already run out. How much better is it to continually get feedback on whether the direction is right and the results are in the form expected? This is often built into projects via a weekly update meeting or call. The client gets to hear about the progress, assess it and provide feedback during this hour or two.
Even this may not be enough though. Creating shorter and shorter feedback cycles is central to reducing the difference between the actual execution and the customer’s expectation. When the feedback cycle is shortened to hours or minutes there is much less chance of heading far down the wrong path. How to achieve these shorter cycles in a real environment is the concern of the principles and practices levels described above.
Courage
I’ve mentioned courage already in the context of feedback, where it is certainly necessary to both give and receive honest criticism. The authors of XP Explained point out that courage could be a detrimental value, encouraging marching straight into danger on every occasion. They argue however that with the other “counterbalancing values” this danger is reduced and courage becomes a powerful necessity.
As a data scientist, courage is often needed, especially when delivering bad news, i.e. the “your baby is ugly” problem. Organisations may not like the results of your analysis or the implications for their business. There may be pressure to ‘adjust’ or ‘soften’ the results to make them more palatable or less incendiary. I would argue that it takes courage to face down these requests when they are inappropriate, especially if your livelihood directly or indirectly depends on the person making the request.
Respect
Beck and Andres describe respect as the value that “lies below the surface of the other four”. Without respect for your teammates and your customers it is very difficult for a project to be a success.
Data scientists have sometimes had a reputation for arrogance and in certain cases it may be deserved. Often there can be a certain aloofness between data scientists and other team members, who may feel like they are being presented with “The Analysis” that cannot be questioned without a PhD in astrophysics and a supercomputer.
In this case something has gone wrong with the respect and communication between team members. Perhaps the data scientist should communicate more to help the team understand the work they’ve done, and try to learn more about the work done by the other team members.
Summary
So far we’ve looked at the values that Beck and Andres prioritise in XP Explained. In my opinion these values are as valuable in data science as in software development. In some ways this wide applicability is a great strength of XP Explained, which has also been described to me as the “best people management book around”.
In follow-up posts I go deeper into the principles and practices that XP Explained champions, how these might apply to data science and my experience with trying to put some of this into practice.