Data Science & XP Explained: Principles
By Ian
The values described in Extreme Programming Explained are the foundation on which principles and practices are built. Last time we started to talk about those values through the lens of data science.
Today I’m going to describe some of the principles outlined by Beck and Andres, again asking whether they are applicable to data science, and whether data science could benefit from their application.
For Beck and Andres, values are “too abstract to directly guide behavior”. Indeed it would be hard to disagree with the content of the values discussed previously, although you could argue about their relative importance. The principles built on top of these values communicate intentions and expected behaviour more concretely. [Update: new post on the next layer: practices]
Beck and Andres have a list of 14 principles: humanity, economics, mutual benefit, self-similarity, improvement, diversity, reflection, flow, opportunity, redundancy, failure, quality, baby steps and accepted responsibility.
I’m not going to focus on every one of these but will pick a few to talk about in detail while mentioning the others in passing.
Humanity
The first principle listed in XP Explained is humanity. For Beck and Andres this comes down to the simple fact that “people develop software”. Acknowledging that fact leads to a discussion of how to get the most out of people, and the authors offer this list of needs: basic safety, accomplishment, belonging, growth and intimacy. The practices of XP seek to meet these needs and, by limiting working hours, give individuals space to meet other needs like relaxation, exercise and socialising. All of this applies equally to data scientists in the workplace.
An aspect I want to highlight for data scientists is acknowledging, where appropriate, the humanity at the origin of the data. It is sometimes hard to remember that a row of network logs or credit card transactions represents the actions of a real human being. Keeping this in mind helps when facing morally loaded questions about what should or should not be done with the data. Is it appropriate that we combine these datasets in this way, and have we taken enough care in anonymising the data?
There have been a number of high-profile incidents in the data science field recently that highlighted ethically questionable behaviour. One of the Pivotal Data Science team’s predictions for 2015 is that we will see many more incidents of this type. If we consider the humanity at the origin of the datasets we work with, we might be better placed to avoid these ethical breaches.
Mutual Benefit
In XP Explained, “every activity should benefit all concerned.” This informs many of the other principles and practices and is often the distinguishing factor between two solutions. Beck and Andres use the perhaps controversial example of internal documentation for software, arguing that excessive documentation does not benefit the developer; the mutually beneficial solution is to write automated tests, refactor out complexity, and choose coherent names and metaphors that make the code clearer to anyone reading it for the first time. For data scientists, making your code and models usable is clearly important, especially in a consulting environment where you will probably not be around to implement and iterate further on the model.
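To make that concrete, here is a minimal sketch of how an automated test can act as living documentation of a data-preparation step. The cleaning function, its behaviour and the pytest-style test are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical cleaning step plus a pytest-style test that documents
# what the step is expected to do for the next person reading the code.
import pandas as pd

def drop_negative_amounts(transactions: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose 'amount' column is negative."""
    return transactions[transactions["amount"] >= 0].reset_index(drop=True)

def test_drop_negative_amounts():
    raw = pd.DataFrame({"amount": [10.0, -5.0, 3.5]})
    cleaned = drop_negative_amounts(raw)
    assert list(cleaned["amount"]) == [10.0, 3.5]
```

A handful of small tests like this costs the author little and tells a future maintainer exactly what the code promises, which is the kind of mutual benefit Beck and Andres describe.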
In the spirit of the humanity point above, perhaps data scientists can expand the definition of mutual benefit to include the benefit of the person who provides the data, or the population they come from. Does your recommendation system really provide better suggestions to your customers, or are you just pushing poorly selling products harder? Does a normal consumer benefit enough from the fraud detection algorithm to understand the need for deep inspection of all their credit card transactions?
The recently formed Data Science Association has a Code of Conduct which goes some way towards outlining expectations of this kind for data scientists:
A data scientist shall use reasonable diligence when designing, creating and implementing machine learning systems to avoid harm.
Consider the high-profile example of Facebook’s mood manipulation experiment: the people in the negatively influenced group clearly did not benefit from that activity. In this light the argument for proper informed consent becomes stronger.
Improvement
For Beck and Andres, “‘perfect’ is a verb, not an adjective.” They describe how one should not strive to get the perfect result first time around, but rather start somewhere and continue to improve step by step. For data scientists this should make sense: nobody starts with the perfect model or results. Instead we start with a simple but perhaps inaccurate model and gradually add features to improve its accuracy. If we keep the value of simplicity in mind, we will avoid burdening the model with unnecessary complexity.
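As a minimal sketch of this step-by-step approach, the snippet below starts from a single-feature model and only keeps an added feature if it measurably improves cross-validated accuracy. The synthetic data, feature indices and scikit-learn-style workflow are illustrative assumptions rather than a prescribed method:

```python
# Step-by-step model improvement: begin simple, add one thing at a time,
# and keep a change only if it demonstrably helps.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # stand-in for real features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # stand-in for a real label

selected = [0]                                 # start with one simple feature
baseline = cross_val_score(LogisticRegression(), X[:, selected], y, cv=5).mean()

# Add one candidate feature at a time; keep it only if accuracy improves.
for candidate in range(1, X.shape[1]):
    trial = selected + [candidate]
    score = cross_val_score(LogisticRegression(), X[:, trial], y, cv=5).mean()
    if score > baseline:
        selected, baseline = trial, score

print(f"kept features {selected}, cross-validated accuracy {baseline:.3f}")
```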
Iterative improvement does not just apply to model building, however. Every aspect of your processes, customer interactions and work environment should be subject to continual improvement. I find this principle quite freeing, as it makes clear that you shouldn’t worry about starting from a bad position as long as you decide to head in the right direction.
Often the personality types drawn to data science exhibit some degree of perfectionism, whether it manifests as continually tweaking code to eke out a little more performance or obsessing over image placement on slide decks. By focusing on noticeable improvements rather than final perfection, you may find that these activities are counter-productive, holding you back from moving to the next step.
Diversity
Diversity as defined in XP Explained is the need to “bring together a variety of skills, attitudes, and perspectives”. Teams work better when a range of viewpoints can be brought to bear on problems.
Technology companies often have difficulty in bringing together diverse teams. There is some evidence that data science and statistics have done better on gender equality than other technology fields, but the picture is mixed.
Beyond demographic diversity, data science teams also benefit when team members have different academic and technical backgrounds. At Pivotal we have data scientists with backgrounds including biology, physics, computer science and online media, and we often find that techniques and solutions from one field are extremely useful when applied to a different one.
Reflection
In my opinion the principle of reflection is perhaps the most important of all, and it is a key component of many of the others. Understanding why something worked, or why it didn’t, is necessary for improvement, provides an opportunity to learn, and is central to accepted responsibility.
Being able to reflect honestly about yourself and with your team is difficult and can take a lot of courage. Giving team members a safe environment is essential so that they can talk about their mistakes and those of others without fear of retribution. In a consulting situation this will also involve being open with your customers about your failings and those of your product/process.
The goal of reflection is to enable improvement by exposing mistakes and learning from them. As Beck and Andres say, “no one stumbles into excellence.”
There is nothing unique about how data scientists should approach reflection, but in my experience it is easy for high-achieving data scientists with academic backgrounds to be less reflective than they should be, resting somewhat on the laurels of past achievements.
Quality
Nobody starts out wanting to produce poor-quality work. Unfortunately, the constraints you work under may make that the only option. Calling out quality as a principle means that other decisions must be taken without compromising the quality of the work delivered. In XP Explained the authors are clear that “sacrificing quality is not effective as a means of control.”
With internal or external customers, projects are normally constrained in terms of the time allowed and the overall cost. If quality is not to be sacrificed, then the scope of the project must be flexible. For XP this means that some requested software features will not be implemented. This feeds into the practice of clearly defining requested features as stories and making the product owner decide on their prioritisation. Without compromising on quality, you can work on the highest-priority stories within the available time and money.
For data science this prioritisation might not seem as simple. Data cleaning often takes a lot of time (80% being the apocryphal estimate), and nobody would be satisfied to “finish” a project having worked through most data cleaning steps but never having built a model.
Summary
We’ve continued to explore the lessons that data science can learn from the classic “XP Explained”. While I have not gone into every principle described in the book, I hope those covered above are a good selection of particular importance for data scientists. Much more could be written about each of these topics. In the next post I finally get to the practices that XP Explained uses to embody these principles and the values described previously. How do these fit with current data science working practices?