Data Science & XP Explained: Practices
By Ian
In the previous two posts in this series we explored the values and principles outlined in Extreme Programming Explained by Kent Beck and Cynthia Andres (XP Explained). In this post we will discuss perhaps the most contentious of the three layers: practices.
What relevance do XP practices have for data scientists? What are the benefits and disadvantages of these practices? Are there any that we should modify for a data science environment?
The practices listed by Beck and Andres are broken down into 13 primary and 11 corollary ones. I won’t describe all of them here as I want to focus on those I think have the most impact and perhaps require the most discussion. I’m also going to add another practice, retrospectives, that is frequently used but not explicitly mentioned in XP Explained.
Sit Together
This is the first practice listed in XP Explained and the importance is clear when you think of the benefits of sitting together as a team for communication, cohesiveness and feedback.
Beck and Andres advocate for open plan workspaces, which we use here at Pivotal. I don’t find it too disturbing to work in this environment but I know others have different experiences and prefer closed offices or smaller workspaces.
The layout is often more important than the degree of openness. I recently worked with a team whose members were scattered as individuals across a whole open plan floor. They were really excited that I could easily turn to a colleague beside me to ask a question and that discussion was actively encouraged rather than subdued. In our office there is a hubbub of background noise from pairs working together. This makes people feel free to start conversations naturally instead of feeling like a disruptive force in a hushed library environment.
Collaboration within your data science team and beyond is easier if people sit together and feel comfortable enough to start a conversation and ask questions.
Energized Work
Work only as many hours as you can be productive and only as many hours as you can sustain.
This is the essence of Beck and Andres’s argument for energized, productive but limited working hours. Long hours can seem like a solution when you fall behind or deadlines are approaching, but there is no doubt that quality suffers and this kind of pace is not sustainable in the long term.
Long nights of studying and thesis writing in university have taught me bad habits in this regard. Leaving work until rapidly approaching deadlines force me into action can be energizing but is not sustainable.
The other principles and practices of XP work to reduce the possibilities for this kind of ‘crunch mode’. When pairing, procrastination is reduced, and any late night coding sessions immediately affect someone else as well. With continual feedback and customer involvement there is no way to hide a lack of progress and save the day with some late night heroics without them knowing.
Data science is not immune to this kind of last minute rush but, as with programming, the concentration needed for good analysis cannot be maintained if this is a common occurrence.
Pair Programming
This is the big one. Pair programming has been a contentious topic since it was first described and even today there are regular arguments about its effectiveness and suitability for software development.
To cut to the chase: I believe that data scientists benefit immensely from pairing.
The basic idea of pairing is that two people sit side by side working together on the same problem. A standard workstation setup here at Pivotal is one machine driving two monitors, keyboards & mice. One of the pair is the driver: they are in control, use their keyboard and mouse, and focus on the immediate code that they are entering. The other is the copilot, actively checking what the first is doing, pointing out typos and mistakes but also thinking about the broader picture: how the current task fits more widely and what to do next.
The roles are switched throughout the day and pairs are changed within the wider team as frequently as comfortable; some teams change every day, some on longer (each week) or shorter (every hour) scales.
The pair should be having a conversation throughout, explaining the choices made, asking questions, discussing merits of different approaches etc. This embodies the value of communication we described previously.
Beck and Andres describe pair programming as a “dialog between two people simultaneously programming (and analyzing and designing and testing)…”
I have had the pleasure of working alongside some of the strongest advocates for pair programming, the engineers at Pivotal Labs, who have been practicing pair programming for over 20 years. Every customer engagement at Pivotal Labs involves pairing, with customer developers also taking part. There is a lot of information about how pairing works on the Pivotal Labs website and blog.
Benefits
- Focus: two people working on one machine keep each other on task, with no drifting off to social media or email backlog.
- Generate and clarify ideas: when one person gets stuck, the other can often jump in and move towards a solution.
- Spread knowledge: instead of building silos of knowledge in the team, pairing allows everyone to benefit and share their experience. This is a great way of bringing people up to speed on a project and also lowers the risk from losing a key member of the team.
- Immediate feedback: bugs are picked up earlier, and you can incrementally improve at a faster pace when someone is helping you spot errors immediately.
Possible complications
- Personality issues: you need a certain attitude to make pairing work. If one person is overbearing or too quiet and contained, the dialogue breaks down and the atmosphere quickly becomes corrosive. Beck and Andres also mention other possible issues including emotional attachment or attraction between partners.
- Personal space: close contact is inevitable when pairing but needs to remain comfortable for both parties. Different cultures have different expectations of personal space, and personal hygiene and health are important factors.
- Sustainable pace: pairing is very tiring and after a full day you can feel exhausted but satisfied. Make sure to include breaks (ping pong is our pastime) and work at a sustainable but energized pace.
The bottom line in dealing with these issues is not to just grin and bear it if you aren’t feeling comfortable pairing with someone on the team. Beck and Andres recommend talking to “someone safe; a respected team member, a manager or someone in human resources.”
Pair Data Science
In the data science team at Pivotal we traditionally have not practiced pair programming explicitly although in many situations we have found ourselves huddled around a screen to solve a problem or working remotely with another data scientist sharing code on screen.
I have recently been experimenting with how to bring more explicit pairing into our engagements with customers. The usually described phases of a data science project are data cleaning, exploration, modelling and operationalization. I think pairing can help each of these phases in different ways.
For the data cleaning and exploration phases, pairing with someone helps to generate more questions to ask the data, helps speed up understanding of the results of queries and ensures that you are continually thinking of how to communicate the insights found. With two participants there is also the opportunity to practice the Socratic method, continually questioning results and driving towards better insights.
If you are pairing with the person responsible for the dataset during these phases you will be able to get much quicker answers to your questions about schema choices, invalid data etc. As a consultant, being able to get answers immediately really accelerates the process of learning about a new dataset. What might take a few emails, phone calls or a meeting can be resolved immediately, and allows you to remain in the flow as you work.
Two heads are better than one for solving problems in innovative ways. Data scientists have a wide variety of backgrounds and you might find that bringing together a particle physicist with a genomics expert can spark interesting and unexpected ideas. Our team already benefits from combining expertise from many disciplines and pairing provides a method for maximising these beneficial interactions.
For modelling and operationalization the benefits are more like those traditionally associated with pairing for software development, as outlined above. In these phases perhaps the data scientist should pair with the team responsible for putting the model into production. Understanding how the model interacts with the wider system is facilitated by immediate discussion rather than the distribution of long-winded specification sheets.
For many data scientists I know the thought of sitting beside someone and constantly discussing what they are doing fills them with dread. ‘How can I think deeply about something if I have to keep talking?’ ‘Do I rigidly have to pair all the time?’ The answer is embodied in one of the Pivotal Labs mottos: “Do What Works”.
Beck and Andres explicitly say that “pairing doesn’t mean that you can’t think alone”. They recommend working alone on an idea if you need to. You should then come back and check in with the team, and not just use it as an “excuse to act outside of the team”. Think about the idea but don’t implement it in code by yourself. I think this freedom to privately think about something but then work together to implement it is a good balance and should address the objections listed above.
To recap, I am very much in favour of pairing for data science and I will be trying to practice pair data science as much as possible from now on.
Stories
Stories are “units of customer-visible functionality”. Beck and Andres warn against using the word “requirement” as this conjures up images of permanence and inability to respond to change. At Pivotal Labs, stories are said to be a placeholder for a conversation and so don’t need an exhaustive level of detail up front.
Estimation is also key in XP Explained but not much is said about how to estimate the duration of a story. This is a non-trivial problem that has been attacked in multiple ways. The Pivotal Labs style of estimation, which is central to the use of Pivotal Tracker as a planning tool, gives ‘points’ to stories based on relative complexity following a conversation between the engineers. In the Pivotal Tracker method the “conversation around assigning points to a feature is often as important as the estimate itself.” Labs developers say that using complexity rather than strict timings tends to provide more accurate estimates.
For data science the concept of a story needs to be broadened. In the exploratory phase I think of a story as a question that can be asked of the data. For example, ‘are there strong regional variations in our sales?’ This might involve identifying the right data, joining with some regional classification, some aggregation calculation and a simple visualisation. This would all be accomplished before the story would be considered finished.
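As a rough illustration of how small such a story can be, here is a minimal sketch of the regional sales question in plain Python. The store IDs, sales figures, region lookup and function names are all made up for the example, and the coefficient of variation is just one assumed way of putting a number on "how different are the regions".

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy data: (store_id, sales) rows and a store -> region lookup.
sales = [("s1", 120.0), ("s2", 95.0), ("s3", 310.0), ("s4", 290.0), ("s5", 101.0)]
store_region = {"s1": "north", "s2": "north", "s3": "south", "s4": "south", "s5": "north"}


def sales_by_region(sales, store_region):
    """Join each sale to its region and return the mean sales per region."""
    grouped = defaultdict(list)
    for store_id, amount in sales:
        # Rows with no matching region stay visible under "unknown".
        region = store_region.get(store_id, "unknown")
        grouped[region].append(amount)
    return {region: mean(amounts) for region, amounts in grouped.items()}


def regional_spread(region_means):
    """Coefficient of variation of the regional means (assumed summary metric)."""
    values = list(region_means.values())
    return pstdev(values) / mean(values)


region_means = sales_by_region(sales, store_region)
for region, avg in sorted(region_means.items()):
    # A crude text bar chart stands in for the "simple visualisation".
    print(f"{region:>8} {avg:8.1f} {'#' * int(avg // 20)}")
print(f"spread across regions: {regional_spread(region_means):.2f}")
```

The story would be finished once the join, the aggregation and the picture all exist and have been shown to whoever asked the question.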
There is a balance to be struck between too broad a story, e.g. “explore the data”, and too narrow, e.g. “compute the standard deviation of this variable in the last 12 months using this package”.
In XP software development teams a product owner creates stories and the engineering team is responsible for implementation details. This process is eased when the product owner sits nearby as clarifications are easy to discuss. I see no reason why this kind of process can’t also be put into practice for a data science project. There should always be a project owner, who is responsible to the business for the investment of your time. If you cannot identify this person it might be a sign that your project is not directly solving a business problem.
Test-First Programming
Testing in software development is now commonplace although styles and test coverage do vary. XP Explained describes what is now normally called Test Driven Development. This requires writing a failing test before implementing code that passes. This method works well with pairing and is often practiced as ‘ping pong pairing’, where one person writes a test, the other writes an implementation that passes, and this continues back and forth.
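One round of ping pong might look like the sketch below. The function `mean_absolute_error` and its tests are hypothetical, chosen only because the metric is familiar to data scientists; the tests are written as plain `assert` functions that a runner such as pytest could collect.

```python
# Step 1 (first person): write the tests before any implementation exists.
# Running them at this point fails with a NameError, which is the "failing test".
def test_perfect_predictions_give_zero():
    assert mean_absolute_error([1.0, 2.0], [1.0, 2.0]) == 0.0


def test_errors_are_averaged_without_sign():
    # |1 - 2| + |2 - 0| = 3, averaged over 2 points = 1.5
    assert mean_absolute_error([1.0, 2.0], [2.0, 0.0]) == 1.5


def test_mismatched_lengths_are_rejected():
    try:
        mean_absolute_error([1.0], [1.0, 2.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for mismatched lengths")


# Step 2 (second person): write the simplest implementation that passes.
def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

The keyboard then passes back: the second person writes the next failing test and the first person makes it pass.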
Testing for data science is a more complicated issue. For production models or code that is intended to be reused in libraries there is no doubt that a good test suite will provide all the same benefits as in normal software development. In XP Explained these are listed as reduction of scope creep, enforcing loosely coupled highly cohesive code, enabling trust and finding a rhythm, namely “test, code, refactor, test, code, refactor”.
For exploratory one-off SQL commands or a quick and dirty visualisation I think the need for full testing is lower. Just be careful that your run-once code doesn’t inexorably find its way into a long lived script or production system. Using ‘spikes’, where a testless exploration is done but the code written is definitively deleted, might be a way to avoid this. Once you are done exploring, sit down at a clean slate and reimplement the calculation with tests from scratch.
Another question is what should be tested. In a data cleaning pipeline does a test need to always be run on the full dataset to ensure for example that every NaN is replaced with some desired value? What if your data is so large that this takes hours to run? Another practice in XP Explained is the Ten Minute Build, the idea that running all the tests and building the full system should ideally take around 10 minutes, long enough for a cup of tea.
Another view is that it may be enough to have a test dataset of corner cases and awkward data that you can quickly run tests against. You might only stumble on some of these corner cases when running the transformation against the full dataset, but others will be possible to predefine.
From discussions with colleagues here at Pivotal I think the best course lies in the second of these. If the end product of your data cleaning pipeline is the transformed data, then running the pipeline on the whole dataset is really a final build step rather than an intermediate test for use during development. So I recommend building a test dataset, maybe with real corner case data or some expected problem scenarios. The benefit of this approach is that you can also try to catch problems from future input data not included in the current data set.
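A minimal sketch of this idea, assuming a hypothetical cleaning step `fill_missing` that replaces missing values with a default. The corner cases listed are illustrative, not a complete catalogue, and in practice you would add rows as the full dataset surprises you.

```python
import math


def fill_missing(values, default=0.0):
    """Replace None and NaN entries with `default`; leave other values untouched."""
    cleaned = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            cleaned.append(default)
        else:
            cleaned.append(v)
    return cleaned


# Hand-built (raw, expected) pairs of awkward inputs. Small enough to run in
# milliseconds, so it fits comfortably inside a ten minute build.
CORNER_CASES = [
    ([], []),                                       # empty input
    ([1.0, 2.0], [1.0, 2.0]),                       # nothing to clean
    ([float("nan")], [0.0]),                        # only missing data
    ([None, 3.5, float("nan")], [0.0, 3.5, 0.0]),   # mixed None and NaN
    ([float("inf"), -0.0], [float("inf"), -0.0]),   # odd but valid floats survive
]


def test_fill_missing_on_corner_cases():
    for raw, expected in CORNER_CASES:
        assert fill_missing(raw) == expected


def test_no_nan_survives_cleaning():
    for raw, _ in CORNER_CASES:
        cleaned = fill_missing(raw)
        assert not any(isinstance(v, float) and math.isnan(v) for v in cleaned)
```

Running the same function over the full dataset then becomes the final build step, while these fast tests guard the logic during development.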
Retrospectives
This is not explicitly listed in XP Explained as a practice but I want to include it as it puts into practice the values of feedback, communication and courage and the principles of improvement, reflection and accepted responsibility.
Having scheduled retrospective sessions where everyone is free to list the good and bad things that have happened provides a safe space to discuss issues, applaud good work and learn for next time around. Having these retrospectives frequently means that the opportunities to improve are more frequent and lessons are learned during a project and not just at the end.
Pivotal Labs practice this with weekly retrospectives, usually on Friday afternoon, usually accompanied by a beer. Three categories of issues are discussed: good things we want to do more of, things we are unsure or have questions about and want to keep an eye on, and bad things that we want to stop or change.
Catching issues early leads to easier resolutions and providing a safe space for discussion is often enough to alert someone to behaviour that they might not have been aware was affecting the rest of the team. It’s important to decide on changes to be made, and at the next retro evaluate whether these were effective.
For data scientists there is clearly also a need to be reflective in the same way and retrospectives should provide a good framework for this.
Summary
I’ve covered a lot of ground in this post, but there are many further practices described in XP Explained that I have not been able to discuss here. A lot more could be written about the topics above too, especially pairing and test driven development for data science.
I hope this series of posts can start discussions about these topics in data science circles and between developers and data scientists. A lot of data scientists come from a non-software development background (including me) and I think the two communities have a lot to learn from each other beyond technical knowledge. Of course there is also a range of other topics that data scientists need to think about, something that Peadar Coyle has written about recently.