Helping Data Science Projects Succeed: 5 Tips on How to Avoid Becoming a Statistic
WIP Alert: This is a work in progress. The current information is correct, but more content may be added in the future.
Data science projects are complex and very often fail.
What does it mean for a project to fail?
One way to tell is to ask the client: "Would you have done this project if you had known how it would end?" If the answer is "No", the project has failed.
Reasons projects fail
Data scientists or project managers failed to iron out open questions before the project started
Impossible expectations from clients
Out-of-control complexity
Mismatch between training time and inference time, both in what data can be accessed and in how the system behaves
Unmet assumptions
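The train time / inference time mismatch is one of the few items on this list that can be checked mechanically. Below is a minimal sketch, assuming you can write down which features the model was trained on and which ones the serving system can supply when a prediction is requested. The feature names and the `missing_at_inference` helper are hypothetical, made up for illustration.

```python
# Hypothetical feature names, for illustration only.
TRAINING_FEATURES = {"account_age_days", "num_past_orders", "chargeback_flag"}

# Features the serving system can actually supply when a prediction is requested.
INFERENCE_FEATURES = {"account_age_days", "num_past_orders"}


def missing_at_inference(training, inference):
    """Return features used in training that are unavailable at prediction time."""
    return sorted(set(training) - set(inference))


missing = missing_at_inference(TRAINING_FEATURES, INFERENCE_FEATURES)
if missing:
    print("Not available at inference time:", ", ".join(missing))
```

In this made-up case, a chargeback is only known weeks after a transaction, so a model trained on it would look good offline and be unusable in production.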
Avoiding failures
So what can team members (any role) do to increase chances of success?
The following tips can help you avoid failure in data science projects, whether you are a practitioner or a project lead:
Understand why and how your solution will be used
It is critical that you understand the scenarios under which the solution you are building will be used in practice.
You need to understand:
What business problem are you trying to solve? How does this help the customer make money? How is this process done currently?
How will the solution be used, technically (realtime APIs? batch runs? something else?)
Who will consume the output your solution produces (Other systems? Humans? What is their level of expertise?)
Make sure there is data
Very often, clients have a real problem to solve, but the data they have is not good enough to build a solution on.
It is very important to understand what sort of data you will have and what its quality is. Here are some questions to ask:
How far back does our data go?
How was it collected? Does it cover all cases, or are some cases missing?
What does it look like in terms of distributions, etc?
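The questions above can be turned into a quick first-pass audit. The sketch below uses only the Python standard library and a handful of made-up records. The field names (`day`, `segment`, `amount`), the list of expected segments, and the `audit` helper are all assumptions for illustration, not part of any real dataset.

```python
from collections import Counter
from datetime import date
from statistics import mean, median

# Hypothetical records; in practice these would come from the client's systems.
records = [
    {"day": date(2023, 1, 5), "segment": "retail", "amount": 120.0},
    {"day": date(2023, 2, 11), "segment": "retail", "amount": 80.0},
    {"day": date(2023, 3, 2), "segment": "wholesale", "amount": 950.0},
    {"day": date(2023, 3, 20), "segment": "retail", "amount": None},
]

# Assumed list of cases the solution is expected to handle.
EXPECTED_SEGMENTS = {"retail", "wholesale", "online"}


def audit(rows, expected_segments):
    """Summarise history length, case coverage, and a basic distribution."""
    days = [r["day"] for r in rows]
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    seen = Counter(r["segment"] for r in rows)
    return {
        "first_day": min(days),                       # how far back the data goes
        "last_day": max(days),
        "span_days": (max(days) - min(days)).days,
        "missing_segments": sorted(expected_segments - set(seen)),  # uncovered cases
        "rows_per_segment": dict(seen),
        "missing_amounts": len(rows) - len(amounts),  # rows with no value recorded
        "amount_mean": mean(amounts),
        "amount_median": median(amounts),
    }


report = audit(records, EXPECTED_SEGMENTS)
for key, value in report.items():
    print(f"{key}: {value}")
```

A report like this will not answer how the data was collected, but it gives you concrete numbers to bring to that conversation with the client.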
When building models, you will likely need a couple of months' worth of data before you can model anything.
- So if you have to start collecting data you don't have today, you will have to wait a couple of months before you can build anything on top of it.
Avoid misunderstandings
Clients (non-technical or otherwise) will often misunderstand things and make all sorts of assumptions before and during a project. You want to be very clear when communicating with them:
Reframe concepts and ideas
One way to help avoid misunderstandings is to describe things in another way, i.e. to reframe a concept or idea.
In data science projects we usually deal with complex and highly abstract concepts.
Examples of ways to do this are: "So what you mean is...", "Would I be right in saying that..."
Draw things
TODO diagrams
Give examples
TODO
Have people explain how they work to you
Have you ever had someone ask you for help and, when they had finished explaining their problem to you, say: "Never mind, I understand it now. Thanks for listening"?
This happens because the mere act of saying things out loud helps the speaker better understand what they are thinking.
Very often you need to create solutions for tasks that don't have a well-defined process, so you have to dig for information:
Whenever possible, ask clients to explain their work to you. Here are some questions you may want to ask:
What exactly do you do in case XYZ happens?
Walk me through an average task. What things do you look at, what tools do you use, etc.
What is the difference between X and Y?
Push an MVP ASAP
"Can't we do a simpler version first?"
This can be something as simple as an initial version of the ML model built from hardcoded rules, just so clients can see what the output will look like.
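As a rough illustration of such a first version, here is a sketch of a rule-based stand-in for a model. The task (flagging transactions for manual review), the field names, and the thresholds are all invented for this example. The point is only that `predict` returns output in the same shape the eventual model would, so clients can react to it early.

```python
# Hypothetical task: flag transactions for manual review.
# The rules and thresholds below are invented for illustration.

def predict(transaction):
    """Return the same output shape the eventual model is expected to produce."""
    if transaction["amount"] > 10_000:
        return {"label": "review", "reason": "amount above 10,000"}
    if transaction["country"] != transaction["card_country"]:
        return {"label": "review", "reason": "country mismatch"}
    return {"label": "ok", "reason": "no rule triggered"}


examples = [
    {"amount": 25_000, "country": "BR", "card_country": "BR"},
    {"amount": 40, "country": "BR", "card_country": "US"},
    {"amount": 40, "country": "BR", "card_country": "BR"},
]

for tx in examples:
    print(predict(tx))
```

Because callers only depend on `predict` and its output format, the hardcoded rules can later be swapped for a trained model without changing how the result is consumed.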