Data-related Job Descriptions: Making of a Data Team

Last updated:

WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

The actual position names may, of course, vary to some extent depending on your location or industry. The ones used in this article were chosen after months of observing job descriptions, news articles and, most importantly, day-to-day conversations with people who work with data.

CDO (Chief Data Officer)

Who you should be:
  • Executive profile, ideally with previous data-related work
What you should do:
  • Oversee all data operations in your organization, at a high level
  • Prioritize tasks according to business impact
  • Define the company's data strategy
  • Connect data initiatives with the organization's overall business objectives
What you should not do:
What you may do:
Sample tools:
  • Spreadsheets (e.g. MS Excel)
  • Project Management Software (e.g. MS Project)
  • Other Generic tools (e.g. AirTable)

Data Lead

Who you should be:
  • Senior employee, with extensive experience in data work and deep domain expertise
  • Relevant advanced degrees are expected
What you should do:
  • Oversee data work done throughout the company, at a more technical level than the CDO
  • Use domain knowledge to steer data scientists towards solving actual business problems
  • Mentor other employees (maybe conduct workshops or technical training)
  • Make technical decisions and/or settle differences in opinion within data teams
What you should not do:
What you may do:
Sample tools:

A data lead will work as an intellectual leader for all employees working with data in the organization. They will make technical decisions if necessary, and help turn the company's data strategy (if any) into actual systems and projects.

They should have a strong track record of data work themselves and possibly some managerial experience. In addition, they should have deep domain expertise and offer guidance and mentoring to other members of the organization.

The role of a data lead may of course be held by someone who also performs other work in the organization. This position commonly emerges from within the organization when it becomes necessary; it will often be the most senior data scientist or engineer, or the person whose opinion is perceived to carry the most weight in discussions.

Data Scientist

Who you should be:
  • A graduate from a highly quantitative field; or a CS graduate with a strong quantitative/statistical bent
  • Knowledgeable in the domain area (e.g. finance, retail, marketing, etc.) you are working on
What you should do:
  • Explore and create useful visualizations for data
  • Train predictive and descriptive statistical models based on data
  • Excel at your ML toolkit of choice (R, Python, Java, etc.)
  • Suggest ways in which value can be extracted from data (actionable insights)
What you should not do:
What you may do:
Sample tools:

Data Engineer

Who you should be:
  • Ideally a CS graduate, with previous experience with Distributed Systems, System Administration and Software Development
What you should do:
  • Write data pipelines to move data around the infrastructure
  • Write streaming and batch jobs
What you should not do:
What you may do:
Sample tools:
  • Workflow/Pipeline Orchestrator (e.g. Airflow)
  • Distributed Software Frameworks (e.g. Spark/Hadoop MR)
  • Message Queues (e.g. ActiveMQ)
  • Data Streams (e.g. Kafka)
  • Caching tools (e.g. Redis)
  • Non-relational Data Stores (e.g. Elasticsearch, MongoDB, S3)
  • RDBMSs (e.g. PostgreSQL)
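
To make "write data pipelines" and "batch jobs" a little more concrete, here is a minimal sketch of a batch job split into extract, transform and load steps. It uses only the Python standard library; the `RAW_CSV` data, the `payments` table and the in-memory SQLite database are all assumptions made for illustration. A real pipeline would typically run under an orchestrator and write to a production data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, queue or API.
RAW_CSV = """user_id,amount
1,10.50
2,not_a_number
1,4.50
3,7.00
"""

def extract(text):
    """Yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))

def transform(rows):
    """Parse types and drop rows whose values cannot be parsed."""
    for row in rows:
        try:
            yield int(row["user_id"]), float(row["amount"])
        except ValueError:
            continue  # skip bad data instead of failing the whole batch

def load(records, conn):
    """Write the cleaned records to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

# SQLite in memory stands in for the real destination database.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Quick sanity check on what was loaded: total amount per user.
totals = dict(conn.execute("SELECT user_id, SUM(amount) FROM payments GROUP BY user_id"))
```

Because each step is a generator, rows flow through one at a time, which keeps memory use flat even for large inputs.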

Database Engineer

Who you should be:
  • Ideally a CS graduate, with a database bent
What you should do:
  • Optimize data stores depending upon access patterns
  • Debug slow queries
  • Choose the best database for a given task
  • Keep databases running smoothly
What you should not do:
What you may do:
Sample tools:
  • RDBMSs
  • Document Stores (e.g. Elasticsearch, MongoDB)
  • Wide-column Data Stores (e.g. Cassandra)
  • Data Warehousing Tools (e.g. Redshift)
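
As an illustration of "debug slow queries" and "optimize data stores depending upon access patterns", the sketch below inspects a query plan before and after adding an index. SQLite (from the Python standard library) is used here only as a stand-in for whatever RDBMS is actually in use; the `events` table, its columns and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, "x") for i in range(1000)],
)

# The access pattern under investigation: look up events by user.
query = "SELECT * FROM events WHERE user_id = ?"

def plan(sql):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return [row[3] for row in rows]

plan_before = plan(query)  # without an index, SQLite has to scan the table
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = plan(query)   # with the index, SQLite can search instead of scan

print(plan_before)
print(plan_after)
```

Other databases expose the same idea through their own commands (for example `EXPLAIN` in PostgreSQL), with different output formats.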

Data Analyst

Who you should be:
  • Ideally a graduate from a quantitative field
What you should do:
  • Communicate with and obtain data from external sources
  • Query databases
  • Deliver high-level analyses of data, such as means, sums, counts, counts per day, outliers, etc.
What you should not do:
What you may do:
  • Prepare executive reports
  • Clean datasets (merge datasets, take out bad data, etc.)
  • Build datasets from multiple sources
Sample tools:
  • Spreadsheets
  • Data analysis tools (e.g. Tableau)
  • PostgreSQL
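
As a rough illustration of the high-level analyses listed above, the following minimal sketch computes a sum, a mean, counts per day and simple outliers using only the Python standard library. The `orders` records are made up, and the 1.5 × IQR outlier rule is just one common convention assumed for the example; the same results could equally be produced in a spreadsheet or with SQL.

```python
from collections import Counter
from statistics import mean, quantiles

# Hypothetical sample data: (day, order amount)
orders = [
    ("2024-01-01", 20.0),
    ("2024-01-01", 22.0),
    ("2024-01-02", 19.0),
    ("2024-01-02", 21.0),
    ("2024-01-02", 23.0),
    ("2024-01-03", 18.0),
    ("2024-01-03", 250.0),
]

amounts = [amount for _, amount in orders]

total = sum(amounts)
average = mean(amounts)
counts_per_day = Counter(day for day, _ in orders)

# Flag outliers with the 1.5 * IQR rule (one convention among several).
q1, _, q3 = quantiles(amounts, n=4)
iqr = q3 - q1
outliers = [a for a in amounts if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print(total, round(average, 2), dict(counts_per_day), outliers)
```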

Machine Learning Engineer

This role is sometimes referred to as "Software Engineer - Machine Learning".

Who you should be:
  • Ideally, an experienced software engineer with good knowledge of machine learning
What you should do:
  • Engineer the system to serve the models to clients
What you should not do:
  • Try to develop your own models from scratch
What you may do:
Sample tools:
  • Black-box ML Solutions (e.g. Prediction.io)
  • ML Toolkits (e.g. Scikit-learn)
  • Web Frameworks (e.g. Flask, Ruby on Rails)

This is probably the most versatile role so far. Machine learning engineers will build working systems that deliver machine learning solutions and connect them to other systems.

They split their time more or less evenly between software development and building statistical models for data.

They know how to build general-purpose software systems (along with all associated tasks such as testing, versioning, deploying and implementing software engineering best practices) and they know enough about machine learning algorithms and tools to use them to add value to the organization.

They probably do not, however, have the same level of expertise in their respective areas as specialized roles such as Data Engineers or Data Scientists. For that reason, larger teams should have at least one Data Engineer and one Data Scientist to focus on and optimize their respective areas.
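
To illustrate what "serving the models to clients" can look like, here is a minimal sketch of the request-handling core of a prediction service, using only the Python standard library. The `WEIGHTS` dictionary stands in for a model trained elsewhere, and the feature names and the `handle_request` function are hypothetical; in a real system this logic would sit behind a web framework route.

```python
import json

# Hypothetical pre-trained linear model. In practice the weights would be
# loaded from an artifact produced by a data scientist, not hard-coded.
WEIGHTS = {"bias": 0.5, "age": 0.25, "visits": 2.0}

def predict(features):
    """Score one record with the linear model."""
    return WEIGHTS["bias"] + sum(
        WEIGHTS[name] * float(value) for name, value in features.items()
    )

def handle_request(body):
    """Turn a JSON request body into a (status code, JSON response) pair."""
    try:
        features = json.loads(body)
        # Reject any feature the model was not trained on.
        unknown = set(features) - (set(WEIGHTS) - {"bias"})
        if unknown:
            raise KeyError(", ".join(sorted(unknown)))
        return 200, json.dumps({"prediction": predict(features)})
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        # Malformed JSON, unknown features or non-numeric values.
        return 400, json.dumps({"error": str(exc)})

status, response = handle_request('{"age": 30, "visits": 2}')
print(status, response)
```

Most of the code deals with validation and error handling rather than the model itself, which reflects where the engineering effort in this role tends to go.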

Other roles

  • Business Analyst

    TODO


Suggestions of Team Setups

1) 1 Data Scientist + 1 Data Engineer + 2 Data Analysts

Company Size      Company Data Maturity
Small - Medium    Medium - High

2) 1 Machine Learning Engineer + 1 Data Engineer

Company Size      Company Data Maturity
Small             Low

3) 1 Data Scientist + 1 Data Engineer + 1 Database Engineer

Company Size      Company Data Maturity
Small - Medium    Medium - High
