Data Scientist Archetypes

14/11/2018 Data scientist archetypes
Data scientist archetypes

Data Science and Robots Blog
Watch the original Metis DemystifyDS conference talk, with Q&A

Get the slides
When you come to the field of data science for the first time, or even if you’ve been in it for a while, you can be easily
overwhelmed by the number of tools and the vast range of skills required. Blog posts about what it takes to become a data
scientist devolve into laundry list of skills you don’t have yet and tools you haven’t learned yet. This is discouraging. And it’s
counterproductive.
My goal here is to give you a map for navigating the sprawling terrain of data science. It’s to help you prioritize what you
want to learn and what you want to do, so you don’t feel lost.
https://brohrer.github.io/data_science_archetypes.html 1/11
These ideas are not mine alone. They are based on a series of interviews with the data scientists listed here, who are from
several different companies, as well as contributors from academic data science programs. The underlying ideas are
captured in a public document, offered as a resource to academic data science programs from industry data scientists—an
attempt at a virtual industry advisory committee. Any insights that you get from this, I credit to the group, and any errors or
inaccuracies are probably due to me alone. Overall this a compilation of a lot of people’s ideas who have thought deeply
about it.
The reason data science feels so big is because it’s no longer a single field. There are actually three separate pillars. There
is data analysis, taking raw information and turning it into knowledge that can be acted on or that can drive a decision. There
is data modeling, using the data that we have to estimate the data that we wish we had. And then there is data engineering,
which is taking these analysis and modeling activities and making everything work faster, more robustly, and on larger
quantities of data.
These three areas are all distinct. They require different skills and different sets of tools. Let’s dive a little deeper into each
one.
Data analysis
There are several aspects of data analysis, the process of taking data and turning it into information that we can use to make
a decision.
Domain knowledge is the ability to translate a business need or a practical concern to a question. It involves making trade-
offs between how accurate the answer should be and how much time and money you want to spend finding it. Domain
knowledge is also absolutely necessary for interpreting the data, for understanding not just the numbers and labels
presented, but what they reflect in the real world, what we can realistically assume about their accuracy.
Another important part of translating information into actions is research. This can can be as straightforward as performing
the right Google search and reading a few documents. Often, it involves instrumenting processes that are already in place,
like logging code, or adding sensors to a production line, or conducting surveys. For some applications, research involves
careful design of experiments. It requires using sophisticated statistical tools to plan what information will be gathered, how it
will be gathered, and how it will be analyzed after the fact.
Finally an important part of data analysis is interpretation. Given a large collection of data, how can we summarize it,
aggregate it, visualize it, and use statistical summaries to go from a sea of bits to a nugget of knowledge. In particular, I want
to call out the art of visualization. Turning numbers into a picture that is easily interpreted and conveys useful information is
an art as much as a science. It involves an intuitive sense of human visual perception and understanding.
Data modeling
Data modeling also has several major subcategories. This is often referred to as machine learning. It is no more or less than
creating a simplified description of your data that you can use to make estimates for data that you have not measured. Three
main sub classes of machine learning are supervised, unsupervised, and custom model development.
Supervised learning uses labeled examples to discover patterns. These can be categorical labels, for example: Did this
customer re-visit our website within the month after making a purchase, yes or no? Examples can also be labeled with
numbers. For example: How many views did a video receive? Using a large collection of labeled examples, accompanied
either by their category or value, supervised learning algorithms can distill the underlying patterns out. If the labels are
categorical, this is called classification. If those labels are numerical, this is called regression. Also, sometimes the goal is
not to make a specific prediction about future examples, but simply to determine whether they are in line with previously
seen patterns. This is called anomaly detection.
A second major area of machine learning is called unsupervised learning, indicating that the data does not come with labels.
We don’t know the outcome or the right answer. In unsupervised learning, the goal is to discover patterns in how the data is
distributed. In clustering, groups of data points that are nearer to each other than to the rest of the data set are identified.
These are natural groupings in the data. This can be done to find shopppers that buy similar items at the grocery store.
In dimensionality reduction, groups of variables that tend to behave similarly or in a coordinated fashion are identified as
well, such as genes that work together to produce a desired trait. In dimensionality reduction, it’s not the data points that are
being clustered, but the variables. Not the rows of the table, but the columns.
In applied data modeling, it is very common to have to do custom algorithm development. Both supervised and unsupervised
learning methods are general. They begin by assuming very little about the source of the data, what it means, and what
domain it resides in. General algorithms can get away with this by sheer statistical power. They are at their best when you
have a large collection of data.
However, in practical applications it is often the case that we don’t have nearly as much data as we would like, and we don’t
have enough to rely on naïve machine learning techniques. When this happens, we get to do custom algorithm
development. We build in information that we know, or feel comfortable assuming, about the domain. For instance, one
important modeling problem in the agriculture industry is to predict the yield for a field of corn, the total weight of the
harvested grain. This is an unfathomably complex system, with soil texture and chemistry, genetics, temperature, and
moisture interacting. And it takes a year to collect a single data point. There is no feasible way to learn a complete crop
model using naïve machine learning methods. However, geologist, chemists, biologists, meteorologist, and geneticists have
all investigated and described the pieces of this system. By assembling what is known and making some reasonable
assumptions, a tractable crop model can be created.
Data engineering
Data engineering is the third pillar of data science. It has a few major areas. Data management is the exercise of storing,
moving, and handling data. It is trivial for data sets of a few thousand or even a few million data points. But when individual
data points become very complex, like an individual’s genome, or the number of data points gets into the billions or trillions,
data no longer fits inside RAM on a laptop, and an entirely new set of tools needs to be used.
Another big task of data engineers is taking code that works well in a prototype and making it ready for operation in the wider
world. It is one thing to build a model in a Jupyter notebook, but to be able to use that to filter every incoming customer
request, the code needs to be made compatible with the rest of the company’s code base, and it needs to be able to ingest
data and publish results in the appropriate formats. Any glitches or failures need to be handled gracefully, and the system
needs to be made secure against mischievous or malicious users. The gap between a smoothly running prototype and
production code is a big one and part of data engineering is the process of closing it.
In both data management and production, software engineering plays an important role. It’s the act of building things out of
code that not only do what you want as fast as you want, but also can be shared and maintained across your current team,
adapted to future changes, and scaled up or down as needs evolve. This is a craft and comes with a large collection of tools
and specialized skills.
Data mechanics
There’s also a surprise pillar number four, data mechanics. This is the dirty work that everyone needs to do but nobody likes
to talk about. It includes things like data formatting, making sure your types are consistent, getting the information you want
out of strings, making sure the strings in use are valid, handling errors gracefully. It also includes value interpretation,
handling dates and times in an interpretable and consistent way, responding correctly to missing values, making sure units of
measurement are consistent and documented. And it includes the fundamentals of data handling: querying, slicing, and
joining. This lets you pull the subset of data that you want for an operation from the great lake of what you have available.
Now we have our four pillars in place: data analysis, data modeling, data engineering, and data mechanics. What does this
mean for you as a data scientist? These areas are all very different. I propose that a data scientist is someone who has at
the very least exposure to all these skill areas. That means that you’re not necessarily an expert, or even good at them, but
you know what they are, and you know the right questions to ask and right Google searches to run.
A beginner data scientist might have just a little bit of experience in each of these pillars.
With this as a base we can walk through some data scientist archetypes. It’s my hope that these will be helpful to you in
planning out your training and career trajectory.
Archetypes
One path forward is to become a data science Generalist- to develop a solid mixture of proficiency in analysis, modeling,
engineering, and mechanics. A nice balance across the board gives you a broad view of problems of varying scopes. It may
mean that you’ll need to collaborate with specialists for some projects. But it sets you up very well to be a technical leader or
data executive.
Just a word of caution: there may be a tendency to neglect data mechanics or to assume that someone more junior will take
care of it. The person who does this is a data scientist Diva. This is an anti-pattern. This is someone you do not want to be
and you do not want to hire. Mechanics can get gritty. Some days it will occupy the majority of your time. But you never
outgrow the need to do it. Embrace mechanics tasks, learn the skills you need to do them adequately, and get on with it.
There are specialist archetypes as well, emphasizing each of the three main pillars.
If you choose to emphasize data analysis, you can become a Detective, a master in discovering the right data and using it to
draw conclusions. A detective is familiar with modeling and engineering methods, and uses them in a lightweight way.
Analysis is their focus.
Another data scientist archetype is the Oracle, someone who has mastered the skills of modeling and the tools of machine
learning. They will have had exposure to analysis and engineering but their focus will be on modeling—either general
machine learning models or custom, domain specific models.
Finally, you can choose to emphasize data engineering, and become a Maker. This is the person who helps everything work
fast and reliably. Their work will still require some analysis and some modeling, and definitely some mechanics, but they will
be focused on engineering tasks. Makers are the ones who transform good ideas into concrete machinery.
Finally, there is the Unicorn. This person is a master in every pillar. This is the data science ideal. Except of course that it
doesn’t exist. Becoming a master of one pillar takes a long time. Becoming a master in all three is essentially stacking three
careers on top of each other. It’s not impossible to do, but it’s a long road, and the path down it passes through one of the
other archetypes as a waypoint.
The archetypes that are most useful for career planning, then, are
The Generalist, proficient at everything,

The Detective, a master of analysis,
The Oracle, a master of modeling, and
The Maker, a master of engineering.
Another way to represent these is in a 3-D plot with axes for analysis, modeling, and engineering. You can see the Detective,
Oracle, and Maker each lying close to an axis. The Generalist is a balance of three, sitting in the middle of the space. I want
to be completely clear that I’m not implying any hierarchy or value judgement between the archetypes. They are each
invaluable and serve different functions. Let your preferences and interests guide you as you choose between them.
Job postings
There is not a commonly accepted taxonomy for data science specializations. When looking for a job, I recommend ignoring
the job titles. There’s no reliable way to translate between job title and archetype. I also recommend not worrying too much
about the specific tools listed. Instead, look at this skills implied. Read between the lines to discover which archetype they
want. Does the posting sound more like a Maker? A Generalist? a Detective? Based on this, you can decide whether this
particular posting is a good match for you.
Also, when posting a job, ask for specific skills. Don’t ask for all of them. And don’t ask for a Unicorn. You won’t find one.
What I hope you take away from this, more than anything, is that you don’t have to be an expert at everything to be a great
and successful data scientist. It’s enough to choose a direction and dive in.
Good luck! If you haven’t already picked your path, I hope this helps you plan it, and I wish you lots of success along the
way.
Brandon
July 25, 2018
Need to procrastinate? Check out the rest of the
Data Science and Robots Blog
except where noted.

Data Scientist Archetypes

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Data Scientist Archetypes

Hochgeladen von

Copyright:

Verfügbare Formate

14/11/2018 Data scientist archetypes

Data scientist archetypes

Watch the original Metis DemystifyDS conference talk, with Q&A

The Generalist, proficient at everything,

Need to procrastinate? Check out the rest of the

Data Science and Robots Blog

except where noted.

Das könnte Ihnen auch gefallen