
How to Learn Python for Data Science, The Self-Starter Way
October 23, 2016


Do you want to learn Python for data science, but don’t want to take a slow, expensive course?
Most courses are just rehashed versions of the excellent free content out there. Here are
resources for self-starters to acquire this valuable skill at their own pace!

At its heart, data science is about problem solving, exploration, and extracting valuable
information from data. To do so effectively, you'll need to be able to wrangle datasets,
implement statistical models, write programs, and much more.

Therefore, developing sharp programming skills is critical to your success. It's like learning
how to ride a bike in a crowded city. Not only will you reach your destinations faster, but you'll
also have the freedom to visit areas you could never reach on foot.

Plus, your chosen programming tool will become your trusty sidekick in this journey. For most
aspiring data scientists, we strongly recommend starting with Python. Then, you should learn R
after you become fluent with Python.

Python is one of the most widespread languages in the world, and it has a passionate community
of users:

Python popularity in 2016, TIOBE Index

Within the data science community, Python is even more popular. Here's why...

Why Learn Python for Data Science?


Some people judge the quality of a programming language by the simplicity of its "hello, world!"
program. Python does pretty well by this standard:

Python

print("hello, world!")

For comparison, here's the same output in Java:

Java

public class Main {
    public static void main(String[] args) {
        System.out.println("hello, world!");
    }
}

Great, case closed! See you back here after you've mastered Python, sound good?

...
Okay, okay... but in all seriousness... simplicity is definitely one of Python's biggest strengths.
Thanks to its precise and efficient syntax, Python can often accomplish the same tasks with much
less code compared to other languages. This makes implementing solutions refreshingly fast.

In addition, Python's vibrant data science community means you'll be able to find plenty of
tutorials, code snippets, and fixes to common bugs. Stack Overflow will be one of your best
friends.

Finally, Python has an all-star lineup of libraries (a.k.a. packages) for numeric and scientific
computing, all of which will make your life much easier. More on this later.

The Self-Starter Way


We believe in a hyper-practical, action-centric approach to learning Python for data science as
quickly as possible, but you must be a self-starter to succeed with this strategy.

The reason is that we're going to completely cut out "classroom" study. You'll learn just enough
of the fundamentals to jump into real-world problems, and then gradually build mastery over
time by "just doing shit." (not the formal term)

You'll also have a ton of fun using this method because it's the fastest way to gain the essential
programming skills required to start doing data science.

However, you must first build a rock-solid foundation of core programming concepts. This is
the one place where you cannot take any shortcuts because you'll need to know how to translate
solutions in your head into instructions for a computer. Effective programming is not about
memorizing syntax, but rather mastering a new way of thinking.

We recommend learning Python for data science through the following 3 reliable steps:

1. Core Programming Concepts: Learn how to solve problems using code.
2. Drills and Challenges: Practice to master the core skills.
3. Essential Data Science Libraries: Equip the tools needed for data science.

After completing these 3 steps, you'll be ready to dive into projects and analyses while
continuing to learn as you go.

Aside: Installing Python through Anaconda


There are many ways to install Python on your computer, but we recommend installing it
through the Anaconda bundle, which includes many of the libraries you'll need for data science.
Here's a quick tutorial on installing Python using Anaconda.

Python 2.7 or 3.0+? Use Python 2.7, plain and simple. Python 2.7 is more widely used in almost
every field. It supports more packages, especially those required for machine learning.

Step 1: Core Programming Concepts


The amount of time you spend at this step depends on how much previous programming
experience you have and whether you can work on this full-time or part-time, but it typically
ranges from 1 week to 6 weeks.

If you are completely new to programming, be prepared to spend at least 1 month on this step.
You'll want the time to absorb these rich concepts. They form the base needed to learn Python
for data science quickly.

Among all the courses, tutorials, and guides out there, we've found the following two resources
to be the best for self-starters. They are both self-paced, hands-on, and comprehensive (and free).

You're new to programming?


How to Think Like a Computer Scientist is a fantastic interactive online book that
takes a whirlwind tour through key programming concepts (with Python). If you're new to
programming, we suggest starting here, as it's like a condensed "Computer Science 101" course.

You've programmed before?

Learn Python the Hard Way is an excellent online book for people with some previous exposure
to programming concepts. The "hard way" simply refers to learning through instructive
exercises. Through 52 short exercises, you'll start with setting up Python and incrementally work
your way up to writing multi-file programs.

Step 2: Drills and Challenges


If you want to learn Python for data science well, then don't skip this step.

After you grasp the core programming concepts, spend a week or two solidifying them
by completing drills and challenges.

If you try to jump into a real project right away, you'll be overwhelmed by the number of moving
parts. It's easy for our brains to trick us into believing we know something after reading about it
in a book, but it takes concentrated practice to really learn the skills.

Think about it this way. Professional basketball players cannot just play games all the time if
they want to improve. They must also spend hours every day practicing specific shots from
different parts of the court.

When you take your newfound programming skills and hone them through short, targeted drills
and challenges, you'll improve much faster than jumping into projects immediately.

Here's what we recommend:

Get into fighting shape...

Code Fights is a platform with many short coding challenges that can be completed in 5-minute
chunks (although it's so fun that you might find yourself playing through it for hours at a time).
You'll gain points along the way and unlock new levels, making it a nice way to track your
progression as well.

Solve a mystery...

The Python Challenge is one of the coolest puzzles on the web, so don't be put off by its 1990s
graphics. You can complete all 33 levels with the help of Python scripts. One user called it "an
addictive way to learn the ins and outs of Python..." We agree!

Consider alternative solutions...

PracticePython.org is a collection of short practice problems in Python. It's updated almost every
week with a new problem. What's really nice is that the author includes multiple user-submitted
solutions for each problem so you can see alternative ways of solving them.

Step 3: Essential Data Science Libraries


Now you're almost ready to dive into real data science projects!

First, we built a strong foundation of core concepts. Then, we practiced pure Python through
drills and challenges. Now, we're going to focus on the "for data science" part of "how to learn
Python for data science."

As we mentioned earlier, Python has an all-star lineup of libraries that are essential for data
science. To begin, we recommend acquiring a working knowledge of NumPy, pandas,
SciPy and matplotlib, while using them in the IPython notebook environment. This is the core
stack of tools you'll need for data analysis.

Other important libraries, such as scikit-learn (machine learning) or beautifulsoup4 (web
scraping), can be picked up when you need to learn their specific use cases later.

The Big 5 Essential Libraries

• NumPy - NumPy is the grand-daddy of all data science libraries. It allows easy and
efficient numeric computation, and many other machine learning libraries are built on top
of it.
• Pandas - Pandas is a high-performance library for data structures and exploratory analysis.
• Matplotlib - Flexible plotting and visualization library.
• IPython - Interactive shell for Python that makes it much easier to explore data and debug
errors. Makes it much more enjoyable to learn Python for data science.
• SciPy - Extends NumPy with more functionality, such as calculating integrals, linear
algebra, and statistics.

Training Videos

• NumPy Beginner (Video), (Course Materials) - Excellent, thorough introduction to
scientific computing with NumPy.
• Introduction to Pandas and Exploratory Data Analysis (Video) - Pandas, IPython, and
matplotlib for exploratory data analysis.

More Resources

• How to Learn Statistics for Data Science, The Self-Starter Way
• How to Learn Math for Data Science, The Self-Starter Way
• Supercharge Your Data Science Career: 88 Free Resources

How to Learn Math for Data Science, The Self-Starter Way
October 30, 2016


Do you need a Ph.D. in math to become a data scientist? Absolutely not! This guide will
show you how to learn math for data science and machine learning without taking slow,
expensive courses.

How much math you'll do on a daily basis as a data scientist varies a lot depending on your role.
Keep reading to find out which concepts you'll need to master to reach your goals.

Pre-requisite: Basic Python Skills


To complete this guide, you'll need at least basic Python* programming skills. We'll be learning
math in an applied, hands-on way.

Check out our guide, How to Learn Python for Data Science, The Self-Starter Way, for the
fastest way to get up to speed with Python. We recommend at least completing up to Step 2 in
that guide.

*note: other languages are fine too, but the examples will be in Python.

Math Needed for Data Science


The amount of math you'll need depends on the role. First, every data scientist needs to know
some statistics and probability theory. We have a guide for that:

• How to Learn Statistics for Data Science, The Self-Starter Way

What about other types of math? Well, here's where the answer is more nuanced... it depends on
how much original machine learning research you'll be doing.

Application-Heavy Machine Learning Positions


In practice, especially in entry-level roles, you'll often be using out-of-the-box ML
implementations. There are robust libraries of common algorithms in many programming
languages. You don't need to reinvent the wheel.

Even so, interviewers may still test your basic linear algebra and multivariable calculus. Why
do they do this?

Well, at some point, your team may still need to build custom implementations of ML
algorithms. For example, you may need to adapt one to your tech stack or to expand its base
functionality. To do so, you must be able to peel back ML algorithms and work with their
innards.

R&D-Heavy Machine Learning Positions

Other roles need much more original ML research and development. You may need to translate
algorithms from academic papers into working code. Or, you might research enhancements
based on your business's unique challenges.

In other words, you'll be implementing algorithms from scratch much more often.

For these positions, mastery of both linear algebra and multivariable calculus is a must.

The Best Way to Learn Math for Data Science


The self-starter way to learning math for data science is to learn by "doing shit." So we're
going to tackle linear algebra and calculus by using them in real algorithms!

Even so, you'll want to learn or review the underlying theory up front. You don't need to read a
whole textbook, but you'll want to learn the key concepts first.

Here are the 3 steps to learning the math required for data science and machine learning:

1. Linear Algebra for Data Science: Matrix algebra and eigenvalues.
2. Calculus for Data Science: Derivatives and gradients.
3. Gradient Descent from Scratch: Implement a simple neural network from scratch.

Step 1: Linear Algebra for Data Science


Many machine learning concepts are tied to linear algebra. For example, PCA requires
eigenvalues and regression requires matrix multiplication.

Also, most ML applications deal with high dimensional data (data with many variables). This
type of data is best represented by matrices.
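
Here's a minimal sketch of both ideas using NumPy. The data matrix is hypothetical, and the targets are generated from a known formula so you can check the answer. The regression uses the textbook normal equations, which is an illustrative choice rather than how production libraries necessarily do it.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 2 variables (columns).
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Targets built exactly as y = 1 + 2*x1 + 3*x2, so the true answer is known.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Regression via matrix multiplication: solve (A^T A) coef = A^T y.
A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
coef = np.linalg.solve(A.T @ A, A.T @ y)

# PCA via eigenvalues: decompose the covariance matrix of the data.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Share of total variance captured by the first principal component.
explained = eigenvalues[-1] / eigenvalues.sum()

print(coef)
print(eigenvalues, explained)
```

The recovered coefficients should match the 1, 2, 3 used to build y, and the largest eigenvalue tells you how much of the data's spread lies along a single direction.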

Here are a few of the best free resources we've found for learning linear algebra for data science:

For application-heavy roles...

Khan Academy has short, practical linear algebra lessons. They cover the most important topics.

For R&D-heavy roles...


MIT OpenCourseWare offers a rigorous linear algebra class. The video lectures and course
materials are all included.

And if you only need to review:

• Linear Algebra Review for Machine Learning (Video Series) - These are the optional
linear algebra review videos for Andrew Ng's machine learning course. The entire 6-part
series can be watched in under 1 hour. Recommended if you've taken linear algebra
before and just need a quick review.
• The Matrix Cookbook (PDF) - Excellent reference resource for matrix algebra.

Step 2: Calculus for Data Science


Calculus is important for several key ML applications. For example, you'll need to be able to
calculate derivatives and gradients for optimization.

In fact, one of the most common optimization techniques is gradient descent.
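
To make that concrete, here's a minimal sketch of gradient descent in plain Python. The function f, the starting point, the learning rate, and the step count are all arbitrary choices for illustration. The gradient is just the pair of partial derivatives worked out by hand.

```python
def f(x, y):
    # A simple bowl-shaped function with its minimum at (3, -1).
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2

def gradient(x, y):
    # Partial derivatives of f with respect to x and y.
    return 2.0 * (x - 3.0), 4.0 * (y + 1.0)

def gradient_descent(x, y, learning_rate=0.1, steps=200):
    # Repeatedly step "downhill", against the direction of the gradient.
    for _ in range(steps):
        dx, dy = gradient(x, y)
        x -= learning_rate * dx
        y -= learning_rate * dy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)
print(x_min, y_min, f(x_min, y_min))
```

Try a learning rate of 0.6 to watch the updates overshoot and blow up. That's the kind of intuition calculus gives you about step sizes.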

Here are some of the best resources for learning calculus for data science:

For application-heavy roles...

Khan Academy has short, practical multivariable calculus lessons. They cover the most
important concepts.

For R&D-heavy roles...

MIT OpenCourseWare offers a rigorous multivariable calculus class. The video lectures and
course materials are all included.

And if you only need to review:

• Multivariable Calculus Review (Video) - This is a quick review of multivariable calculus
in the format of solving practice problems. Recommended if you've taken multivariable
calculus before and just need a quick review.

Step 3: Simple Neural Network from Scratch


Congratulations! You've got the theory out of the way. Now it's time for the really fun part.

One of the best ways to learn math for data science and machine learning is to build a simple
neural network from scratch.

You'll use linear algebra to represent the network and calculus to optimize it. Specifically, you'll
code up gradient descent from scratch.

Don't worry too much about the nuances of neural networks for now. It's ok if you're just
following instructions and writing code. We'll cover machine learning in depth in another guide,
as this is for targeted math practice.
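
As a preview of what the tutorials below build up to, here's a rough sketch of a tiny two-layer network in NumPy. Everything about it is an assumption made for illustration: the XOR toy data, the 4 hidden units, the sigmoid activations, the squared-error loss, and the learning rate. It is not taken from any of the linked tutorials.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a classic toy problem that a single linear layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Linear algebra: each layer is a weight matrix plus a bias vector.
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs):
    hidden = sigmoid(inputs @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

learning_rate = 1.0
losses = []
for step in range(5000):
    hidden, output = forward(X)
    losses.append(float(np.mean((output - y) ** 2)))

    # Calculus: the chain rule gives the loss gradient at each layer.
    d_output = 2.0 * (output - y) / len(X) * output * (1.0 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1.0 - hidden)

    # Gradient descent: nudge every parameter against its gradient.
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

print("first loss:", losses[0], "last loss:", losses[-1])
```

The loss should fall as training goes on. How close the predictions get to the XOR targets depends on the random starting weights, so experiment with the seed and the number of hidden units.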

Follow along with the tutorials, and review theory as you go along. Plus, you'll have a cool
project to add to your portfolio afterward.

Here are a few awesome step-by-step guides:

• Neural Network in Python, Part 2 - This is an incredible tutorial that takes you through a
simple neural network from end to end. It's packed with helpful illustrations, and you'll
learn about how gradient descent fits in.
• Neural Nets to Recognize Handwritten Digits - We love this resource! This is a free
online book that walks you through a famous application of neural networks. It explains
ideas very intuitively, and it's the most in-depth tutorial in this list.
• Implementing a Neural Network from Scratch - A shorter tutorial that also takes you
through step-by-step.

How to Learn Statistics for Data Science, The Self-Starter Way
October 23, 2016


Do you want to learn statistics for data science without taking a slow and expensive course?
Good news... You can master the core concepts, probability, Bayesian thinking, and even
statistical machine learning using only free online resources. Here are the best resources for self-
starters!

By the way... you don't need a math degree to succeed with this approach. Yet, if you do have a
math background, you'll definitely enjoy this fun, hands-on method too.

This guide will equip you with the tools of statistical thinking needed for data science. It will arm
you with a huge advantage over other aspiring data scientists who try to get by without it.

You see, it can be tempting to jump directly into using machine learning packages once you've
learned how to program... And you know what? It's ok if you want to initially get the ball rolling
with real projects.

But, you should never, ever completely skip learning statistics and probability theory. It's
essential to progressing your career as a data scientist.

Here's why...

Pre-requisite: Basic Python Skills


To complete this guide, you'll need at least basic Python* programming skills. We'll be learning
statistics in an applied, hands-on way.

Check out our guide, How to Learn Python for Data Science, The Self-Starter Way, for the
fastest way to get up to speed with Python. We recommend at least completing up to Step 2 in
that guide.

*note: other languages are fine too, but the examples will be in Python.

Statistics Needed for Data Science


Statistics is a broad field with applications in many industries.

Wikipedia defines it as the study of the collection, analysis, interpretation, presentation, and
organization of data. Therefore, it shouldn't be a surprise that data scientists need to
know statistics.

Word cloud credit: Cal. State University

For example, data analysis requires descriptive statistics and probability theory, at a minimum.
These concepts will help you make better business decisions from data.

Key concepts include probability distributions, statistical significance, hypothesis testing,
and regression.

Furthermore, machine learning requires understanding Bayesian thinking. Bayesian thinking is
the process of updating beliefs as additional data is collected, and it's the engine behind many
machine learning models.

Key concepts include conditional probability, priors and posteriors, and maximum
likelihood.

If those terms sound like mumbo jumbo to you, don't worry. This will all make sense once you
roll up your sleeves and start learning.

The Best Way to Learn Statistics for Data Science


By now, you've probably noticed that one common theme in "the self-starter way to learning X"
is to skip classroom instruction and learn by "doing shit."

Mastering statistics for data science is no exception.

In fact, we're going to tackle key statistical concepts by programming them with code! Trust us...
this will be super fun.

If you do not have formal math training, you'll find this approach much more intuitive
than trying to decipher complicated formulas. It allows you to think through the logical steps of
each calculation.

If you do have a formal math background, this approach will help you translate theory into
practice and give you some fun programming challenges.

Here are the 3 steps to learning the statistics and probability required for data science:

1. Core Statistics Concepts: Descriptive statistics, distributions, hypothesis testing, and regression.
2. Bayesian Thinking: Conditional probability, priors, posteriors, and maximum likelihood.
3. Intro to Statistical Machine Learning: Learn basic machine learning concepts and how statistics fits in.

After completing these 3 steps, you'll be ready to attack more difficult machine learning
problems and common real-world applications of data science.

Step 1: Core Statistics Concepts


To know how to learn statistics for data science, it's helpful to start by looking at how it will be
used.

Let's take a look at some examples of real analyses or applications you might need to
implement as a data scientist:

1. Experimental design: Your company is rolling out a new product line, but it sells
through offline retail stores. You need to design an A/B test that controls for differences
across geographies. You also need to estimate how many stores to pilot in for statistically
significant results.
2. Regression modeling: Your company needs to better predict the demand of individual
product lines in its stores. Under-stocking and over-stocking are both expensive. You
consider building a series of regularized regression models.
3. Data transformation: You have multiple machine learning model candidates
you're testing. Several of them assume specific probability distributions of input data, and
you need to be able to identify them and either transform the input data appropriately or
know when underlying assumptions can be relaxed.
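
To give a taste of the first example, here's a rough sketch of a sample-size estimate using only the standard library. It assumes a simple two-sided z-test comparing the mean sales of two equally sized groups of stores with a common standard deviation, and it ignores the geographic controls mentioned above. The function name and the sales figures are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def stores_per_group(effect, std_dev, alpha=0.05, power=0.8):
    """Approximate stores needed in EACH group for a two-sided z-test
    comparing two means that share a common standard deviation."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # significance cutoff
    z_power = NormalDist().inv_cdf(power)              # desired power
    n = 2.0 * ((z_alpha + z_power) * std_dev / effect) ** 2
    return ceil(n)

# Hypothetical numbers: detect a 500-unit lift in weekly sales per store
# when sales vary from store to store with a standard deviation of 1,000.
print(stores_per_group(effect=500.0, std_dev=1000.0))
```

Halving the effect you want to detect roughly quadruples the number of stores, which is why this estimate is worth doing before the pilot starts.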

A data scientist makes hundreds of decisions every day. They range from small ones like how to
tune a model all the way up to big ones like the team's R&D strategy.

Many of these decisions require a strong foundation in statistics and probability theory.

For example, data scientists often need to decide which results are believable and which are
likely due to randomness. Plus, they need to know if there are pockets of interest that
should be explored further.

These are central skills in analytical decision making (knowing how to calculate p-values is only
scratching the surface).
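
One way to ask "could this be randomness?" with code instead of formulas is a permutation test. The sketch below is a minimal version with made-up sales numbers: it shuffles the group labels many times and counts how often chance alone produces a gap in means at least as large as the observed one.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical daily sales under two store layouts.
layout_a = [20, 22, 19, 24, 25, 23, 21, 22]
layout_b = [28, 27, 30, 26, 29, 31, 27, 28]
p_value = permutation_p_value(layout_a, layout_b)
print(p_value)
```

A small p-value means shuffled labels rarely reproduce a gap this large, so the difference is unlikely to be pure noise.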

Here's one of the best resources we've found for learning basic statistics as a self-starter:

Think like a statistician...

Think Stats is an excellent book (with free PDF version) introducing all the key concepts. The
premise of the book? If you know how to program, then you can use that skill to teach yourself
statistics. We've found this approach to be very effective, even for those with formal math
backgrounds.

Step 2: Bayesian Thinking


One of the philosophical debates in statistics is between Bayesians and frequentists. The
Bayesian side is more relevant when learning statistics for data science.

In a nutshell, frequentists use probability only to model sampling processes. This means they
only assign probabilities to describe data they've already collected.

On the other hand, Bayesians use probability to model sampling processes and to quantify
uncertainty before collecting data. If you'd like to learn more about this divide, check out this
Quora post: For a non-expert, what's the difference between Bayesian and frequentist
approaches?

In Bayesian thinking, the level of uncertainty before collecting data is called the prior
probability. It's then updated to a posterior probability after data is collected. This is a central
concept to many machine learning models, so it's important to master.
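
Here's a minimal sketch of that prior-to-posterior update. The scenario is hypothetical: a coin is either fair (50% heads) or biased (80% heads), and we start out with no idea which. Each flip updates the belief using Bayes' rule.

```python
def update(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Prior: before seeing any data, both hypotheses are equally plausible.
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}

belief = prior
for flip in "HHTHH":  # hypothetical observed flips
    likelihood = {
        h: p_heads[h] if flip == "H" else 1.0 - p_heads[h] for h in belief
    }
    belief = update(belief, likelihood)  # the posterior becomes the next prior

print(belief)
```

After four heads and one tail, the belief shifts toward the biased coin without ruling out the fair one. More data would sharpen it further.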

Again, all of these concepts will make sense once you implement them.

Here's one of the best resources we've found for learning Bayesian thinking as a self-starter:

Think like a Bayesian...

Think Bayes is the follow-up book (with free PDF version) to Think Stats. It's all about Bayesian
thinking, and it uses the same approach of using programming to teach yourself statistics. This
approach is fun and intuitive, and you'll learn each concept's underlying mechanics well since
you'll be implementing them.

Step 3: Intro to Statistical Machine Learning


If you want to learn statistics for data science, there's no better way than playing with statistical
machine learning models after you've learned core concepts and Bayesian thinking.

The statistics and machine learning fields are closely linked, and "statistical" machine learning is
the main approach to modern machine learning.

In this step, you'll be implementing a few machine learning models from scratch. This will help
you unlock true understanding of their underlying mechanics.

At this stage, it's fine if you're just copying code, line-by-line.

This helps you break open the black box of machine learning while solidifying your
understanding of the applied statistics required for data science.

The following models were chosen because they illustrate several of the key concepts from
earlier.

Linear Regression

First, we have the poster child of predictive modeling...

• Linear Regression from Scratch in Python
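
To show how little code "from scratch" can mean, here's a minimal sketch of one-variable least squares in plain Python. The data points are made up, and this closed-form version is an illustrative choice, not necessarily the method used in the linked tutorial.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical inputs
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # hypothetical noisy outputs
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Notice that the slope is just a covariance divided by a variance, two of the descriptive statistics from Step 1.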

Naive Bayes Classifier

Next, we have an embarrassingly simple model that works pretty darn well...

• Intuitive Introduction, Naive Bayes from Scratch in Python

Multi-Armed Bandits

And finally, we have the famous "20 lines of code that beat any A/B test!"

• Intuitive Introduction, Multi-Armed Bandits from Scratch in Python
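
For a flavor of the idea, here's a rough sketch of one simple bandit strategy, epsilon-greedy, run as a simulation. The strategy choice, the conversion rates, and the parameter values are assumptions made for illustration and may differ from the linked tutorials.

```python
import random

def epsilon_greedy(true_rates, n_rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over arms that pay out 0 or 1."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_rounds):
        if 0 in pulls or rng.random() < epsilon:
            # Explore: pick an arm at random (always, until each arm is tried).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best observed win rate so far.
            estimates = [w / p for w, p in zip(wins, pulls)]
            arm = estimates.index(max(estimates))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins

# Hypothetical conversion rates for three page variants.
pulls, wins = epsilon_greedy([0.05, 0.10, 0.20])
print(pulls, wins)
```

Unlike a fixed A/B split, the bandit shifts traffic toward whichever variant looks best as evidence comes in, while still exploring the others a fraction of the time.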

If you're hungry for more, we recommend the following resource. We'll also be coming out with
a detailed guide for learning machine learning the self-starter way, so stay tuned.

For your reference...

An Introduction to Statistical Learning is a wonderful textbook (with free PDF version)
that you can use as a reference. The examples are in R, and the book covers a much broader
range of topics, making this a valuable tool as you progress into more work in machine learning.

More Resources

• How to Learn Math for Data Science, The Self-Starter Way
• 6 Fun Machine Learning Projects for Beginners
• Supercharge Your Data Science Career: 88 Free Resources