Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra,
Matrix Computations, Numerical Analysis or Matrix Analysis, and it can be either a CS or an
Applied Math course). Matrix decomposition algorithms are fundamental to many data mining
applications and are usually underrepresented in a standard "machine learning" curriculum. With
terabytes of data, traditional tools such as Matlab are no longer suitable for the job: you cannot
just run eig() on Big Data. Distributed matrix computation packages such as those included in
Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric
algorithms and LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for
special cases, build your own and scale them up to terabytes of data on a cluster of commodity
machines.[6] Numerics courses are usually built upon undergraduate algebra and calculus, so you
should be fine with the prerequisites. I'd recommend these resources for self study/reference
material:
See Jack Dongarra : Courses and What are some good resources for learning about numerical
analysis?
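To make the point about eig() concrete, here is a minimal sketch of power iteration in plain Python. It is only an illustration, not the method used by Mahout or LAPACK: it finds just the dominant eigenvalue, and it assumes that eigenvalue is unique in magnitude. The function names and the 2x2 test matrix are made up for the example. The thing to notice is that the only operation touching the matrix is a matrix-vector product, and each row's dot product is independent, which is why this family of methods can be split across machines.

```python
import math


def mat_vec(rows, v):
    """Multiply a matrix (a list of rows) by a vector.

    Every row is processed independently, so on a cluster each worker
    could hold a block of rows and compute its slice of the result.
    """
    return [sum(a * x for a, x in zip(row, v)) for row in rows]


def power_iteration(rows, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(rows)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = mat_vec(rows, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("the current vector was mapped to zero")
        w = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector estimates the eigenvalue.
        new_eigenvalue = sum(a * b for a, b in zip(w, mat_vec(rows, w)))
        converged = abs(new_eigenvalue - eigenvalue) < tol
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    return eigenvalue, v


# Hypothetical symmetric test matrix; its eigenvalues are (5 +/- sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
eigenvalue, eigenvector = power_iteration(A)
print(eigenvalue, eigenvector)
```

Production packages use more elaborate schemes (Lanczos-type methods, randomized SVD), but they lean on the same building block of repeated matrix-vector products.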
2) Learn about distributed computing
It is important to learn how to work with a Linux cluster and how to design scalable distributed
algorithms if you want to work with big data (Why the current obsession with big data?).
Crays and Connection Machines of the past can now be replaced with farms of cheap cloud
instances: computing costs dropped to less than $1.80/GFlop in 2011, versus $15M in 1984
(see http://en.wikipedia.org/wiki/FLOPS ).
If you want to squeeze the most out of your (rented) hardware it is also becoming increasingly
important to be able to utilize the full power of multicore (see http://en.wikipedia.org/wiki/Moo... )
Note: this topic is not part of a standard Machine Learning track but you can probably find courses
such as Distributed Systems or Parallel Programming in your CS/EE catalog. See distributed
computing resources, a systems course at UIUC, key works, and for starters: Introduction to
Computer Networking.
After studying the basics of networking and distributed systems, I'd focus on distributed databases,
which will soon become ubiquitous as the data deluge hits the limits of vertical scaling.
See key works, research trends and for starters: Introduction to relational
Start learning statistics by coding with R: What are essential references for R? and experiment with
real-world data: Where can I find large datasets open to the public?
Cosma Shalizi compiled some great materials on computational statistics, check out his lecture
slides, and also What are some good resources for learning about statistical analysis?
I've found that learning statistics in a particular domain (e.g. Natural Language Processing) is
much more enjoyable than taking Stats 101. My personal recommendation is the course by Michael
Collins at Columbia (also available on Coursera).
You can also choose a field where the use of quantitative statistics and causality principles [7] is
inevitable, say molecular biology [8], or a fun sub-field such as cancer research [9], or an even
narrower domain, e.g. genetic analysis of tumor angiogenesis [10], and try answering important
questions in that particular field, learning what you need in the process.
4) Learn about optimization
Start with Stephen P. Boyd's video lectures and also What are some good resources to learn about
optimization?
5) Learn about machine learning
Before you start thinking about algorithms, look carefully at the data and select features that help
you filter signal from noise. See this talk by Jeremy Howard: At Kaggle, It's a Disadvantage To Know
Too Much
Also see How do I learn machine learning? and What are some introductory resources for learning
about large scale machine learning? Why?
You can structure your study program according to online course catalogs and curricula of MIT,
Stanford or other top schools.
Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web
crawler in your garage: The Anatomy of a Search Engine
You can join one of these startups and learn by doing: What startups are hiring engineers with
strengths in machine learning/NLP?
The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if
you prefer studying in a formal setting. See: What makes a Master's in Computer Science (MS CS)
degree worth it and why?
Try to avoid overspecialization. The breadth-first approach often works best when learning a new
field and dealing with hard problems, see the Second voyage of HMS Beagle on the adventures of
an ingenious young data miner.
6) Learn about information retrieval
What are some good resources to get started with Information Retrieval? Why?
7) Learn about signal detection and estimation
This is a classic topic and "data science" par excellence in my opinion. Some of these methods
were used to guide the Apollo mission or detect enemy submarines and are still in active use in
many fields.
Good references are Robert F. Stengel's lecture slides on optimal control and estimation (Rob
Stengel's Home Page), Alan V. Oppenheim's Signals and Systems, and What are some good
resources for learning about signal estimation and detection? A good topic to focus on first
is the Kalman filter, widely used for time series forecasting.
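If you have never seen a Kalman filter, a one-dimensional version fits in a few lines. The sketch below assumes the simplest possible model: a scalar state that follows a random walk and is observed directly with noise. The function name, the variance parameters and the alternating test readings are all made up for illustration; real applications track vector states with matrix-valued covariances.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise.

    process_var     -- assumed variance of the state's drift per step
    measurement_var -- assumed variance of the sensor noise
    Returns the filtered estimate after each measurement.
    """
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is expected to stay put, but uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement, weighted by the gain.
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= 1.0 - gain
        estimates.append(estimate)
    return estimates


# Hypothetical readings that bounce around a true value of 5.0.
readings = [5.0 + 0.5 * (-1) ** k for k in range(50)]
smoothed = kalman_1d(readings, process_var=1e-4, measurement_var=0.25)
print(smoothed[-1])
```

The gain is the whole trick: when the filter is unsure of its own estimate it trusts the new measurement, and as its uncertainty shrinks it increasingly smooths the noise away.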
Talking about data, you probably want to know something about information: its transmission, its
compression and the filtering of signal from noise. The methods developed by communication
engineers in the 60s (such as the Viterbi decoder, now used in about a billion cellphones, or the
Gabor wavelet, widely used in iris recognition) are applicable to a surprising variety of data
analysis tasks, from statistical machine translation to understanding the organization and function
of molecular networks. A good resource for starters is Information Theory and Reliable
Communication by Robert G. Gallager. Also: What are some good resources for learning
about information theory?
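To see why the Viterbi algorithm travels so well between fields, here is a minimal sketch of it on a toy hidden Markov model rather than on a channel code. The states, observations and probabilities are the usual made-up textbook values, not real data, and the function assumes a non-empty observation sequence. A real decoder would also work in log space to avoid underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state to the start.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model with illustrative, made-up probabilities.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Swap "hidden health state" for "transmitted bit", "part-of-speech tag" or "gene region" and the same dynamic program applies, which is the sense in which these communication-engineering methods generalize.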
8) Master algorithms and data structures
What are the most learner-friendly resources for learning about algorithms?
9) Practice
Carpentry: http://software-carpentry.org/