3. Data science problems come in various shapes and sizes. I say "data
science" because you asked for "useful applications". Handling a data science problem
involves working through several stages.
One of those stages is to extract features from the input, say a speech signal or an
image. Extracting good features requires you to know the properties of the underlying
signals, and digital signal processing or image processing will help you gain that
expertise. You will be in a better position to judge which feature extraction works and
which does not. You will also want to learn what a Fourier transform is, because you may
want to apply it to a speech signal, or apply a discrete cosine transform to images,
before feeding them as features to your machine learning system.
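If the Fourier transform is new to you, this tiny pure-Python sketch shows what it computes; the sample signal is made up for illustration, and real code would use an optimized FFT library rather than this O(n^2) loop:

```python
import cmath
import math

def dft(signal):
    # Naive discrete Fourier transform, O(n^2); for illustration only.
    # Real projects use an optimized FFT implementation instead.
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A pure 2-cycle cosine sampled at 8 points: all of the energy lands in
# frequency bins 2 and 6 (bin 6 is the mirrored negative frequency).
samples = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
magnitudes = [round(abs(x), 6) for x in dft(samples)]
```

Seeing a clean spike at the signal's frequency is exactly the kind of feature that is hard to read off the raw samples.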
5. Calculus: You need to know how to calculate derivatives/gradients, since gradient
descent, one of the most common optimization methods used in machine learning,
actually requires computing the gradient.
Linear Algebra: You should be comfortable with representations and
computations in terms of vectors and matrices instead of single numbers. Sometimes
you will need to apply calculus together with linear algebra, for example to find the
derivative of a function with respect to a vector (sounds mysterious?). Concepts like
the transpose and the inverse are also fundamental.
Probability Theory: You need to master some basic concepts like conditional
probability and independence (these two are used in a classic ML algorithm named
naive Bayes).
Statistics: You should be familiar with expectation, variance, standard
deviation and probability distributions (ideally in higher dimensions, not just
one).
Optimization: Well, perhaps you do not need to worry much about this, since most
ML courses come bundled with the optimization methods you will use, like gradient
descent.
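To make the gradient descent point concrete, here is a minimal sketch in Python; the quadratic objective is a made-up toy, but the update rule is the one behind most ML training:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient; this loop is the core of
    # most training procedures in machine learning.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy example: minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The same idea carries over to vectors: x and grad(x) become arrays, which is exactly where the linear algebra comes in.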
Last but not least, you may first need to decide what style/flavor of ML
you want to learn.
If you want to focus on the mathematical theory of ML (the what and why), it
will require a solid background in maths, such as COS 511 (Spring 2014)
offered by Rob Schapire at Princeton. Alternatively, you may want to focus on
implementations (the how), which is less mathy (easier :) ). In fact, you probably
need not understand too much about the mathematical underpinnings of an ML
algorithm to make it work for you; you just need to remember some tricks.
==============
3. Applied Math + Algorithms: For discriminative models like SVMs [10], you
need to have a firm understanding of algorithm theory. Even though you will
probably never need to implement an SVM from scratch, it helps to understand how
the algorithm works. You will need to understand subjects like convex optimization
[11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14],
partial differential equations [15], etc. Get used to looking at summations [16].
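To see how some of those pieces fit together, here is a toy linear SVM trained by subgradient descent on the hinge loss, in pure Python. This is a sketch, not how real solvers work: production SVM libraries solve the dual quadratic program (which is where the Lagrange machinery comes in), and the data here is made up:

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=500):
    # Minimize lam * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))
    # by subgradient descent. Labels must be +1 or -1.
    w = [0.0] * len(points[0])
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        gw = [2 * lam * wi for wi in w]   # gradient of the regularizer
        gb = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                # only margin violators contribute
                for i, xi in enumerate(x):
                    gw[i] -= y * xi / n
                gb -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Made-up, linearly separable 2-D data.
X = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
```

Even this toy makes the summations and the margin condition tangible in a way the formulas alone do not.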
4. Distributed Computing: Most machine learning jobs require working with
large data sets these days (see Data Science) [17]. You cannot process this data on a
single machine; you will have to distribute it across an entire cluster. Projects like
Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy
and cost-effective. Although Hadoop abstracts away a lot of the hard-core
distributed computing problems, you still need to have a firm understanding of
map-reduce [22], distributed file systems [19], etc. You will most likely want to check
out Apache Mahout [20] and Apache Whirr [21].
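The map-reduce idea itself fits in a few lines of Python; this word-count sketch mirrors the canonical Hadoop example (in a real job, the shuffle between the two phases happens across the cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reducer: sum the counts for each key, as Hadoop does after
    # shuffling all pairs with the same key to the same reducer.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big cluster", "big cluster"]
word_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
```

Because mappers only see one document and reducers only see one key's pairs, both phases parallelize trivially, and that is the whole point.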
5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need
to modify the format of your data sets so they can be loaded into R, Hadoop, HBase
[23], etc. You can use a scripting language like Python (using re) to do this, but the
best approach is probably to master all of the awesome Unix tools that were
designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut
[30], tr [31], and many more. Since all of the processing will most likely be on
Linux-based machines (Hadoop doesn't run on Windows, I believe), you will have
access to these tools. You should learn to love them and use them as much as
possible. They certainly have made my life a lot easier. A great example can be
found here [1].
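For comparison, the Python-with-re route looks like this; the log line format here is invented purely for illustration, and a sed or awk one-liner could do the same job:

```python
import re

# Hypothetical input line, made up for this example:
line = "2013-01-05 12:00:01 user=alice status=OK"

# Pull out the fields we care about and re-emit them comma-separated,
# the kind of reshaping you would otherwise do with cut/awk/sed.
match = re.match(r"(\S+) (\S+) user=(\S+) status=(\S+)", line)
csv_row = ",".join(match.groups())
```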
6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32],
Hive [33], Mahout, etc. These projects can help you store/access your data, and they
scale.
7. Learn about advanced signal processing techniques: Feature extraction is
one of the most important parts of machine learning. If your features suck, no
matter which algorithm you choose, you're going to see horrible performance.
Depending on the type of problem you are trying to solve, you may be able to utilize
really cool advanced signal processing algorithms like wavelets [42], shearlets [43],
curvelets [44], contourlets [45] and bandlets [46]. Learn about time-frequency analysis
[47], and try to apply it to your problems. If you have not read about Fourier
analysis [48] and convolution [49], you will need to learn about this stuff too. The
latter is signal processing 101 stuff, though.
Finally, practice and read as much as you can. In your free time, read papers like
Google Map-Reduce [34], Google File System [35], Google Big Table [36], The
Unreasonable Effectiveness of Data [37], etc. There are great free machine learning
books online, and you should read those too [38][39][40]. Here is an awesome
course I found and re-posted on GitHub [41]. Instead of using open source packages,
code up your own and compare the results. If you can code an SVM from scratch,
you will understand the concepts of support vectors, gamma, cost, hyperplanes, etc.
It's easy to just load some data up and start training; the hard part is making sense
of it all.
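In the spirit of coding things up yourself, here is about the simplest from-scratch classifier there is, 1-nearest-neighbour, on made-up data; once it works, comparing its predictions against a library implementation is a good exercise:

```python
import math

def nearest_neighbor_predict(train_points, train_labels, query):
    # 1-NN: predict the label of the closest training point under
    # Euclidean distance (math.dist requires Python 3.8+).
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], query))
    return train_labels[best]

# Made-up 2-D training data with two well-separated classes.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
y = ["a", "a", "b", "b"]
pred = nearest_neighbor_predict(X, y, (5.5, 4.0))
```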
=========================================================
Why tools? Because they can automate parts of the project and give you results faster.
When you are working on a large project, machine learning tools can help you to
prototype a solution, figure out the requirements, and give you a template for the system
that you may want to implement.
One useful way to think about machine learning tools is to separate them
into platforms and libraries. A platform provides everything you need to run a project,
whereas a library only provides discrete capabilities or parts of what you need to
complete a project.
Libraries:
scikit-learn in Python.
JSAT in Java.
Accord Framework in .NET.
Platforms:
KNIME
RapidMiner
Orange
-----------------------------------------------
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Overview
When crunching data to model business decisions, you are most typically using
supervised and unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image
classification where there are large datasets with very few labeled examples.
In this section, I list many of the popular machine learning algorithms, grouped in the
way I think is most intuitive. The list is not exhaustive in either the groups or the
algorithms, but I think it is representative and will give you a useful idea of the lay of
the land.
Please Note: There is a strong bias towards algorithms used for classification and
regression, the two most prevalent supervised machine learning problems you will
encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments
and share it with us. Let's dive in.
Regression Algorithms
Instance-based Algorithms
Regularization Algorithms
Ridge Regression
Least Absolute Shrinkage and Selection Operator (LASSO)
Elastic Net
Least-Angle Regression (LARS)
Decisions fork in tree structures until a prediction decision is made for a given record.
Decision trees are trained on data for classification and regression problems. Decision
trees are often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:
Classification and Regression Tree (CART)
Iterative Dichotomiser 3 (ID3)
C4.5 and C5.0
Chi-squared Automatic Interaction Detection (CHAID)
Decision Stump
Conditional Decision Trees
Bayesian Algorithms
Naive Bayes
Gaussian Naive Bayes
Multinomial Naive Bayes
Averaged One-Dependence Estimators (AODE)
Bayesian Belief Network (BBN)
Bayesian Network (BN)
Clustering Algorithms
k-Means
k-Medians
Expectation Maximisation (EM)
Hierarchical Clustering
Association Rule Learning Algorithms
Apriori algorithm
Eclat algorithm
Note that I have separated out Deep Learning from neural networks because of the
massive growth and popularity in the field. Here we are concerned with the more
classical methods.
The most popular artificial neural network algorithms are:
Perceptron
Back-Propagation
Hopfield Network
Radial Basis Function Network (RBFN)
Ensemble Algorithms
Boosting
Bootstrapped Aggregation (Bagging)
AdaBoost
Stacked Generalization (blending)
Gradient Boosting Machines (GBM)
Gradient Boosted Regression Trees (GBRT)
Random Forest