Input features:
d features: X1, X2, …, Xd. Each Xj has domain Oj
Categorical: Oj = {red, blue}; Numerical: Oj = (0, 10)
Data D:
n examples (xi, yi), where xi is a d-dimensional feature vector and yi is the output variable
Task:
Given an input data vector x, predict y
Decision trees:
Split the data at each internal node
Each leaf node makes a prediction
(Figure: example decision tree; internal node A splits on X1 < v1, and a leaf predicts y = 0.42.)
Lecture today:
Binary splits: Xj < v
Numerical attributes
Regression
Input: example xi; Output: predicted yi
Drop xi down the tree until it hits a leaf node
Predict the value stored in the leaf that xi hits
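As a concrete illustration of this prediction procedure (not part of the original slides), here is a minimal Python sketch; the Node class and its field names are my own assumptions.

```python
# Minimal sketch of prediction with a binary decision tree (illustrative names only).
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index j of the attribute Xj tested at this node
        self.threshold = threshold  # split value v in the test Xj < v
        self.left = left            # subtree for Xj < v
        self.right = right          # subtree for Xj >= v
        self.value = value          # prediction stored at a leaf

def predict(node, x):
    """Drop example x down the tree until it hits a leaf, then return the leaf's value."""
    while node.value is None:       # internal node: keep descending
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

# Tiny example: a tree that splits on X1 < 5 and stores a value in each leaf.
tree = Node(feature=0, threshold=5.0, left=Node(value=0.42), right=Node(value=1.7))
print(predict(tree, [3.2]))         # -> 0.42
```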
(Figure: the example tree again, annotated with the number of training examples reaching each node, e.g., |D|=90 at the root and |D|=20, |D|=15, |D|=30 at lower nodes, with further splits such as X3 < v4 and X2 < v5.)
Alternative view:
(Figure: the training examples plotted over features X1 and X2; the tree's splits partition this space into axis-aligned regions.)
BuildSubtree
(1) How to split? Pick the attribute & value that optimizes some criterion
Classification: Information Gain
IG(Y|X) = H(Y) − H(Y|X)
Entropy: H(Y) = −∑_{j=1..m} p_j log p_j, where Y takes m values and p_j = P(Y = y_j)
(Figure: running example tree: root A (|D|=90) splits on X1 < v1; leaf B (|D|=45) predicts 0.42; node C (|D|=45) splits on X2 < v2; nodes D and E split on X3 < v4 and X2 < v5 over smaller subsets.)
Conditional entropy: H(W|Z) = ∑_{j=1..m} P(Z = vj) · H(W | Z = vj)
Suppose Z takes m values (v1 … vm); H(W | Z = v) is the entropy of W among the records in which Z has value v
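To make these definitions concrete, here is a small Python sketch (my own illustration, not from the slides) that computes entropy and the information gain of a categorical split; all names are arbitrary.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_j p_j * log2(p_j) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_values):
    """IG(Y|Z) = H(Y) - sum_v P(Z=v) * H(Y | Z=v)."""
    n = len(labels)
    by_value = {}
    for y, z in zip(labels, split_values):
        by_value.setdefault(z, []).append(y)
    conditional = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - conditional

# Toy example: the split value perfectly predicts the label, so the gain is maximal.
y = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']
z = ['red',  'red',  'blue', 'blue', 'blue', 'red']
print(information_gain(y, z))   # -> 1.0
```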
Regression: find the split (Xi, v) that creates D, DL, DR (the parent, left child, and right child datasets) and maximizes the variance reduction:
|D|·Var(D) − (|DL|·Var(DL) + |DR|·Var(DR))
where Var(D) is the variance of yi in D
For ordered domains, sort Xi and consider a split between each pair of adjacent values
For categorical Xi, find the best split based on subsets of the values
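A minimal sketch (my own, under the variance-reduction criterion stated above) of scanning the candidate splits of one ordered attribute; statistics.pvariance is the population variance, and the function names are illustrative.

```python
import statistics

def split_quality(y_parent, y_left, y_right):
    """|D|*Var(D) - (|DL|*Var(DL) + |DR|*Var(DR)); larger is better."""
    var = lambda ys: statistics.pvariance(ys) if ys else 0.0
    return (len(y_parent) * var(y_parent)
            - len(y_left) * var(y_left)
            - len(y_right) * var(y_right))

def best_numerical_split(xs, ys):
    """Sort on attribute Xi and try a split between each pair of adjacent distinct values."""
    pairs = sorted(zip(xs, ys))
    y_all = [y for _, y in pairs]
    best_v, best_gain = None, float('-inf')
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                 # no split point between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        gain = split_quality(y_all, y_all[:i], y_all[i:])
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

print(best_numerical_split([1.2, 1.4, 7.2, 8.1], [0.4, 0.5, 3.0, 3.2]))
```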
Classification:
Predict most common yi of the examples in the leaf
Announcements:
Machine Learning Gradiance quiz has been posted. Due 2013-03-02 23:59 Pacific time.
HW4 will be released today.
General considerations:
Tree is small (can keep it in memory):
Shallow (~10 levels)
Dataset too large to keep in memory
Dataset too big to scan over on a single machine
MapReduce to the rescue!
PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB 09]
A sequence of MapReduce jobs that builds a decision tree
Setting:
Hundreds of numerical (discrete & continuous, but not categorical) attributes
Target variable is numerical: regression
Splits are binary: Xj < v
Decision tree is small enough for each Mapper to keep in memory
Data too large to keep in memory
(Figure: PLANET architecture: a Master coordinates the Model, Attribute metadata, and Intermediate results, and launches FindBestSplit / BuildSubTree MapReduce jobs to grow the tree.)
Mapper considers a number of possible splits (Xi, v) on its subset of the data
For each split it stores partial statistics
Partial split statistics are sent to Reducers
Reducer collects all partial statistics and determines the best split
Tree grows by one level
Mapper loads the model and info about which attribute splits to consider
Each mapper sees a subset of the data D*
Mapper drops each datapoint down the tree to find the appropriate leaf node L
For each leaf node L it keeps statistics about:
(1) the data reaching L
(2) the data in the left/right subtree under split S
Reducer aggregates the statistics (1), (2) and determines the best split for each tree node
Master
Monitors everything (runs multiple MapReduce jobs)
Model file
A file describing the state of the model
Master Node
MapReduce Initialization (run once first)
MapReduce FindBestSplit (run multiple times)
MapReduce InMemoryBuild (run once last)
Controls the entire process
Determines the state of the tree and grows it:
(1) Decides if nodes should be split
(2) If there is little data entering a tree node, Master runs an InMemoryBuild MapReduce job to grow the entire subtree
(3) For larger nodes, Master launches MapReduce FindBestSplit to evaluate candidates for the best split
Master also collects results from FindBestSplit and chooses the best split for a node
InMemoryQueue (InMemQ)
Nodes for which the data D in the node fits in memory
FindBestSplit: for a given set of tree nodes S, computes the quality of each candidate split (for each node in S)
InMemoryBuild: for a given set of nodes S, completes tree induction at the nodes in S using the InMemoryBuild algorithm (discussed later)
Master Node
MapReduce Initialization (run once first)
MapReduce FindBestSplit (run multiple times)
MapReduce InMemoryBuild (run once last)
Initialization job: Identifies all the attribute values which need to be considered for splits
Initialization process generates attribute metadata to be loaded in memory by other tasks
Xj: 1.2 1.3 1.4 1.6 2.1 7.2 8.1 9.8 10.1 10.2 10.3 10.4 11.5 11.7 12.8 12.9
Idea: Consider a limited number of splits such that splits move about the same amount of data
For attribute Xj we would like to consider every possible value v ∈ Oj
Instead, compute an approximate equi-depth histogram on D*
Idea: Select buckets such that counts per bucket are equal
Goal:
Equal number of elements per bucket (B buckets total)
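A minimal sketch of this idea (an exact, in-memory version; the PLANET Initialization job computes it only approximately over distributed data). The function name is my own, and the example reuses the sorted Xj values shown above.

```python
def equi_depth_boundaries(values, num_buckets):
    """Pick split points so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for b in range(1, num_buckets):
        i = b * n // num_buckets            # index that leaves ~n/num_buckets values per bucket
        boundaries.append(ordered[i])
    return boundaries

# 16 sorted values of Xj from the slide, 4 buckets -> 3 candidate split points.
xj = [1.2, 1.3, 1.4, 1.6, 2.1, 7.2, 8.1, 9.8, 10.1, 10.2, 10.3, 10.4, 11.5, 11.7, 12.8, 12.9]
print(equi_depth_boundaries(xj, 4))   # -> [2.1, 10.1, 11.5]
```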
Master Node
MapReduce Initialization (run once first)
MapReduce FindBestSplit (run multiple times)
MapReduce InMemoryBuild (run once last)
FindBestSplit: MapReduce job to find the best split when there is too much data to fit in memory
Goal: for a particular node j, find the attribute Xj and value v that maximize:
|D|·Var(D) − (|DL|·Var(DL) + |DR|·Var(DR))
where
D  = training data (xi, yi) reaching the node j
DL = training data xi where xi(j) < v
DR = training data xi where xi(j) ≥ v
Var(D) = 1/(n−1) · (∑i yi² − (∑i yi)²/n)
Note: the variance can be computed from the sufficient statistics n, ∑yi, ∑yi²
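A small illustration (not from the slides) of why the sufficient statistics n, ∑yi, ∑yi² are enough: partial statistics from different mappers can simply be added, and the variance is then computed at the reducer.

```python
def variance_from_stats(n, sum_y, sum_y2):
    """Var(D) = 1/(n-1) * (sum(y^2) - (sum(y))^2 / n), from sufficient statistics only."""
    return (sum_y2 - sum_y * sum_y / n) / (n - 1)

# Partial statistics from two mappers can simply be added before computing the variance.
part1 = (3, 6.0, 14.0)    # n, sum(y), sum(y^2) for y = [1, 2, 3]
part2 = (2, 9.0, 41.0)    # same statistics for y = [4, 5]
n, s, s2 = (a + b for a, b in zip(part1, part2))
print(variance_from_stats(n, s, s2))   # -> 2.5, the sample variance of [1, 2, 3, 4, 5]
```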
Mapper:
Initialized by loading results of Initialization task
Current model (to find which node each datapoint xi ends up in)
Attribute metadata (all split points for each attribute)
For each data record run the Map algorithm
For each node store statistics of the data entering the node and at the end emit (to all reducers):
<NodeId, {Σy, Σy², Σ1}>
Requires: split node set S, model file M
For every training record (xi, yi) do:
  Node n = TraverseTree(M, xi)
  if n ∈ S:
    update Tn ← yi                      // store {Σy, Σy², Σ1} for each node
    for j = 1 … d:                      // d = number of features
      v = xi(j)                         // value of feature Xj of example xi
      for each split point s of feature Xj such that s < v:
        update Tn,j[s] ← yi             // store {Σy, Σy², Σ1} for (node n, Xj, split s)
MapFinalize: Emit
  <NodeId, {Σy, Σy², Σ1}>               // sufficient statistics (so we can later
  <SplitId, {Σy, Σy², Σ1}>              // compute the variance reduction)
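A Python sketch of the Map phase above (my own rendering of the pseudocode, not PLANET's actual code); traverse_tree and split_points stand in for what the model file and attribute metadata provide.

```python
from collections import defaultdict

def add(stats, y):
    """Update a [sum y, sum y^2, count] triple in place."""
    stats[0] += y
    stats[1] += y * y
    stats[2] += 1

def find_best_split_map(records, split_nodes, traverse_tree, split_points):
    """Accumulate sufficient statistics per node and per candidate split, then 'emit' them."""
    node_stats = defaultdict(lambda: [0.0, 0.0, 0])    # T_n, keyed by NodeId
    split_stats = defaultdict(lambda: [0.0, 0.0, 0])   # T_{n,j}[s], keyed by (NodeId, j, s)
    for x, y in records:
        n = traverse_tree(x)                           # node of the current model that x reaches
        if n not in split_nodes:
            continue
        add(node_stats[n], y)
        for j, v in enumerate(x):                      # for every feature Xj of the example
            for s in split_points[j]:
                if s < v:                              # slide's rule: update T_{n,j}[s] when s < xi(j)
                    add(split_stats[(n, j, s)], y)
    # MapFinalize: emit the aggregated sufficient statistics to the reducers
    return list(node_stats.items()), list(split_stats.items())

# Tiny usage example: one node 'A', one feature, candidate splits at 2.0 and 5.0.
nodes, splits = find_best_split_map([([1.0], 1.0), ([6.0], 2.0)], {'A'},
                                    lambda x: 'A', {0: [2.0, 5.0]})
print(nodes)    # [('A', [3.0, 5.0, 2])]
print(splits)   # [(('A', 0, 2.0), [2.0, 4.0, 1]), (('A', 0, 5.0), [2.0, 4.0, 1])]
```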
Reducer:
(1) Load all the <NodeId, List{Σy, Σy², Σ1}> pairs and aggregate the per-node statistics
(2) For all the <SplitId, List{Σy, Σy², Σ1}> pairs, aggregate and run the reduce algorithm
For each NodeId, output the best split found
Reduce(SplitId, values):
  split = NewSplit(SplitId)
  best = BestSplitSoFar(split.nodeid)
  for stats in values:
    split.stats.AddStats(stats)
  left  = GetImpurity(split.stats)
  right = GetImpurity(split.node.stats − split.stats)
  split.impurity = left + right
  if split.impurity < best.impurity:
    UpdateBestNodeSplit(split.nodeid, split)
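Correspondingly, a Python sketch of the Reduce phase (again my own rendering): impurity here is the unnormalized n·Var(D) computed from the sufficient statistics, and the other side of each split is obtained by subtracting the split statistics from the per-node totals.

```python
def find_best_split_reduce(node_stats, split_stats):
    """Pick, for every node, the candidate split with the lowest total (left + right) impurity."""
    def impurity(stats):
        # n * Var(D) from sufficient statistics: sum(y^2) - (sum(y))^2 / n
        sum_y, sum_y2, n = stats
        return sum_y2 - sum_y * sum_y / n if n > 0 else 0.0

    best = {}                                             # NodeId -> (impurity, (attribute, split value))
    for (node_id, j, s), stats in split_stats.items():
        total = node_stats[node_id]                       # aggregated stats of all data at the node
        other = [t - p for t, p in zip(total, stats)]     # stats of records on the other side of the split
        split_impurity = impurity(stats) + impurity(other)
        if node_id not in best or split_impurity < best[node_id][0]:
            best[node_id] = (split_impurity, (j, s))
    return best

# Usage with the (aggregated) mapper output from the sketch above:
node_stats = {'A': [3.0, 5.0, 2]}
split_stats = {('A', 0, 2.0): [2.0, 4.0, 1], ('A', 0, 5.0): [2.0, 4.0, 1]}
print(find_best_split_reduce(node_stats, split_stats))    # -> {'A': (0.0, (0, 2.0))}
```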
Master Node
MapReduce Initialization (run once first)
MapReduce FindBestSplit (run multiple times)
MapReduce InMemoryBuild (run once last)
Task: Grow an entire subtree once the data fits in memory
Mapper:
Initialize by loading the current model file
For each record, identify the node it falls under and, if that node is to be grown, output <NodeId, (xi, yi)>
Reducer:
Initialize by loading the attribute file from the Initialization task
For each <NodeId, List{(xi, yi)}>, run the basic tree growing algorithm on the data records
Output the best splits for each node in the subtree
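A sketch of a "basic tree growing algorithm" such a Reducer might run on its in-memory records (my own minimal version; the stopping rules and the dict-based tree representation are arbitrary choices).

```python
import statistics

def grow_tree(data, max_depth=10, min_size=2):
    """Recursively grow a regression tree on in-memory records [(x_vector, y)]."""
    ys = [y for _, y in data]
    if max_depth == 0 or len(data) < min_size or len(set(ys)) == 1:
        return {'leaf': statistics.mean(ys)}              # leaf predicts the mean y
    best = None
    for j in range(len(data[0][0])):                      # try every attribute Xj ...
        values = sorted({x[j] for x, _ in data})
        for a, b in zip(values, values[1:]):              # ... and every adjacent-value split
            v = (a + b) / 2
            left = [y for x, y in data if x[j] < v]
            right = [y for x, y in data if x[j] >= v]
            gain = (len(ys) * statistics.pvariance(ys)
                    - len(left) * statistics.pvariance(left)
                    - len(right) * statistics.pvariance(right))
            if best is None or gain > best[0]:
                best = (gain, j, v)
    if best is None:                                      # all attribute values identical: make a leaf
        return {'leaf': statistics.mean(ys)}
    _, j, v = best
    return {'split': (j, v),
            'left':  grow_tree([(x, y) for x, y in data if x[j] < v], max_depth - 1, min_size),
            'right': grow_tree([(x, y) for x, y in data if x[j] >= v], max_depth - 1, min_size)}

print(grow_tree([([1.0], 0.4), ([2.0], 0.5), ([7.0], 3.0), ([8.0], 3.2)]))
```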
For each node n, maintain Tn: (node) → {Σy, Σy², Σ1}
For each (split, node) pair, maintain Tn,j,s: (node, attribute, split_value) → {Σy, Σy², Σ1}
Compute the impurity for each candidate split using these two tables
Return the best split to the Master (which then decides on the globally best split)
We need one pass over the data to construct one level of the tree!
MapReduce set-up and tear-down costs:
Reduce tear-down cost by polling for output instead of waiting for a task to return
Reduce start-up cost through forward scheduling
Maintain a set of live MapReduce jobs and assign them tasks instead of starting new jobs from scratch
Bagging:
Learns multiple trees over independent samples of the training data
Predictions from each tree are averaged to compute the final model prediction
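A minimal sketch of bagging as described (my own illustration); build_tree and predict_one stand for whatever tree learner and prediction routine are used, e.g., the grow_tree and predict sketches earlier.

```python
import random
import statistics

def bagging_fit(data, build_tree, num_trees=10):
    """Learn num_trees trees, each on an independent bootstrap sample of the training data."""
    trees = []
    for _ in range(num_trees):
        sample = [random.choice(data) for _ in range(len(data))]  # sample with replacement
        trees.append(build_tree(sample))
    return trees

def bagging_predict(trees, predict_one, x):
    """Average the per-tree predictions to get the final model prediction."""
    return statistics.mean(predict_one(t, x) for t in trees)
```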
SVM vs. Decision trees:

SVM:
Classification
Usually only 2 classes
Real-valued features (no categorical ones)
Tens/hundreds of thousands of features
Very sparse features
Simple decision boundary
No issues with overfitting
Example applications: text classification, spam detection, computer vision

Decision trees:
Classification & regression
Multiple (~10) classes
Real-valued and categorical features
Few (hundreds) of features
Usually dense features
Complicated decision boundaries
Overfitting! Early stopping
Example applications: user profile classification, landing page bounce prediction
References:
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009.
J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009.