ABSTRACT
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

CCS Concepts
•Methodologies → Machine learning; •Information systems → Data mining;

Keywords
Large-scale Machine Learning

1. INTRODUCTION
Machine learning and data-driven approaches are becoming very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies and scalable learning systems that learn the model of interest from large datasets.

Among the machine learning methods used in practice, gradient tree boosting [9]^1 is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [14]. LambdaMART [4], a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click through rate prediction [13]. Finally, it is the de-facto choice of ensemble method and is used in challenges such as the Netflix prize [2].

In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package^2. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions^3 published at Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].

These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm ...

^1 Gradient tree boosting is also known as gradient boosting machine (GBM) or gradient boosted regression tree (GBRT).
^2 https://github.com/dmlc/xgboost
^3 These solutions come from the top-3 teams of each competition.
predict the output.

    \hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},        (1)
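To make Eq. (1) concrete, the following is a minimal sketch (not the XGBoost implementation) of additive prediction over K regression trees. The `Tree` alias and the two toy stumps are illustrative assumptions standing in for learned trees f_k.

from typing import Callable, List, Sequence

# A "tree" here is just any function mapping a feature vector to a leaf score.
Tree = Callable[[Sequence[float]], float]

def ensemble_predict(trees: List[Tree], x: Sequence[float]) -> float:
    """Eq. (1): the prediction is the sum of the K trees' scores for x."""
    return sum(f_k(x) for f_k in trees)

# Example: two tiny stumps standing in for learned trees f_1, f_2.
f1: Tree = lambda x: 0.5 if x[0] > 1.0 else -0.5
f2: Tree = lambda x: 0.2 if x[1] > 0.0 else -0.1
print(ensemble_predict([f1, f2], [2.0, -3.0]))  # 0.5 + (-0.1) = 0.4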
[Figure 3 (plot): test AUC on Higgs 10M vs. Number of Iterations, for exact greedy, global eps=0.3, local eps=0.3, and global eps=0.05.]

Figure 3: Comparison of test AUC convergence on Higgs 10M dataset. The eps parameter corresponds to the accuracy of the approximate sketch. This roughly translates to 1 / eps buckets in the proposal. We find that local proposals require fewer buckets, because they refine split candidates.

[Figure 4 (diagram): a tree with default directions at each split.]

Figure 4: Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing.

... in these two settings, an approximate algorithm needs to be used. An approximate framework for tree boosting is given in Alg. 2. To summarize, the algorithm first proposes candidate splitting points according to percentiles of the feature distribution (specific criteria will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split. The global method requires fewer proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not refined after each split. The local proposal refines the candidates after splits, and can potentially be more appropriate for deeper trees. A comparison of the different algorithms on a Higgs boson dataset is given in Fig. 3. We find that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates.

Most existing approximate algorithms for distributed tree learning also follow this framework. Notably, it is also possible to directly construct approximate histograms of gradient statistics [19]. Our system efficiently supports exact greedy for the single machine setting, as well as approximate algorithms with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs.

3.3 Weighted Quantile Sketch
One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let the multi-set D_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \cdots, (x_{nk}, h_n)\} represent the k-th feature values and second-order gradient statistics of each training instance. We can define a rank function r_k : \mathbb{R} \to [0, +\infty) as

    r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x < z} h,        (8)

which represents the proportion of instances whose feature value k is smaller than z. The goal is to find candidate split points \{s_{k1}, s_{k2}, \cdots, s_{kl}\}, such that

    |r_k(s_{k,j}) - r_k(s_{k,j+1})| < \epsilon, \quad s_{k1} = \min_i x_{ik}, \quad s_{kl} = \max_i x_{ik}.        (9)

Here \epsilon is an approximation factor. Intuitively, this means that there are roughly 1/\epsilon candidate points. Here each data point is weighted by h_i. To see why h_i represents the weight, we can rewrite Eq (3) as

    \sum_{i=1}^{n} \frac{1}{2} h_i (f_t(x_i) - g_i/h_i)^2 + \Omega(f_t) + \mathrm{constant},

which is exactly weighted squared loss with labels g_i/h_i and weights h_i. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weight, an existing algorithm called quantile sketch [12, 21] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate tree boosting algorithms either resorted to sorting on a random subset of the data, which has a chance of failure, or to heuristics that do not have a theoretical guarantee.

To solve this problem, we introduced a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the appendix.

3.4 Sparsity-aware Split Finding
In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, which is shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction.

There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries I_k. The presented algorithm
Algorithm 3: Sparsity-aware Split Finding
  Input: I, instance set of current node
  Input: I_k = {i ∈ I | x_{ik} ≠ missing}

[Figure 7 (four plots): Time per Tree (sec) vs. Number of Threads for the basic and cache-aware algorithms on (a) Allstate 10M, (b) Higgs 10M, (c) Allstate 1M, (d) Higgs 1M.]

Figure 7: Impact of cache-aware prefetching in the exact greedy algorithm. We find that the cache-miss effect impacts the performance on the large datasets (10 million instances). Using cache-aware prefetching improves the performance by a factor of two when the dataset is large.
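As a companion to Alg. 3, here is a minimal single-feature sketch of the sparsity-aware idea: only non-missing entries are scanned, missing entries are routed as a block to one side, and the better of the two default directions is kept. The gain formula below (with an illustrative L2 term `lam`, and without the 1/2 factor or the complexity penalty of the paper's objective) and the threshold semantics are simplified assumptions, not the XGBoost implementation.

from typing import List, Optional, Tuple

def sparsity_aware_best_split(
    xs: List[Optional[float]],  # feature values at this node; None marks a missing entry
    grad: List[float],          # first-order gradient statistics g_i
    hess: List[float],          # second-order gradient statistics h_i
    lam: float = 1.0,           # L2 regularization on leaf weights (illustrative)
) -> Tuple[Optional[float], str, float]:
    """Return (threshold, default_direction, gain) for one feature."""
    def term(g: float, h: float) -> float:
        return g * g / (h + lam)

    G, H = sum(grad), sum(hess)          # statistics of the whole node (incl. missing)
    parent = term(G, H)

    present = sorted((x, g, h) for x, g, h in zip(xs, grad, hess) if x is not None)
    best_gain, best_thr, best_dir = 0.0, None, "left"

    # Default direction = right: missing statistics stay in (G - gL, H - hL).
    gL = hL = 0.0
    for x, g, h in present[:-1]:
        gL, hL = gL + g, hL + h
        gain = term(gL, hL) + term(G - gL, H - hL) - parent
        if gain > best_gain:
            best_gain, best_thr, best_dir = gain, x, "right"

    # Default direction = left: missing statistics stay in (G - gR, H - hR).
    gR = hR = 0.0
    for x, g, h in reversed(present[1:]):
        gR, hR = gR + g, hR + h
        gain = term(G - gR, H - hR) + term(gR, hR) - parent
        if gain > best_gain:
            best_gain, best_thr, best_dir = gain, x, "left"

    return best_thr, best_dir, best_gain

# Missing values (None) follow whichever default direction gives the larger gain.
print(sparsity_aware_best_split(
    xs=[0.1, None, 0.7, None, 1.2],
    grad=[1.0, -2.0, 1.5, -0.5, 2.0],
    hess=[1.0, 1.0, 1.0, 1.0, 1.0],
))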
[Figure 9 (plot), panel (b) Higgs 10M; x-axis: Number of Threads.]

Figure 9: The impact of block size in the approximate algorithm. We find that overly small blocks result in inefficient parallelization, while overly large blocks also slow down training due to cache misses.

Block Sharding  The second technique is to shard the data onto multiple disks in an alternating manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buffer. The training thread then alternately reads the data from each buffer. This helps to increase the throughput of disk reading when multiple disks are available.

5. RELATED WORKS
Our system implements gradient boosting [9], which performs additive optimization in functional space. Gradient tree boosting has been successfully used in classification [11], learning to rank [4], structured prediction [7] as well as other fields. XGBoost incorporates a regularized model to prevent overfitting.

... system that can handle large scale problems with very limited computing resources. We also summarize the comparison between our system and existing ones in Table 1.

Quantile summary (without weights) is a classical problem in the database community [12, 21]. However, the approximate tree boosting algorithm reveals a more general problem – finding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the first method to solve this problem. The weighted quantile summary is also not specific to tree learning and can benefit other applications in data science and machine learning in the future.

6. END TO END EVALUATIONS
In this section, we present end-to-end evaluations of our system in various settings. Dataset and system configurations for all experiments are summarized in Sec. 6.2.

6.1 System Implementation
We implemented XGBoost as an open source package^4. The package is portable and reusable. It is available in popular languages such as python, R, Julia and integrates naturally with language native data science pipelines such as scikit-learn. The distributed version is built on top of the rabit library^5 for allreduce. The portability of XGBoost makes it available in many ecosystems, instead of only being tied to a specific platform. The distributed XGBoost ...

^4 https://github.com/dmlc/xgboost
^5 https://github.com/dmlc/rabit
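As a small illustration of the package interface mentioned above, the sketch below uses the Python package's scikit-learn compatible wrapper. The synthetic data and the specific parameter values are illustrative assumptions, not settings used in the paper, and exact parameter defaults may differ across xgboost versions.

import numpy as np
import xgboost as xgb

# Toy data standing in for a real dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# scikit-learn style wrapper around the tree boosting system.
clf = xgb.XGBClassifier(
    n_estimators=50,       # number of boosting rounds (trees)
    max_depth=4,
    learning_rate=0.3,
    tree_method="approx",  # use the approximate split finding algorithm
)
clf.fit(X, y)
print(clf.predict(X[:5]))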
Table 2: Dataset used in the Experiments

  Dataset       n      m     Task
  Allstate      10M    4227  Insurance claim classification
  Higgs Boson   10M    28    Event classification
  Yahoo LTRC    473K   700   Learning to Rank
  Criteo        1.7B   67    Click through rate prediction

Table 3: Comparison of Exact Greedy Methods with 500 trees on Higgs-1M data

  Method                    Time per Tree (sec)   Test AUC
  XGBoost                   0.6841                0.8304
  XGBoost (colsample=0.5)   0.6401                0.8245
  scikit-learn              28.51                 0.8302
  R.gbm                     1.032                 0.6224

[Figure 11 (plot): time vs. Number of Training Examples (million) for the out-of-core methods, including Compression+shard; the point where the data no longer fits in the system file cache is marked.]

Figure 11: Comparison of out-of-core methods on different subsets of criteo data. The missing data points are due to out of disk space.

[Plot, panel (a): End-to-end time cost including data loading, vs. Number of Training Examples (million); series include XGBoost.]
6.5 Out-of-core Experiment
We also evaluate our system in the out-of-core setting on the criteo data. We conducted the experiment on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). The results are shown in Figure 11. We can find that compression helps to speed up computation by a factor of three, and sharding into two disks further gives a 2x speedup. For this type of experiment, it is important to use a very large dataset to drain the system file cache for a real out-of-core setting. This is indeed our setup. We can observe a transition point when the system runs out of file cache. Note that the transition in the final method is less dramatic. This is due to larger disk throughput and better utilization of computation resources. Our final method is able to process 1.7 billion examples on a single machine.

6.6 Distributed Experiment
Finally, we evaluate the system in the distributed setting. We set up a YARN cluster on EC2 with m3.2xlarge machines, which is a very common choice for clusters. Each machine contains 8 virtual cores, 30GB of RAM and two 80GB SSD local disks. The dataset is stored on AWS S3 instead of HDFS to avoid purchasing persistent storage.

We first compare our system against two production-level distributed systems: Spark MLLib [15] and H2O^10. We use 32 m3.2xlarge machines and test the performance of the systems with various input sizes. Both of the baseline systems are in-memory analytics frameworks that need to store the data in RAM, while XGBoost can switch to the out-of-core setting when it runs out of memory. The results are shown in Fig. 12. We can find that XGBoost runs faster than the baseline systems. More importantly, it is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples with the given limited computing resources. The baseline systems are only able to handle a subset of the data with the given resources.

^10 www.h2o.ai
[Figure 13 (plot): Time per Iteration (sec) vs. Number of Machines (4, 8, 16, 32) on the full criteo data.]

Figure 13: Scaling of XGBoost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear. XGBoost can process the entire dataset using as little as four machines, and scales smoothly by utilizing more available resources.

This experiment shows the advantage of bringing all the system improvements together to solve a real-world scale problem. We also evaluate the scaling property of XGBoost by varying the number of machines. The results are shown in Fig. 13. We find that XGBoost's performance scales linearly as we add more machines. Importantly, XGBoost is able to handle the entire 1.7 billion examples with only four machines. This shows the system's potential to handle even larger data.

7. CONCLUSION
In this paper, we described the lessons we learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems. We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning. Our experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world scale problems using a minimal amount of resources.

Acknowledgments
We would like to thank Tyler B. Johnson, Marco Tulio Ribeiro, and Sameer Singh for their valuable feedback. We also sincerely thank Tong He, Bing Xu, Michael Benesty, Yuan Tang, Hongliang Liu, Qiang Kou, Nan Zhu and all other contributors in the XGBoost community. This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741 and the TerraSwarm Research Center sponsored by MARCO and DARPA.

8. REFERENCES
[1] R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.
[2] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, Aug. 2007.
[3] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
[4] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.
[5] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research - W & CP, 14:1–24, 2011.
[6] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013.
[7] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[9] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[10] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
[11] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[12] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58–66, 2001.
[13] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014.
[14] P. Li. Robust LogitBoost and adaptive base class (ABC) LogitBoost. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302–311, 2010.
[15] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark, May 2015.
[16] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, Aug. 2009.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[18] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.
[19] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.
[20] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09.
[21] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.
[22] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.
APPENDIX
A. WEIGHTED QUANTILE SKETCH
In this section, we introduce the weighted quantile sketch algorithm. Approximate answers to quantile queries are needed for many real-world applications. One classical approach to this problem is the GK algorithm [12] and extensions based on the GK framework [21]. The main component of these algorithms is a data structure called a quantile summary, which is able to answer quantile queries with relative accuracy \epsilon. Two operations are defined for a quantile summary:

• A merge operation that combines two summaries with approximation errors \epsilon_1 and \epsilon_2 and creates a merged summary with approximation error \max(\epsilon_1, \epsilon_2).

• A prune operation that reduces the number of elements in the summary to b + 1 and changes the approximation error from \epsilon to \epsilon + \frac{1}{b}.

A quantile summary with merge and prune operations forms the basic building block of the distributed and streaming quantile computing algorithms [21].

In order to use quantile computation for approximate tree boosting, we need to find quantiles on weighted data. This more general problem is not supported by any of the existing algorithms. In this section, we describe a non-trivial weighted quantile summary structure to solve this problem. Importantly, the new algorithm contains merge and prune operations with the same guarantees as the GK summary. This allows our summary to be plugged into all the frameworks that use the GK summary as a building block and to answer quantile queries over weighted data efficiently.

A.1 Formalization and Definitions
Given an input multi-set D = \{(x_1, w_1), (x_2, w_2), \cdots, (x_n, w_n)\} such that w_i \in [0, +\infty), x_i \in \mathcal{X}. Each x_i corresponds to a position of the point and w_i is the weight of the point. Assume we have a total order < defined on \mathcal{X}. Let us define two rank functions r_D^-, r_D^+ : \mathcal{X} \to [0, +\infty)

    r_D^-(y) = \sum_{(x,w) \in D, x < y} w        (10)

    r_D^+(y) = \sum_{(x,w) \in D, x \le y} w        (11)

We should note that since D is defined to be a multi-set of points, it can contain multiple records with exactly the same position x and weight w. We also define another weight function \omega_D : \mathcal{X} \to [0, +\infty) as

    \omega_D(y) = r_D^+(y) - r_D^-(y) = \sum_{(x,w) \in D, x = y} w.        (12)

Finally, we also define the weight of the multi-set D to be the sum of the weights of all the points in the set

    \omega(D) = \sum_{(x,w) \in D} w        (13)

Our task is, given a series of inputs D, to estimate r^+(y) and r^-(y) for y \in \mathcal{X}, as well as finding points with a specific rank. Given these notations, we define the quantile summary of weighted examples as follows:

Definition A.1. Quantile Summary of Weighted Data
A quantile summary for D is defined to be a tuple Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D), where S = \{x_1, x_2, \cdots, x_k\} is selected from the points in D (i.e. x_i \in \{x | (x, w) \in D\}) with the following properties:

1) x_i < x_{i+1} for all i, and x_1 and x_k are the minimum and maximum points in D:

    x_1 = \min_{(x,w) \in D} x, \quad x_k = \max_{(x,w) \in D} x

2) \tilde{r}_D^+, \tilde{r}_D^- and \tilde{\omega}_D are functions in S \to [0, +\infty) that satisfy

    \tilde{r}_D^-(x_i) \le r_D^-(x_i), \quad \tilde{r}_D^+(x_i) \ge r_D^+(x_i), \quad \tilde{\omega}_D(x_i) \le \omega_D(x_i),        (14)

with the equality sign holding for the maximum and minimum points (\tilde{r}_D^-(x_i) = r_D^-(x_i), \tilde{r}_D^+(x_i) = r_D^+(x_i) and \tilde{\omega}_D(x_i) = \omega_D(x_i) for i \in \{1, k\}). Finally, the function values must also satisfy the following constraints

    \tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i) \le \tilde{r}_D^-(x_{i+1}), \quad \tilde{r}_D^+(x_i) \le \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1})        (15)

Since these functions are only defined on S, it suffices to use 4k records to store the summary. Specifically, we need to remember each x_i and the corresponding function values at each x_i.

Definition A.2. Extension of Function Domains
Given a quantile summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D) defined in Definition A.1, the domain of \tilde{r}_D^+, \tilde{r}_D^- and \tilde{\omega}_D was defined only on S. We extend the definition of these functions to \mathcal{X} \to [0, +\infty) as follows.

When y < x_1:

    \tilde{r}_D^-(y) = 0, \quad \tilde{r}_D^+(y) = 0, \quad \tilde{\omega}_D(y) = 0        (16)

When y > x_k:

    \tilde{r}_D^-(y) = \tilde{r}_D^+(x_k), \quad \tilde{r}_D^+(y) = \tilde{r}_D^+(x_k), \quad \tilde{\omega}_D(y) = 0        (17)

When y \in (x_i, x_{i+1}) for some i:

    \tilde{r}_D^-(y) = \tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i),
    \tilde{r}_D^+(y) = \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1}),        (18)
    \tilde{\omega}_D(y) = 0

Lemma A.1. Extended Constraint
The extended definition of \tilde{r}_D^-, \tilde{r}_D^+, \tilde{\omega}_D satisfies the following constraints

    \tilde{r}_D^-(y) \le r_D^-(y), \quad \tilde{r}_D^+(y) \ge r_D^+(y), \quad \tilde{\omega}_D(y) \le \omega_D(y)        (19)

    \tilde{r}_D^-(y) + \tilde{\omega}_D(y) \le \tilde{r}_D^-(x), \quad \tilde{r}_D^+(y) \le \tilde{r}_D^+(x) - \tilde{\omega}_D(x), \quad \text{for all } y < x        (20)

Proof. The only non-trivial part is to prove the case when y \in (x_i, x_{i+1}):

    \tilde{r}_D^-(y) = \tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i) \le r_D^-(x_i) + \omega_D(x_i) \le r_D^-(y)
    \tilde{r}_D^+(y) = \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1}) \ge r_D^+(x_{i+1}) - \omega_D(x_{i+1}) \ge r_D^+(y)

This proves Eq. (19). Furthermore, we can verify that

    \tilde{r}_D^+(x_i) \le \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1}) = \tilde{r}_D^+(y) - \tilde{\omega}_D(y)
    \tilde{r}_D^-(y) + \tilde{\omega}_D(y) = \tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i) + 0 \le \tilde{r}_D^-(x_{i+1})
    \tilde{r}_D^+(y) = \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1})

Using these facts and the transitivity of the < relation, we can prove Eq. (20).

We should note that the extension is based on the ground case defined on S, and we do not require extra space to store the summary in order to use the extended definition. We are now ready to introduce the definition of the \epsilon-approximate quantile summary.

Definition A.3. \epsilon-Approximate Quantile Summary
Given a quantile summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D), we call it an \epsilon-approximate summary if for any y \in \mathcal{X}

    \tilde{r}_D^+(y) - \tilde{r}_D^-(y) - \tilde{\omega}_D(y) \le \epsilon \, \omega(D)        (21)

We use this definition since we know that r^-(y) \in [\tilde{r}_D^-(y), \tilde{r}_D^+(y) - \tilde{\omega}_D(y)] and r^+(y) \in [\tilde{r}_D^-(y) + \tilde{\omega}_D(y), \tilde{r}_D^+(y)]. Eq. (21) means we can get estimates of r^+(y) and r^-(y) with error of at most \epsilon \, \omega(D).
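To make the rank definitions above concrete, here is a small sketch of r_D^-, r_D^+ and \omega_D from Eqs. (10)-(12) on an explicit weighted multi-set, together with a naive (exact, non-sketch) way of picking candidate points whose normalized ranks are roughly eps apart, in the spirit of Eqs. (8)-(9) in Sec. 3.3. It only illustrates the definitions and is not the summary data structure defined in this appendix.

from typing import List, Tuple

Weighted = List[Tuple[float, float]]  # multi-set D of (position x, weight w)

def r_minus(D: Weighted, y: float) -> float:
    """Eq. (10): total weight of points strictly below y."""
    return sum(w for x, w in D if x < y)

def r_plus(D: Weighted, y: float) -> float:
    """Eq. (11): total weight of points at or below y."""
    return sum(w for x, w in D if x <= y)

def omega(D: Weighted, y: float) -> float:
    """Eq. (12): total weight exactly at y."""
    return r_plus(D, y) - r_minus(D, y)

def naive_candidates(D: Weighted, eps: float) -> List[float]:
    """Pick points whose normalized ranks are roughly eps apart (Eq. (9) spirit)."""
    total = sum(w for _, w in D)
    xs = sorted({x for x, _ in D})
    picked, last_rank = [xs[0]], r_minus(D, xs[0]) / total
    for x in xs[1:-1]:
        rank = r_minus(D, x) / total
        if rank - last_rank >= eps:
            picked.append(x)
            last_rank = rank
    picked.append(xs[-1])
    return picked

# Tiny example: weights play the role of the hessians h_i in Sec. 3.3.
D = [(0.1, 1.0), (0.4, 2.0), (0.4, 1.0), (0.9, 0.5), (1.5, 0.5)]
print(r_minus(D, 0.4), r_plus(D, 0.4), omega(D, 0.4))  # 1.0 4.0 3.0
print(naive_candidates(D, 0.3))                        # [0.1, 0.9, 1.5]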
Lemma A.2. Quantile summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D) is an \epsilon-approximate summary if and only if the following two conditions hold:

    \tilde{r}_D^+(x_i) - \tilde{r}_D^-(x_i) - \tilde{\omega}_D(x_i) \le \epsilon \, \omega(D)        (22)

    \tilde{r}_D^+(x_{i+1}) - \tilde{r}_D^-(x_i) - \tilde{\omega}_D(x_{i+1}) - \tilde{\omega}_D(x_i) \le \epsilon \, \omega(D)        (23)

Proof. The key is again to consider y \in (x_i, x_{i+1}):

    \tilde{r}_D^+(y) - \tilde{r}_D^-(y) - \tilde{\omega}_D(y) = [\tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1})] - [\tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i)] - 0

This means that the condition in Eq. (23), together with Eq. (22), gives us Eq. (21).

Property of Extended Function. In this section, we have introduced the extension of the functions \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D to \mathcal{X} \to [0, +\infty). The key theme discussed in this section is the relation between constraints on the original functions and constraints on the extended functions. Lemma A.1 and A.2 show that the constraints on the original functions lead to more general constraints on the extended functions. This is a very useful property which will be used in the proofs in later sections.

A.2 Construction of Initial Summary
Given a small multi-set D = \{(x_1, w_1), (x_2, w_2), \cdots, (x_n, w_n)\}, we can construct an initial summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D), with S set to the set of all values in D (S = \{x | (x, w) \in D\}), and \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D defined to be

    \tilde{r}_D^+(x) = r_D^+(x), \quad \tilde{r}_D^-(x) = r_D^-(x), \quad \tilde{\omega}_D(x) = \omega_D(x) \quad \text{for } x \in S        (24)

The constructed summary is a 0-approximate summary, since it can answer all queries accurately. The constructed summary can be fed into the future operations described in the later sections.

A.3 Merge Operation
In this section, we define how we can merge two summaries together. Assume we have Q(D_1) = (S_1, \tilde{r}_{D_1}^+, \tilde{r}_{D_1}^-, \tilde{\omega}_{D_1}) and Q(D_2) = (S_2, \tilde{r}_{D_2}^+, \tilde{r}_{D_2}^-, \tilde{\omega}_{D_2}), quantile summaries of two datasets D_1 and D_2. Let D = D_1 \cup D_2, and define the merged summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D) as follows.

    S = \{x_1, x_2, \cdots, x_k\}, \quad x_i \in S_1 \text{ or } x_i \in S_2        (25)

The points in S are a combination of the points in S_1 and S_2. The functions \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D are defined to be

    \tilde{r}_D^-(x_i) = \tilde{r}_{D_1}^-(x_i) + \tilde{r}_{D_2}^-(x_i)        (26)

Algorithm 4: Query Function g(Q, d)
  Input: d: 0 \le d \le \omega(D)
  Input: Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D) where S = \{x_1, x_2, \cdots, x_k\}
  if d < \frac{1}{2}[\tilde{r}_D^-(x_1) + \tilde{r}_D^+(x_1)] then return x_1
  if d \ge \frac{1}{2}[\tilde{r}_D^-(x_k) + \tilde{r}_D^+(x_k)] then return x_k
  Find i such that \frac{1}{2}[\tilde{r}_D^-(x_i) + \tilde{r}_D^+(x_i)] \le d < \frac{1}{2}[\tilde{r}_D^-(x_{i+1}) + \tilde{r}_D^+(x_{i+1})]
  if 2d < \tilde{r}_D^-(x_i) + \tilde{\omega}_D(x_i) + \tilde{r}_D^+(x_{i+1}) - \tilde{\omega}_D(x_{i+1}) then
    return x_i
  else
    return x_{i+1}
  end

This can be obtained by a straightforward application of Definition A.2.

Theorem A.1. If Q(D_1) is an \epsilon_1-approximate summary and Q(D_2) is an \epsilon_2-approximate summary, then the merged summary Q(D) is a \max(\epsilon_1, \epsilon_2)-approximate summary.

Proof. For any y \in \mathcal{X}, we have

    \tilde{r}_D^+(y) - \tilde{r}_D^-(y) - \tilde{\omega}_D(y)
      = [\tilde{r}_{D_1}^+(y) + \tilde{r}_{D_2}^+(y)] - [\tilde{r}_{D_1}^-(y) + \tilde{r}_{D_2}^-(y)] - [\tilde{\omega}_{D_1}(y) + \tilde{\omega}_{D_2}(y)]
      \le \epsilon_1 \, \omega(D_1) + \epsilon_2 \, \omega(D_2) \le \max(\epsilon_1, \epsilon_2) \, \omega(D_1 \cup D_2)

Here the first inequality is due to Lemma A.3.

A.4 Prune Operation
Before we start discussing the prune operation, we first introduce a query function g(Q, d). The definition of the function is shown in Algorithm 4. For a given rank d, the function returns an x whose rank is close to d. This property is formally described in the following Lemma.

Lemma A.4. For a given \epsilon-approximate summary Q(D) = (S, \tilde{r}_D^+, \tilde{r}_D^-, \tilde{\omega}_D), x^* = g(Q, d) satisfies the following property:

    d \ge \tilde{r}_D^+(x^*) - \tilde{\omega}_D(x^*) - \frac{\epsilon}{2} \omega(D)
                                                                            (33)
    d \le \tilde{r}_D^-(x^*) + \tilde{\omega}_D(x^*) + \frac{\epsilon}{2} \omega(D)