
Big Data: Characteristics, Methods and Technology

Gede Karya*, Benhard Sitohang, Saiful Akbar, Veronica S. Moertini*

School of Electrical Engineering and Informatics, Bandung Institute of Technology
*Faculty of Information Technology and Science, Parahyangan Catholic University
Bandung, Indonesia
gkarya@unpar.ac.id, benhard@stei.itb.ac.id, saiful@informatika.org, moertini@unpar.ac.id

Abstract— At this time the big data phenomenon has become one of the most interesting trends. Big data is defined as an information asset that has high volume, velocity and variety (3v) characteristics, requiring specific technology and analysis methods to transform it into value. The velocity aspect is a major contributor to volume growth and processing complexity. Therefore, reducing velocity (in the context of data streams) contributes to reducing the volume and the complexity of the analysis in generating value. The challenge is how to reduce velocity without significantly reducing the meaning of the big data. This paper is our preliminary study for research in the field of big data velocity reduction. It reviews the literature on big data, starting from the definition, characteristics and methods of analysis, up to the technology that exists today. It also discusses data stream clustering analysis as one of the velocity reduction candidates. The description ends with a study of current big data research positions and future challenges, as well as the research conducted by the authors over the past 3 years.

Keywords— big data, velocity reduction, data stream clustering

I. INTRODUCTION

At this time the big data phenomenon has become one of the most interesting trends [1]. Big data has high volume, velocity and variety (3v) characteristics. The velocity aspect contributes greatly to the problem of increasing volume. Big data processing is basically aimed at producing value in the form of knowledge or patterns (knowledge discovery) that can be used to support decision making. Processing data at high velocity requires real-time processing, which in turn requires expensive methods and technologies. If we can reduce velocity to a certain level, the result is big data with reduced volume that can be processed in batches, requiring cheaper methods and technologies. The challenge is how to reduce velocity without significantly reducing the meaning of the big data. In addition, the data reduction results are expected to be closer to the final knowledge, so that further analysis can be run with more conventional methods and technologies, over a longer time span. Velocity reduction of a data stream can be done by sampling, filtering and clustering techniques. The clustering technique, however, has its own challenges related to the constraints of data stream mining and is therefore discussed separately in this paper.

This paper is a preliminary study for our research on the reduction of big data streams. It discusses the literature on big data, starting from the definition and characteristics, through the analytical methods used, to the technology used today. It also discusses methods of data stream analysis, especially data stream clustering. The discussion concludes with a study of current big data research positions and the challenges ahead, as well as the research done by the authors over the last 3 years.

II. BIG DATA DEFINITION AND CHARACTERISTICS

A. Definition
From the various opinions gathered in the study [2], big data is defined as an information asset that has 3v characteristics (high volume, velocity and variety), requiring certain methods and technologies to process it into knowledge that is valuable for decision making.

The data-information-knowledge-wisdom (DIKW) hierarchy [3] provides a plausible reason why the big data phenomenon is growing so rapidly (Figure 1). Given the huge potential of data, there is also a huge potential of information available to be transformed into knowledge that optimizes decision making (wisdom).

Figure 1. The Wisdom Hierarchy [3]

So, what is expected of big data is information assets that are processed into knowledge that can generate value for the organization.

B. Characteristics
The big data characteristics according to [4] are 3v: high volume, velocity and variety. In general, the size limits that count as high in the context of big data follow Moore's law [5]. The current characteristics of big data are depicted in Figure 2.

Figure 2. Big Data Characteristics

Big data has high volume characteristics, from terabytes to zettabytes. Consequently, it requires data storage and processing capacity that cannot be handled by conventional methods and technologies. Currently applied methods and



techniques lead to parallel processing in a distributed system environment, both in the storage media and in the processing. Further discussion of big data storage and processing technology is found in Section IV.

The velocity characteristic of big data changes the angle of data processing from batch processing into dynamic data processing, where data is no longer seen statically, but dynamically as a stream. In addition, big data is also associated with the movement of large amounts of data (high volume movement), such as spatial data, images and others.

Big data is sourced from various events. All of our activities that use computers, gadgets, sensors and other equipment produce big data. In addition to the wide variety of sources, the structures are also diverse, ranging from structured, such as transaction data (money market, e-commerce, etc.), through semi-structured, to unstructured, such as images, text opinions on social media and web pages on the internet. Methods and technology are therefore needed to integrate big data from various sources and different formats.

III. BIG DATA ANALYSIS METHOD

The purpose of big data analysis is to extract knowledge in the form of patterns (concepts) that are useful in decision making (knowledge discovery). Big data analysis is basically a process of data mining (DM). This section discusses data mining, data and knowledge, data stream mining and data stream clustering.
A. Data Mining
Data mining is the process of extracting or mining knowledge from large data sets. The knowledge takes the form of interesting patterns hidden in the large data. There are several terms that have the same meaning as DM, such as: knowledge discovery from data (KDD), knowledge extraction, data/pattern analysis, data archaeology and data dredging [6]. As a process of knowledge discovery, DM includes data pre-processing (cleaning, integration, reduction, selection, transformation) [7], pattern discovery, pattern evaluation and knowledge representation.

There are various types of data that can be mined, which can be grouped into four: (1) databases, data warehouses and transaction data, which are seen mostly as static data (at rest); (2) data streams, continuously flowing data that is very dynamic and time sensitive (in flow); (3) graph or network data; and (4) unstructured data formats, such as spatial data, text, multimedia and the web.

DM functionality is used to find various kinds of patterns in DM activity. There are five DM functionalities: (1) characterization and discrimination; (2) mining of frequent patterns, associations and correlations; (3) classification and regression for predictive analysis; (4) clustering analysis; and (5) outlier analysis (anomaly mining). DM activity can generally be grouped into descriptive (characterizing the properties of the target data set) and predictive (induction on the processed data for forecasting).
The patterns generated by DM activity are expressed as knowledge if they are interesting. A pattern can be declared interesting if it is: (1) easily understood by humans; (2) valid on new/test data at a certain level of certainty; (3) potentially useful; and (4) novel. The degree of interestingness of a pattern against a threshold is determined on the basis of the structure of the discovered patterns and the underlying statistics. For example, the objective measures for association rules are support (the percentage of the data that satisfies the rule) and confidence (the level of certainty of the pattern). In classification there are the measures accuracy (the percentage of the data classified correctly) and coverage (similar to support for association rules). There are also subjective measures (user trust in the data), such as: expected (confirming the user's hypothesis), unexpected (contrary to the user's belief) and actionable. The ability to produce all the interesting patterns (completeness) and only the patterns of interest (efficiency) is a quality criterion of a DM system.
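To make the objective measures concrete, the following minimal Python sketch (our illustration with a toy transaction set, not code from any of the surveyed systems) computes the support and confidence of a single association rule:

```python
# Toy illustration of the objective measures support and confidence
# for an association rule A -> B over a small transaction set.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A u B) / support(A): the certainty that A implies B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.67
```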
To achieve its objectives, DM adopts techniques from various domains such as statistics, machine learning (ML), pattern recognition, databases, data warehousing, information retrieval (IR), visualization, algorithms, high performance computing and more.

B. Data and Knowledge
Data is a representation of facts. A data set is a collection of data objects, and a data object represents an entity. A data object is described by attributes. An attribute (in databases) is a data field representing a characteristic or feature of a data object. Terms with the same meaning as attribute are dimension (in data warehousing), feature (in ML) and variable (in statistics). A set of attributes used to describe an object is called an attribute vector or feature vector. The type of an attribute is determined by the set of its possible values: nominal, binary, ordinal or numeric.

Knowledge is an interesting pattern that has novelty value or usefulness for the user. Knowledge can be represented in various ways, such as: tables, predicate calculus, direct representation, production rules, frames, scripts, semantic networks and the semantic web [8], [9]. The choice of knowledge representation is associated with how easily it can be stored and how easily inference and reasoning processes can be conducted on it.

C. Data Stream Mining
A data stream (DS) is defined as a sequence of data objects or samples, DS = {x1, x2, ..., xt, ...}, where xi is the i-th data object that arrives. A DS has intrinsic characteristics such as potentially unlimited volume, chronological order and dynamic change. For example, Google processes millions of searches daily, each attached with a time stamp, and these searches change according to different hot topics at different times.

DS mining has several constraints: (1) single pass: the data can only be accessed once and cannot be backtracked; (2) real-time response, related to the response needed for decision making, such as in the stock market, which requires speed in decision making; (3) bounded memory: because of the high speed and potentially unlimited volume of the stream, computing a synopsis with approximate results is acceptable; and (4) concept drift: a condition in which the discovered patterns change over time [10].
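As a minimal illustration of the single-pass and bounded-memory constraints (our sketch, not an algorithm from [10]), the following synopsis keeps only three counters yet can report the mean and variance of an unbounded numeric stream:

```python
# Single-pass, bounded-memory synopsis of a numeric stream:
# only three counters are kept, regardless of the stream length.
class StreamSynopsis:
    def __init__(self):
        self.n = 0    # number of objects seen
        self.s = 0.0  # running sum
        self.ss = 0.0 # running sum of squares

    def update(self, x):
        """Each data object is seen exactly once (single pass)."""
        self.n += 1
        self.s += x
        self.ss += x * x

    def mean(self):
        return self.s / self.n

    def variance(self):
        m = self.mean()
        return self.ss / self.n - m * m

syn = StreamSynopsis()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    syn.update(x)
print(syn.mean(), syn.variance())  # 5.0 4.0
```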
Figure 3 shows the general model of DS mining. The DS to be processed is first stored in a buffer and then processed per time window using a DM algorithm. The most popular DM algorithms used in DS mining are clustering and classification.

Figure 3. Model of DS Mining [10]
The DS algorithm maintains the synopsis of the DS using a time window with a particular computational approach: incremental learning or two-phase learning. In the incremental learning approach (Figure 4.a), the model gradually evolves to conform to the changes in the incoming data. There are two schemes to update the model, i.e., with data instances and with windows. In the two-phase learning approach (also known as online-offline learning), the mining process is split into two stages: in the online phase the synopsis of the data is updated in real time, and in the offline phase the mining process is run on the stored synopsis every time the user submits a request. An illustration of two-phase learning can be seen in Figure 4.b.

Figure 4. DS Clustering Computing Approaches [10]

The time window over the data objects is defined as W[i, j] = (xi, xi+1, ..., xj), where i < j. There are different types of time windows: the landmark window, the sliding window, the fading window and the tilted time window.

In the landmark window, we are interested in the overall flow of data from the starting instant 1 to the current instant tc, so the window is W[1, tc] [11], [12], [13], [14], [15], [16]. Using a landmark window, all transactions in the window are equally important; there is no difference between past and present data. However, as the DS continues to grow, models built with old data objects may no longer match the new ones. To emphasize the latest data, it may be necessary to apply variations such as the sliding window, the tilted time window or the fading window.
In the sliding window variant, W[tc - w + 1, tc], we are only interested in the last w transactions; the others are eliminated. The mining results depend on the window size w. If w is too large and there is concept drift, the window may contain outdated information and the model's accuracy decreases. If w is too small, the window may have too little data and the model no longer fits large variations. Some experiments use fixed, user-defined sliding window sizes or experimentally determined values. Recently, flexible sliding windows have been proposed, in which the window size changes according to the accuracy of the model: when the accuracy is high the window extends, and when the accuracy is low the window shrinks.
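A fixed-size sliding window W[tc - w + 1, tc] can be sketched with a bounded deque (our simplification; the flexible, accuracy-driven resizing described above is omitted):

```python
from collections import deque

# Fixed-size sliding window: only the last w data objects are kept;
# older objects fall out automatically (they are "eliminated").
w = 3
window = deque(maxlen=w)

for t, x in enumerate([10, 11, 9, 42, 40, 41], start=1):
    window.append(x)
    # Any mining step would run on the current window contents only.
    print(f"t={t} window={list(window)}")
# After t=6 the window holds [42, 40, 41]; the old pre-drift values are gone.
```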
In the fading window variant, each data object is assigned a different weight according to its arrival time, so that a new transaction gets a higher weight than an old one. Using the fading window can reduce the effect of old and outdated transactions on the mining results. The exponential decay function f(Δt) = λ^Δt (0 < λ < 1) is usually used in the fading window. In this function, Δt is the age of a data object, i.e., the time difference between the current time and the time of arrival. The fading window requires selecting an appropriate fading parameter λ, which is usually set in [0.99, 1] in real applications.
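The decay f(Δt) = λ^Δt can be illustrated as follows (our sketch; the arrival times and λ = 0.99 are arbitrary example values):

```python
# Fading-window weights: each object's weight decays exponentially
# with its age dt = current_time - arrival_time, i.e. f(dt) = lam ** dt.
lam = 0.99  # fading parameter, typically chosen in [0.99, 1)

def fading_weight(arrival_time, current_time, lam=lam):
    dt = current_time - arrival_time  # age of the data object
    return lam ** dt

# Recent objects count almost fully; old ones fade toward zero
# instead of being dropped abruptly as in a sliding window.
arrivals = [0, 50, 100, 450, 500]
now = 500
weights = [fading_weight(t, now) for t in arrivals]
print([round(wt, 3) for wt in weights])  # [0.007, 0.011, 0.018, 0.605, 1.0]
```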
The tilted time window variant lies between the fading window and the sliding window. It implements different levels of detail according to the arrival time of the data: we are more interested in recent data at a fine scale than in long-term data from the past at a coarse scale. The tilted time window roughly stores the entire data set and provides a good trade-off between storage requirements and accuracy. However, the model may become unstable after running for a long time. For example, the tree structure in FP-Stream becomes very large as time goes by, and the process of updating and scanning the tree can degrade its performance.
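The idea of keeping recent data at a fine scale and older data at exponentially coarser scales can be sketched with a simple logarithmic tilted scheme (our simplification; actual structures such as the one in FP-Stream are considerably more involved):

```python
# Logarithmic tilted time window: frame 0 holds the last time unit,
# frame 1 the 2 units before that, frame 2 the next 4 units, and so on.
# Older history is thus summarized at exponentially coarser scales.
def tilted_frame(age):
    """Map the age of a data object to a frame index."""
    frame, span, start = 0, 1, 0
    while age >= start + span:
        start += span
        span *= 2
        frame += 1
    return frame

for age in [0, 1, 2, 3, 6, 7, 100]:
    print(age, tilted_frame(age))
# age 0 -> frame 0, ages 1-2 -> frame 1, 3-6 -> frame 2, 7-14 -> frame 3, ...
```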
D. Data Stream Clustering
Clustering or data segmentation is the process of grouping objects into different sets called clusters. The purpose of clustering is that the data in the same cluster are similar to each other and different from the data in other clusters. Several clustering techniques are reviewed in [17].

The methods applied to DS clustering can be grouped into five: partitioning, hierarchical, density-based, grid-based and model-based. Furthermore, it is necessary to measure the distance between clusters to decide on merges during the clustering process. In general there are four types of distance measures: minimum distance (single-linkage), maximum distance (complete-linkage), mean distance and average distance. The maximum and average distances are rarely used because they require expensive computation; the more popular measures are the minimum and mean distances. The minimum distance is used in the D-Stream [18], DenStream [19] and MR-Stream [20] algorithms, while the mean distance is used in STREAM [15] and CluStream [21].

Figure 5 describes the data stream clustering algorithms grouped on the basis of the conventional algorithms from which they were developed, the time windows and the computational approaches used.

Figure 5. Data Stream Clustering Algorithms [10]

The partitioning algorithm classifies the data set into k clusters, where k is a predefined parameter. Iterations reassign objects from one group to another to minimize an objective function. In traditional clustering, the most popular methods are k-means and k-medians. The partitioning clustering algorithm for DS is STREAM, which applies the LSEARCH technique to apply k-medians, and uses a landmark window with an incremental learning approach.
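The flavor of incremental partitioning can be sketched with a running-mean center update (a toy sketch under the assumption of a fixed k and one-dimensional data; this is not the actual STREAM/LSEARCH procedure):

```python
# Incremental k-center update: each arriving object is assigned to the
# nearest center, which is then nudged toward it (running mean update).
centers = [0.0, 10.0]  # k = 2 predefined centers
counts = [1, 1]        # objects absorbed per center

def assign_and_update(x):
    j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    counts[j] += 1
    centers[j] += (x - centers[j]) / counts[j]  # incremental mean
    return j

for x in [1.0, 2.0, 9.0, 11.0, 1.5]:
    assign_and_update(x)
print(centers)  # the centers drift toward the two data concentrations
```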
The hierarchical method aims to group the data objects into a hierarchical tree of clusters. Hierarchical methods can be further grouped as agglomerative or divisive, in which the hierarchical decomposition is formed in bottom-up (merging) or top-down (splitting) mode, respectively. Some traditional hierarchical algorithms are BIRCH [22], CURE [23], ROCK and CHAMELEON [24]. The hierarchical clustering algorithms for DS are: (1) CluStream [21], which develops BIRCH with a two-phase learning approach and a tilted time window; CluStream uses micro-clusters, defined as the temporal extension of BIRCH's clustering feature vector, to capture summary information about the DS; (2) HPStream, which improves CluStream to handle high-dimensional DS; (3) SWClustering, which fixes the degradation of CluStream after running for a long time by using the temporal cluster feature (TCF); (4) E-Stream [25], which also improves CluStream to handle concept drift (cluster evolution) using a fading window and cluster histograms; (5) ClusTree [26], which indexes micro-clusters with the R*-tree structure to adapt automatically to the DS speed; and (6) REPSTREAM, which develops CHAMELEON using incremental learning and fading window approaches.
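The micro-cluster synopsis can be sketched as a BIRCH-style clustering feature extended with temporal sums (a simplified sketch in the spirit of CluStream, not the published definition):

```python
# Simplified micro-cluster: BIRCH-style clustering feature (N, LS, SS)
# plus temporal sums of the arrival time stamps, additively updatable.
class MicroCluster:
    def __init__(self, dim):
        self.n = 0             # number of absorbed objects
        self.ls = [0.0] * dim  # linear sum per dimension
        self.ss = [0.0] * dim  # squared sum per dimension
        self.lst = 0.0         # linear sum of time stamps
        self.sst = 0.0         # squared sum of time stamps

    def absorb(self, x, t):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.lst += t
        self.sst += t * t

    def centroid(self):
        return [s / self.n for s in self.ls]

mc = MicroCluster(dim=2)
mc.absorb([1.0, 2.0], t=1)
mc.absorb([3.0, 4.0], t=2)
print(mc.centroid())  # [2.0, 3.0]
```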
The density-based method builds a data density profile for clustering purposes. Clusters are considered as dense areas of objects separated by sparse regions of low density in the data space. Density-based clustering is able to find arbitrarily shaped clusters and does not require a predetermined number of clusters. DBSCAN, OPTICS and PreDeCon are some of the best known traditional density-based clustering methods. For DS there are several methods: (1) DenStream, which develops the DBSCAN algorithm by applying two-phase learning and a fading window, and uses micro-clusters to capture the DS synopsis; (2) OPTICS-Stream, a development of the OPTICS algorithm using two-phase learning, and using micro-clusters and a fading model to build the synopsis; and (3) incPreDeCon, a development of PreDeCon and DenStream using incremental learning and landmark window approaches.
The grid-based clustering method quantizes the data space into a multi-resolution grid structure. The grid structure contains many cells, each covering a subspace and storing summary information about the data objects in that subspace. Clusters are then determined by dense regions of neighboring dense cells. Some popular traditional grid-based clustering algorithms are DENCLUE [27] with a fixed-size grid, STING [28] with a multi-resolution grid, and WaveCluster [29] with a wavelet transform method that makes clusters stand out more in the transformed space. Specifically for DS there are several algorithms, among others: (1) D-Stream, a development of DENCLUE using two-phase learning and a fading window; (2) MR-Stream, a development of D-Stream and STING that applies a fading window; and (3) CellTree, a grid-based algorithm that evolved into Cell*Tree [30], which uses a B+Tree to store the DS synopsis and applies a fading window to emphasize the latest changes of information in the DS in the clusters.
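The grid synopsis can be sketched as a map from cells to decayed densities, loosely in the style of D-Stream's decayed density (a simplified sketch; λ and the cell size are arbitrary example values):

```python
# Decayed grid-cell densities: each object falls into a cell, whose
# density is first decayed by lam ** (time since last update), then
# incremented. Dense neighboring cells would then form clusters.
lam, cell_size = 0.998, 1.0
cells = {}  # cell index -> (last_update_time, density)

def add_point(x, y, t):
    key = (int(x // cell_size), int(y // cell_size))
    last_t, d = cells.get(key, (t, 0.0))
    cells[key] = (t, d * lam ** (t - last_t) + 1.0)

for t, (x, y) in enumerate([(0.2, 0.3), (0.4, 0.1), (5.5, 5.5), (0.3, 0.2)]):
    add_point(x, y, t)
print(cells)  # the cell around the origin is denser than the outlier's
```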
The model-based clustering method tries to optimize the fit between the data and some statistical model. For traditional model-based clustering, the Expectation-Maximization (EM) algorithm is a soft-clustering method, and the Self-Organizing Map (SOM) [31] is a popular artificial neural network method for clustering. As a DS algorithm, SWEM was developed from EM using a fading window and incremental learning. There are two important extensions of SOM for DS, namely the Growing Self-Organizing Map (GSOM) and the Cellular Probabilistic Self-Organizing Map (CPSOM). In GSOM, there is no need to specify the initial size of the output map; it dynamically grows nodes at the boundary of the map whenever the accumulated error exceeds a threshold. CPSOM is an online algorithm and is suitable for large data sets; it uses a fading window to reduce the weight of the neuron states. GCPSOM is a hybrid algorithm that combines the advantages of both GSOM and CPSOM: it dynamically grows feature maps to cluster the DS and tracks the clusters as they evolve.

IV. BIG DATA TECHNOLOGY

In Section II it was noted that big data requires technology to handle the 3v characteristics and to transform it into value/knowledge. This section discusses the technology used to process big data, beginning with the history and basic architecture of Hadoop and the Hadoop Ecosystem, followed by a discussion of the results of big data technology surveys based on the studies [32] and [33].
The main requirement for using big data is adequate technology, i.e., technology for processing data of large size at a high rate of speed. It therefore requires computing and storage capacities that cannot be provided by conventional information technology systems.

A. Hadoop and Hadoop Ecosystem
Hadoop is an open-source framework designed specifically to handle big data. Hadoop was initially developed based on designs published by Google [35] and later became a stand-alone Apache project. The main components of Hadoop are distributed storage (the Hadoop Distributed File System, HDFS) and a distributed computing framework (MapReduce). HDFS is designed to store large data sets reliably and to stream them at high bandwidth to user applications. On a large cluster, thousands of servers both handle the data storage directly and execute user application tasks at the locations where the data sets are stored. By distributing storage and computing across many servers, the resources can grow with demand while remaining economical at every size. In 2010, the experience of using HDFS to manage 25 petabytes of data at Yahoo was reported [36]. MapReduce is a programming model for processing and generating large data sets: computations are specified as map and reduce functions, and the framework runs them in parallel on large-scale machine clusters. Google used it to process over 20 petabytes of data per day as early as 2008 [37]. The limits of data processing in the context of big data follow Moore's law.
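As a toy illustration of the programming model (our sketch simulates the map, shuffle and reduce phases in a single process; this is not Hadoop's actual API):

```python
from collections import defaultdict

# Word count expressed as map and reduce functions; the framework
# would normally shuffle the intermediate pairs across a cluster.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

lines = ["big data big value", "data stream"]
groups = defaultdict(list)
for line in lines:                    # map phase
    for word, one in map_fn(line):
        groups[word].append(one)      # simulated shuffle / group-by-key
result = dict(reduce_fn(w, c) for w, c in groups.items())  # reduce phase
print(result)  # {'big': 2, 'data': 2, 'value': 1, 'stream': 1}
```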
On top of the Hadoop framework, many big data processing applications have been developed, both as extensions for specific applications and as responses to the many limitations of MapReduce (which is batch oriented) in handling iterated graph processing and data streams (which require real-time or near-real-time processing). These applications are referred to as the Hadoop Ecosystem.

Based on the survey conducted in the study [32], big data processing platforms can be grouped as in Figure 6.

Figure 6. Big Data Processing Platforms [32]

In Figure 6 it can be seen that big data processing platforms consist of four types: (1) multipurpose systems, like Hadoop, Spark, Flink and Tez; (2) database processing using queries (big SQL), such as Hive, Spark SQL, Impala, HAWQ, IBM Big SQL and Presto; (3) large graph processing, such as Pregel, GraphLab and GraphX; and (4) stream processing (big stream), such as Storm, S4 (Simple Scalable Streaming System), InfoSphere Streams, Spark Streaming and Flink Streaming.

V. POSITION OF CURRENT BIG DATA RESEARCH AND FUTURE CHALLENGES

In the study [38], a survey of research trends on big data was conducted. The study examined original research papers published in journals indexed in ScienceDirect from 2006 to 2016 that use the term "big data" in the title (T), abstract (A) or keywords (K), i.e., 693 papers. The study classifies the papers along multiple dimensions, among others: time, topic, objective, artifact, application domain, discipline, journal and active author. This paper presents the dimensions considered important by the authors: time, topic, objective, artifact and application domain.

In terms of the time horizon, big data research began to develop in 2011 and has grown very rapidly since 2013. The top 3 topics are cloud, analytics and social. On the objective side, the majority of the papers address the performance dimension, followed by quality, security and scalability. For the artifact dimension, the most common artifact is the algorithm, followed by the framework and the architecture. As for the application domain dimension, the most common domain is earth, followed by energy, medicine, ecology, marketing and health. To obtain more insight into big data research, the study [38] also carried out a content analysis by mapping 64 articles containing at least one of the SMACIT words onto the big data management taxonomy proposed in [39]. On the basis of the content analysis it can be concluded that the majority of the research focuses on 3 big data techniques: classification, clustering and prediction.

The biggest challenge for future research is to improve the scalability of big data algorithms and frameworks, especially clustering and classification for predictive analysis, along with improving performance and quality by utilizing distributed computing technologies such as Hadoop, Storm, Flink, S4 and Spark Streaming (InfoSphere Streams).

VI. RESEARCH THAT HAS BEEN DONE

The authors have conducted research on big data in the period 2015-2017, beginning with research on the exploration of big data technology for community-based application systems [40]. The exploration results were tested on a bookkeeping application for micro and small enterprises (UMK) [41].

Furthermore, in 2016-2017, a Competitive/Applied Grant research project was conducted to prepare a big data infrastructure architecture for small and medium enterprises, both for operational management and for analysis. The infrastructure was used to develop HBase-based transactional applications, and HDFS-based and MapReduce-based analysis experiments were conducted on it. The results have been published in international journals [42], [43] and [44], and nationally [45].
REFERENCES

[1] M. Ferguson, "Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets," 2015.
[2] A. De Mauro, M. Greco, and M. Grimaldi, "A formal definition of Big Data based on its essential features," Libr. Rev., vol. 65, no. 3, pp. 122–135, Apr. 2016.
[3] J. Rowley, "The wisdom hierarchy: representations of the DIKW hierarchy," J. Inf. Sci., vol. 33, no. 2, pp. 163–180, Apr. 2007.
[4] A. De Mauro, M. Greco, and M. Grimaldi, "A formal definition of Big Data based on its essential features," Libr. Rev., vol. 65, no. 3, pp. 122–135, 2016.
[5] G. E. Moore, "Cramming more components onto integrated circuits, Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.," IEEE Solid-State Circuits Soc. Newsl., vol. 11, no. 3, pp. 33–35, Sep. 2006.
[6] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. (The Morgan Kaufmann Series in Data Management Systems), 2011.
[7] S. García et al., "Big data preprocessing: methods and prospects," Big Data Anal., vol. 1, no. 9, Nov. 2016.
[8] A. Kobsa, "Knowledge representation: a survey of its mechanisms, a sketch of its semantics," Cybern. Syst., vol. 15, no. 1–2, pp. 41–89, Jan. 1984.
[9] K. Trentelman, "Survey of Knowledge Representation and Reasoning Systems."
[10] H. L. Nguyen, Y. K. Woon, and W. K. Ng, "A survey on data stream clustering and classification," Knowl. Inf. Syst., vol. 45, no. 3, pp. 535–569, 2015.
[11] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD '00), pp. 71–80, 2000.
[12] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in Proc. Seventh ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD '01), 2001, pp. 97–106.
[13] P. Kranen, S. Günnemann, S. Fries, and T. Seidl, "MC-Tree: Improving Bayesian Anytime Classification," Springer, Berlin, Heidelberg, 2010, pp. 252–269.
[14] H.-P. Kriegel, P. Kröger, I. Ntoutsi, and A. Zimek, "Density Based Subspace Clustering over Dynamic Data," Springer, Berlin, Heidelberg, 2011, pp. 387–404.
[15] L. O'Callaghan, N. Mishra, and A. Meyerson, "Streaming-Data Algorithms for High-Quality Clustering," in Proc. 18th Int. Conf. on Data Engineering, p. 685, 2002.
[16] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann, "Indexing Density Models for Incremental Learning and Anytime Classification on Data Streams."
[17] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[18] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD '07), p. 133, 2007.
[19] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proc. Sixth SIAM Int. Conf. on Data Mining, pp. 328–339, 2006.
[20] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K. Zhang, "Density-based clustering of data streams at multiple resolutions," ACM Trans. Knowl. Discov. Data, vol. 3, no. 3, pp. 1–28, Jul. 2009.
[21] C. C. Aggarwal et al., "A Framework for Clustering Evolving Data Streams," 2003.
[22] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 103–114, 1996.
[23] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," in Proc. 1998 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '98), pp. 73–84, 1998.
[24] G. Karypis, E.-H. Han, and V. Kumar, "CHAMELEON: hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68–75, 1999.
[25] K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai, "E-Stream: Evolution-Based Technique for Stream Clustering," in Advanced Data Mining and Applications, Third Int. Conf. (ADMA 2007), Harbin, China, pp. 606–616, 2007.
[26] P. Kranen, I. Assent, C. Baldauf, and T. Seidl, "The ClusTree: indexing micro-clusters for anytime stream mining," Knowl. Inf. Syst., vol. 29, no. 2, pp. 249–272, Nov. 2011.
[27] A. Hinneburg and D. A. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise," in Proc. 4th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, pp. 58–65, 1998.
[28] W. Wang, J. Yang, and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining," in Proc. 23rd VLDB Conf., Athens, pp. 186–195, 1997.
[29] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," VLDB J., vol. 8, no. 3, pp. 289–304, 2000.
[30] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: an efficient and robust access method for points and rectangles," ACM SIGMOD Rec., vol. 19, no. 2, pp. 322–331, 1990.
[31] T. Kohonen, "The Self-Organizing Map," Proc. IEEE, vol. 78, pp. 1464–1480, 1990.
[32] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big Data 2.0 Processing Systems: Taxonomy and Open Challenges," J. Grid Comput., vol. 14, no. 3, pp. 379–405, 2016.
[33] A. Oussous, F.-Z. Benjelloun, A. A. Lahcen, and S. Belfkih, "Big Data technologies: A survey," J. King Saud Univ. - Comput. Inf. Sci., 2017.
[34] A. Holmes, Hadoop in Practice, MEAP, 2012.
[35] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System."
[36] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proc. 2010 IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), 2010, pp. 1–10.
[37] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters."
[38] J. Akoka, I. Comyn-Wattiau, and N. Laoufi, "Research on Big Data – A systematic mapping study," Comput. Stand. Interfaces, vol. 54, pp. 105–115, Nov. 2017.
[39] A. Siddiqa et al., "A survey of big data management: Taxonomy and state-of-the-art," J. Netw. Comput. Appl., vol. 71, pp. 151–166, Aug. 2016.
[40] G. Karya, C. Wijaya, and V. S. Moertini, "Eksplorasi Teknologi Big Data Hadoop Untuk Sistem Aplikasi Berbasis Komunitas Studi Kasus: Aplikasi Pembukuan UMK," Res. Rep. - Eng. Sci., vol. 2, Feb. 2016.
[41] G. Karya and V. S. Moertini, "Pengembangan Aplikasi Pembukuan Usaha Mikro Dan Kecil (UMK) Dengan Teknologi Mobile Cloud," Res. Rep. - Eng. Sci., vol. 2, 2014.
[42] V. S. Moertini and L. Venica, "Enhancing Parallel k-Means Using Map Reduce for Discovering Knowledge from Big Data," 2017.
[43] V. S. Moertini and L. Venica, "Parallel K-Means for Big Data: on Enhancing Its Cluster Metrics and Patterns," J. Theor. Appl. Inf. Technol., vol. 95, no. 8, 2017.
[44] V. S. Moertini, V. Kevin, and G. Karya, "Extracting Opinions from Big Data of Indonesian Customer Reviews Using Hadoop MapReduce," World Acad. Sci. Eng. Technol. Int. J. Comput. Inf. Eng., vol. 44, no. 410.
[45] G. Karya and V. S. Moertini, "Eksplorasi Teknologi Big Data Hadoop Untuk Sistem Aplikasi Berbasis Komunitas," J. RESTI (Rekayasa Sist. dan Teknol. Informasi), vol. 1, no. 2, pp. 160–169, Nov. 2017.
