This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2018.2812157, IEEE Transactions on Fuzzy Systems
Abstract—In recent years, many fields have been identified that experience a sudden proliferation of data, which increases both the volume of data that must be processed and the variety of formats in which the data is stored. This puts pressure on existing compute infrastructures and data analysis methods, as more and more data is considered a useful source of information for making critical decisions in particular fields. Among these fields are several areas related to human life, e.g., various branches of medicine, where the uncertainty of data complicates the data analysis, and where the inclusion of fuzzy expert knowledge in data processing brings many advantages.

In this paper, we show how fuzzy techniques can be incorporated in Big Data analytics carried out with the declarative U-SQL language over a Big Data Lake located on the Cloud. We define the concept of Big Data Lake together with the Extract, Process, and Store (EPS) process performed while schematizing and processing data from the Data Lake, and while storing results of the processing. Our solution, developed as a Fuzzy Search Library for Data Lake, introduces the possibility of (1) massively parallel, declarative querying of Big Data Lake with simple and complex fuzzy search criteria, (2) using fuzzy linguistic terms in various data transformations, and (3) fuzzy grouping. The presented ideas are exemplified by a distributed analysis of large volumes of biomedical data on the Microsoft Azure cloud.

Results of the performed tests confirm that the presented solution is highly scalable on the Cloud and is a successful step toward soft and declarative processing of data on a large scale. The solution presented in this paper directly addresses three characteristics of Big Data, i.e., volume, variety, and velocity, and indirectly addresses veracity and value.

Index Terms—Big Data, fuzzy logic, querying, Cloud computing, biomedical data analysis, declarative languages.

I. INTRODUCTION

their daily habits, including diet, physical activity, or smoking cigarettes, and finally identified diseases, are stored in special repositories. These repositories grow in size due to the growing number of patients, cases, and collected features that can be explored in order to draw interesting conclusions. The amount of data that may constitute a value for making important decisions is constantly growing due to modern techniques of data harvesting, newly identified data sources, and growing capacities of storage systems that can accommodate the growth of incoming large data volumes. The era of Big Data that we entered several years ago has changed our perception of the type and the volume of data that can be processed, as well as the value of data. This is now visible in many fields which are experiencing an explosion of data that are considered relevant, including social networks [1], [2], [3], [4], [5], [6], [7], [8], multimedia processing [9], internet of things (IoT) [10], intelligent transport [11], [12], [13], [14], medicine and bioinformatics [15], finance [16], and many others [17], [18], that face the problem of big data. The big data problem (or opportunity) usually arises when data sets are so large that the conventional database management and data analysis tools are insufficient to process them [19]. However, the large amount of data that must be processed (large volume) is not the only characteristic of big data. Apart from volume, big data also have other characteristics: velocity, variety, veracity, and value, which are together known as the 5V model. Big data solutions, including the one presented in this paper, usually address more than one of the Vs.

Biomedical data are an example of the type of data that have extensively proliferated in recent years. This prolifera-
1063-6706 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
similar treatment scenarios for patients with similar symptoms. On the other hand, incorporating routines for fuzzy processing in the data analysis pipeline allows generalizing, grouping, and aggregating the data, and thus changing the granularity of the information that we have to deal with. As a consequence, this provides a way to quickly reduce the volume of data from big to small, which is highly desirable in the era of Big Data. Biomedical data analysis, especially when performed on big biomedical data sets, calls for mechanisms that allow approximate information retrieval. This constitutes the first aim of the work presented in this paper.

Apart from large data sets that are generated as a result of technology-driven data proliferation, biomedical data is usually delivered in a variety of formats and may have a complex structure, e.g., this can be numerical data of biochemical markers from laboratory testing, image data for X-ray pictures or computed tomography, time series from EKG or EEG, or DNA microarray or next-generation sequencing data from molecular profiling. This makes the integration and structuralization of the data difficult or expensive, and leads to solutions that allow storing and processing the data in its native form. A Big Data Lake is a central location in which users can store all their data in its native form, regardless of its source or format. A Big Data Lake can be used as an environment for the development of in-depth analytics oriented toward fast decision making on the basis of raw data. Data analysis can be performed dynamically and in an ad hoc manner. Data need not be prepared and structuralized long before the beginning of the analysis. Usually, only a fraction of the data gets structuralized dynamically during the analysis and for the purpose of the analysis or possible secondary analyses performed later. This schema-on-read approach allows skipping expensive schematization of data, applying a schema as data are pulled out of a stored location, focusing on analyses driven by current needs, and keeping all the data even if just a part of the data constitutes a value for the analysis performed at the moment. As a consequence, data analysts are able to ask more powerful queries that translate data into more actionable insights. Such an approach speeds up the phase of data preparation before the data are analyzed and visualized, but it usually comes at the cost of losing the possibility of declarative retrieval of information, which is typically performed in relational databases through SQL-based querying. Moreover, it usually requires dedicated programmatic procedures, utilization of specific computational frameworks, like Apache Hadoop [23] or Apache Spark [24], for efficient processing of terabytes of data, and extensive computing resources that can be provisioned on demand from, e.g., the Cloud [25]. Declarative data manipulation in Big Data Lakes constitutes the second motivation for our work.

A. Related Works

World literature provides various solutions in terms of approximate processing of Big Data and declarative data manipulation. The first group of works is focused on declarative processing and transformation of data stored in relational databases and NoSQL databases by querying the databases with query languages, like SQL, containing fuzzy extensions. The declarative character of the SQL language encouraged scientists working with uncertainty and soft knowledge of experts to extend the relational database management systems (RDBMSs) toward fuzzy data processing by implementing appropriate procedures and functions in the programming language that is native for the particular database management system (DBMS). Examples of such implementations are SQLf [26], FQUERY [27], Soft-SQL [28], fuzzy Generalised Logical Condition [29], FuzzyQ [30], and fuzzy SQL extensions for relational databases [31], [32], [33], possibilistic databases [34], and data warehouses [35], [36]. These extensions to the SQL query language are noteworthy, as they deliver various fuzzy techniques for data exploration, like fuzzy filtering, fuzzy inference, generalization with fuzzy linguistic variables, and fuzzy grouping, which are also implemented in the solution presented in this paper. However, they do not address the problem of Big Data or any of the V characteristics, since they are mainly devoted to relational databases.

Recent trends in developing and using NoSQL databases naturally led to the transition of fuzzy techniques used in relational databases to the NoSQL model. Although NoSQL database systems are more specialized and domain-oriented, they are also more scalable to cope with large volumes of data. Works presented by Castelltort and Laurent [37], [38], [39] show fuzzy extensions, including fuzzy filtering and linguistic summaries, to the declarative Cypher query language used in querying Neo4j graph databases. Similar attempts can be found in [40], where the authors proposed a fuzzy NoSQL model to deal with large fuzzy databases hosted on the Neo4j graph database. In [41], Kacem and Touzi show how they extended the document-oriented MongoDB database and Mongo Query Language toward flexible querying with linguistic labels. Those solutions focus mainly on the volume characteristic of Big Data, based on the assumption that NoSQL databases are highly scalable. However, no results of performance tests were provided in any of these works, and not all of the solutions support declarative querying.

The second group of works describes approaches devoted to the processing and analysis of Big Data with the use of fuzzy techniques. Most of the works concentrate on clustering and classification of Big Data with the use of computational frameworks, like Apache Hadoop and Apache Spark. In [42], [43], the authors showed the classification of big data using a Fuzzy K-Nearest Neighbor classifier, speeding up the classification process on Apache Hadoop. MapReduce-based procedures for the classification task were also developed for the fuzzy rule based associative classifier presented in [44], the Fuzzy Rule Based Classification System (ChiFRBCS-BigData) proposed in [45], and its extension [46]. In [47], Segatori et al. show an efficient and scalable MapReduce-based implementation of Fuzzy Decision Trees used for fuzzy classification performed on Apache Spark. In [48], the authors propose a parallel implementation of the fuzzy minimals clustering algorithm (PFM), reporting a linear increase in performance. Reduction of the volume of the processed data is proposed in [49], where Li et al. apply two techniques, i.e., discretization of conditional attributes and fuzzification of class labels, to transform the original data set into a smaller one. There is
also a group of papers, like [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], which adopt fuzzy clustering methods, including various versions of the Fuzzy C-Means algorithm, to Big Data collections, very large databases, and data streams. Finally, several works, including [60], [61], [62], [63], show that Fuzzy Cognitive Maps and Linguistic Fuzzy Cognitive Maps are able to handle large scale data in pattern classification applications. Elements of fuzzy sets theory are also used in (big) data preprocessing [64], information searching [65], managing data privacy [66] and data access [67]. Querying Big Data has been raised in [68], where Novikov et al. discuss an algebraic layer for complex query processing by using adaptive abstract operations based on the concept of fuzzy set, which are needed to support uniform handling of different kinds of similarity processing. However, no implementation was made.

It is also worth noting a few works that are devoted to the application of fuzzy sets theory in medical data analysis, since various branches of medicine deliver large volumes of data [15]. In [69], Sundharakumar et al. propose a cloud-based system containing a fuzzy inference module for monitoring the health of patients on the basis of the analysis of streams of body sensor data with the use of Storm (the real-time computation system). The system is hosted on a private cloud, which thereby ensures security and scalability, and the approach addresses mainly the velocity characteristic of Big Data. However, no performance results were presented. An interesting approach for the reduction of the volume of data is proposed in [70], where Azar and Hassanien define and show a new neural-fuzzy classifier for dimensionality reduction in medical big data. The authors suggest that the proposed method simplifies the classification tasks by reducing the dimensionality of large data sets and by speeding up the learning process. In [71], Behadada et al. present a novel method to semi-automatically define fuzzy partition rules to provide a powerful and accurate insight into cardiac arrhythmia. The approach is based on text mining techniques applied to big data sets of freely available scientific papers provided by PubMed [72].

A broader classification of fuzzy set techniques used in Big Data processing is presented by Wang et al. in [73]. The wide spectrum of published works proves that the application of fuzzy techniques for data processing brings many advantages. Those of the presented solutions that are dedicated to Big Data usually assume reduction of the volume of processed data by using various fuzzy techniques while generalizing, grouping, and aggregating the data and classifying or assigning to clusters. However, they do not allow performing these operations in a declarative way. On the other hand, those solutions that enable declarative and approximate information retrieval are not dedicated to Big Data (SQL-based solutions), or need a dedicated storage model (NoSQL attempts).

B. Scope of this Work

In this paper, we present a novel, scalable, and universal solution for processing the data stored in a Big Data Lake with the use of fuzzy techniques and the declarative U-SQL query language. The presented solution allows exploring data coming from various domains by querying large data sets with the use of the declarative U-SQL language. We extended the U-SQL language for parallel querying by incorporating a variety of techniques that enable fuzzy data processing, including fuzzy searching, fuzzy transformation, linguistic terms assignment, and fuzzy grouping. These operations can now be performed on a large scale to transform the data, and thereby allow searching and grouping similar data on the basis of experts' knowledge, reduce the volume of the data, and prepare the data for further analysis and visualization. Data processing is highly parallelized in the Extract, Process, and Store (EPS) process, which can be scaled out in order to adjust to the volume of data stored in the Data Lake. Storing and processing of the data is performed on the Microsoft Azure public cloud, which ensures large scalability of the presented approach. Performance and scalability of the proposed solution are tested with the use of several data sets containing up to one billion rows of medical data related to cardiac disease. The presented work directly addresses at least 3Vs that characterize Big Data, i.e., volume, velocity, and variety, and indirectly, veracity and value. Moreover, the paper extends the spectrum of existing works by: (1) Showing that declarative, fuzzy querying can be performed efficiently and effectively in big data sets when implemented in highly scalable environments, which is confirmed by the performed experiments (Sect. IV). (2) Enabling fuzzy techniques that allow processing and analyzing raw and schema-free data coming from various data sources (Sect. III). (3) Providing a formal definition of the Big Data Lake concept and the EPS process (Sect. II-A). (4) Providing formal definitions of the fuzzy operations performed on big data (Sect. II-B).

II. METHODS

A Big Data Lake changes the way data are stored and managed within the IT infrastructure of an institution, and enables responding to the changing requirements for data analysis in various domains. While this shift often enables immediate access to the data, without waiting for it to be cleaned, modeled, structuralized, and loaded, it also requires changing the way data are prepared for the analysis.

A. Data Lake and EPS Process

A data lake allows quickly consolidating various types of data in one place. The data may come from various data sources, can be logically related, and can be stored in raw form: structured, semistructured, or unstructured.

Definition 1: Formally, we define Data Lake as a pair:

\[ DL = \{V, M\}, \tag{1} \]

where V is a set of values in the Data Lake, and M is a set of metadata describing values in the Data Lake DL.

If:

\[ \forall_{v \in V}\ \exists_{m_d \in M}\ f_d(v) = m_d, \tag{2} \]

where m_d represents metadata that describe the name of the attribute for the value v, and f_d is a function assigning the name of the attribute to the value, then we call the data lake DL fully described in terms of attribute names.

If:

\[ \forall_{v \in V}\ \exists_{m_t \in M}\ f_t(v) = m_t, \tag{3} \]
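Definition 1 and condition (2) can be made concrete with a small sketch. The Python below is only an illustration under our own assumptions: a value is modeled as a (file, row, column index, raw text) tuple, f_d is a lookup by column position, and the names `DataLake`, `fully_described`, and `name_by_column` are hypothetical, not part of the paper or of Azure Data Lake. The sample values are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

# Hypothetical value representation: (file, row, column index, raw text).
Value = Tuple[str, int, int, str]


@dataclass(frozen=True)
class DataLake:
    """The pair from Eq. (1): a set of values V and a set of metadata M."""
    values: FrozenSet[Value]
    metadata: FrozenSet[str]


def fully_described(dl: DataLake, f_d: Callable[[Value], Optional[str]]) -> bool:
    """Condition (2): every value v is mapped by f_d to some item of M."""
    return all(f_d(v) in dl.metadata for v in dl.values)


# Illustrative attribute names only; not taken from the paper's data set.
COLUMN_NAMES = {0: "Age", 1: "Cholesterol"}


def name_by_column(v: Value) -> Optional[str]:
    """Assumed f_d: look the attribute name up by column position."""
    return COLUMN_NAMES.get(v[2])


lake = DataLake(
    values=frozenset({("heart.csv", 0, 0, "57"), ("heart.csv", 0, 1, "192")}),
    metadata=frozenset({"Age", "Cholesterol"}),
)
# Same lake plus one value in column 2, for which no attribute name is known.
partial = DataLake(
    values=lake.values | {("heart.csv", 0, 2, "86")},
    metadata=lake.metadata,
)

if __name__ == "__main__":
    print(fully_described(lake, name_by_column))
    print(fully_described(partial, name_by_column))
```

In this sketch the first lake satisfies condition (2), while the second does not, because one of its values has no attribute name in M.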
\[ \tilde{\sigma}^{\lambda}_{A_i \approx v}(R) = \{ t : t \in R,\ \mu_v(t(A_i)) \geq \lambda \}, \tag{9} \]

where µ_v is a membership function of a fuzzy set v.

The fuzzy set v can be defined by various types of membership functions, including triangular, trapezoidal, and Gaussian.

For example, the fuzzy selection with the search condition Cholesterol is high, for the fuzzy set high cholesterol defined as presented in Fig. 2 and λ = 0.5, produces the following rowset (\( \tilde{\sigma}^{0.5}_{Cholesterol \approx high}(Patients) \)).

    Id  Age  Cholesterol
    --  ---  -----------
    1   45   197
    5   72   210

• \( \tilde{T}_L(A_i) \) assigns and returns a linguistic value l of the defined linguistic variable \( L = \{l_1, l_2, \ldots, l_m\},\ m \in \mathbb{N}^+ \), for the value of the attribute A_i from each tuple of the rowset R:

\[ \forall_{t \in R}\ \exists_{l \in L}\ \tilde{T}_L(t(A_i)) = l \ \wedge\ \mu_l(t(A_i)) = \max\{\mu_{l_1}(t(A_i)), \mu_{l_2}(t(A_i)), \ldots, \mu_{l_m}(t(A_i))\}. \tag{16} \]

For example, the extended projection with fuzzy transformation (both types) of the Cholesterol attribute on the basis of the Cholesterol linguistic variable (ChV), defined as presented in Fig. 3, produces the following rowset.
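As a rough illustration of the fuzzy selection in Eq. (9) and the linguistic value assignment in Eq. (16), the following Python sketch evaluates both on a tiny rowset. It is not the Fuzzy Search Library: the function names are ours, the trapezoid parameters are borrowed from the ad hoc Cholesterol definition in Listing 1 rather than from Fig. 2 or Fig. 3, and the patient rows are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int, int]  # (Id, Age, Cholesterol); hypothetical layout
Membership = Callable[[float], float]


def trapezoid(a: float, b: float, c: float, d: float) -> Membership:
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear between."""
    def mu(x: float) -> float:
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


# Assumed linguistic variable; parameters follow Listing 1, not Fig. 2 or 3.
CHOLESTEROL: Dict[str, Membership] = {
    "Normal": trapezoid(0, 0, 190, 200),
    "High": trapezoid(190, 200, 240, 250),
    "VeryHigh": trapezoid(240, 250, 1000, 1000),
}


def fuzzy_select(rows: List[Row], attr: int, mu: Membership, lam: float) -> List[Row]:
    """Eq. (9): keep the tuples whose membership degree is at least lambda."""
    return [t for t in rows if mu(t[attr]) >= lam]


def linguistic_value(x: float, variable: Dict[str, Membership]) -> str:
    """Eq. (16): return the linguistic value with the maximal membership."""
    return max(variable, key=lambda name: variable[name](x))


# Made-up rows for illustration only.
PATIENTS: List[Row] = [("a", 45, 197), ("b", 51, 165), ("c", 63, 248), ("d", 72, 210)]

if __name__ == "__main__":
    print(fuzzy_select(PATIENTS, 2, CHOLESTEROL["High"], 0.5))
    print([(t[0], linguistic_value(t[2], CHOLESTEROL)) for t in PATIENTS])
```

Under these assumed parameters, a cholesterol level of 197 belongs to High with degree 0.7 and therefore passes the threshold λ = 0.5, while 248 belongs to High only with degree 0.2 and is assigned VeryHigh by the maximum rule.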
with linguistic variables defined based on the knowledge of experts specific for particular data domains. This allows a collection of libraries for processing various types of data to be created independently. Sample C# code for defining linguistic variables is presented in Appendix D in the Supplement.

We have also made it possible to define linguistic variables directly in the U-SQL code. This allows for dynamic defining of variables that are not known before the implementation of the Process phase. In the U-SQL code they are represented as string variables and formatted according to the grammar presented in Appendix E (Supplement). Dynamically defined linguistic variables are managed by methods available through the Udl module of the Fuzzy Search Library. The capability of defining linguistic variables in this way has two advantages:
• Dynamically defined linguistic variables can be created ad hoc in the U-SQL scripts.
• They do not need to be compiled to external libraries.

A sample definition of a linguistic variable created in this way is presented in Listing 1.

    DECLARE @cholesterolVar string = "Cholesterol:Normal,T,0,0,190,200/High,T,190,200,240,250/VeryHigh,T,240,250,1000,1000";

Listing 1. Sample U-SQL code for ad hoc defining of the linguistic variable Cholesterol with its linguistic values.

IV. EXPERIMENTAL RESULTS

Our extensions for fuzzy data processing in Big Data Lake have been tested extensively in order to verify their functionality, effectiveness, and performance. In this section, we show sample U-SQL queries that demonstrate the capabilities of the proposed extensions and the implementation of the EPS process. We also present results of performance tests that reveal the scalability of our solution for Big Data.

A. Data Set and Processing Environment

All tests of the presented solution were carried out with the use of the biomedical Heart Disease data set [76] from the UCI Machine Learning Repository. This data set contains 75 attributes, but we present our experiments by using a subset of them (mainly based on the Cleveland database). In order to test the performance, records were proliferated up to 100 million, 500 million, and 1 billion for various experiments. To examine the flexibility of fuzzy querying, during the data proliferation, existing values of particular attributes were multiplied by a randomly generated factor from the range [0;1] in order to differentiate the data and obtain similar, i.e., not identical, cases. Data proliferation and performance tests were conducted in the Azure Data Lake environment on the Microsoft Azure public cloud. Proliferated data were stored as CSV files in the Data Lake Store on the Cloud. U-SQL scripts were executed as distributed jobs in DLA, thus performing parallel data processing for all phases of the EPS.

B. Data Extraction

Data extraction (the Extract phase) was performed with the use of the code presented in Listing 2. This part of the U-SQL script extracts data by using the EXTRACT phrase (line 10) from the big data file (lines 8 and 13) stored in the CSV format. The CSV built-in extractor is used for this purpose (line 14). The extraction from the Big Data Lake produces the rowset @data (line 9), which is used in the Process phase of the EPS. The schema of the rowset is defined on read (lines 10–12) by specifying a column name and a C# type name for each extracted attribute of the Data Lake. Data extraction is preceded by loading the Fuzzy Search Lib and Heart Disease Data libraries into the script context (lines 1–2). The former exposes all fuzzy extensions to U-SQL (fuzzy modules described in Sect. III-B). The latter delivers domain-specific fuzzy linguistic variables used in data processing. Short names for modules exposed by both libraries are mapped in lines 3–5 to simplify the syntax of U-SQL queries used in the Process phase.

     1  REFERENCE ASSEMBLY HeartDisease.HeartDiseaseData;
     2  REFERENCE ASSEMBLY HeartDisease.FuzzySearchLib;
     3  USING Udfs = FuzzySearchLib.Udfs;
     4  USING Variables = HeartDiseaseData.Variables;
     5  USING Udl = FuzzySearchLib.Udl;
     6
     7  // Extract data
     8  DECLARE @dataPath string = "/in/HeartDisBig.csv";
     9  @data =
    10      EXTRACT Id string, Age int, Sex int, ...,
    11              Cholesterol int, CigarettesPerDay int, ...,
    12              RestHr int, ..., Disease bool
    13      FROM @dataPath
    14      USING Extractors.Csv();

Listing 2. A part of a U-SQL script extracting data and producing the @data rowset.

A part of the rowset produced by the presented U-SQL script has the following form (column headers were formatted manually).

    Id            Age  Sex      Chol-l  Cig-PerDay      RestHR      Disease
    -----------   ---  ---      ------  ----------      ------      -------
    474a1bce-..   57   M    ..  192     75          ..  86      ..  True
    d3258689-..   53   M    ..  203     20          ..  86      ..  False
    3ee478ee-..   48   M    ..  229     0           ..  75      ..  False
    e5724521-..   54   M    ..  239     20          ..  86      ..  True
    33be448a-...  49   M    ..  188     30          ..  78      ..  False
    93e06584-..   64   M    ..  211     30          ..  58      ..  True
    8e4d191d-..   66   F    ..  226     0           ..  49      ..  True
    ...

C. Examples of U-SQL Queries Extended with Fuzzy Processing Capability

In this section, we show the use of various fuzzy techniques in data transformations performed in the Process phase. The presented sample query scenarios (SQ1–SQ4) show how the Fuzzy Search Library can be used to process data in the Data Lake.

1) Fuzzy selection with the use of fuzzy values (SQ1): This simple example shows filtering with the use of a fuzzy condition built on the basis of a fuzzy value. In this example we perform the following operation: Retrieve patients whose age is around 50 years old. The U-SQL query implementing this operation is shown in Listing 3.

    @age =
        SELECT Id, Age,
               Udfs.Around(Age, 50, 45).Value AS MemberDg
        FROM @data
        WHERE Udfs.Around(Age, 50, 45).Value >= 0.5;

Listing 3. Sample U-SQL query retrieving patients whose age is around 50 years old.
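The ad hoc definition string in Listing 1 can be unpacked roughly as sketched below. The actual grammar is given in Appendix E of the Supplement, which is not reproduced here, so this Python sketch rests on our reading of the single example: a variable name, a colon, and linguistic values separated by slashes, each written as a label, a shape code, and numeric parameters. Interpreting the code T as a trapezoidal function with four parameters is an assumption, and the function names are hypothetical, not the Udl module's API.

```python
from typing import Dict, Tuple

Term = Tuple[str, Tuple[float, ...]]  # (shape code, numeric parameters)


def parse_linguistic_variable(spec: str) -> Tuple[str, Dict[str, Term]]:
    """Split 'Name:Label,Shape,p1,...,pn/Label,...' into name and terms.

    Assumed format, inferred from the single example in Listing 1.
    """
    name, _, body = spec.partition(":")
    terms: Dict[str, Term] = {}
    for chunk in body.split("/"):
        fields = [f.strip() for f in chunk.split(",")]
        label, shape = fields[0], fields[1]
        terms[label] = (shape, tuple(float(p) for p in fields[2:]))
    return name.strip(), terms


def membership(term: Term, x: float) -> float:
    """Evaluate a term; only the assumed trapezoidal code 'T' is handled."""
    shape, params = term
    if shape != "T":
        raise ValueError(f"unsupported shape code: {shape}")
    a, b, c, d = params
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


SPEC = ("Cholesterol:Normal,T,0,0,190,200/High,T,190,200,240,250/"
        "VeryHigh,T,240,250,1000,1000")

if __name__ == "__main__":
    name, terms = parse_linguistic_variable(SPEC)
    print(name, list(terms))
    print({label: membership(term, 197) for label, term in terms.items()})
```

Keeping the definition as a plain string is what lets a script introduce a new linguistic variable without compiling an external library; the sketch only shows that such a string carries enough information to rebuild the membership functions.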
TABLE III
EFFECTIVENESS OF FUZZY QUERYING IN SQ3 FOR VARIOUS λ

    λ      Selc   RCard  Flex   Conf   Uncr   FPrec
    1.0    0.412  0.412  0.000  1.000  0.000  1.000
    0.9    0.412  0.412  0.000  1.000  0.000  1.000
    0.8    0.426  0.423  0.031  0.975  0.025  0.994
    0.7    0.426  0.423  0.031  0.975  0.025  0.994
    0.6    0.439  0.431  0.061  0.957  0.043  0.982
    0.5    0.439  0.431  0.061  0.957  0.043  0.982
    0.4    0.453  0.437  0.089  0.945  0.055  0.964
    0.3    0.453  0.437  0.089  0.945  0.055  0.964
    0.2    0.466  0.439  0.115  0.939  0.061  0.942
    0.1    0.466  0.439  0.115  0.939  0.061  0.942
    0.0    0.466  0.439  0.115  0.939  0.061  0.942
    Crisp  0.412  0.412  0.000  1.000  0.000  1.000

processed (e.g., distribution of values of an attribute) and the purpose of processing. If the user is rather precise in terms of the set of results and given search conditions, but still wants to obtain a few similar cases, they should start with high values of the λ similarity threshold, e.g., λ = 0.9 or λ = 0.95. Then, the value of the parameter can be decreased to relax the search conditions and include more rows with similar values of an attribute. If the purpose of the fuzzy querying is a preselection, a user may start with lower values of the λ parameter (e.g., λ = 0.5 or lower), and then gradually increase the value, and decide whether the outcome is properly narrowed. On the other hand, grouping with the use of linguistic variables may narrow the rowset to a representative form that is user friendly and suitable for presentation, visualization, or further analysis. For example, the query presented in SQ4 returns only five rows (five groups of records, card(R_F) = 5) on the basis of an arbitrarily defined linguistic variable with five linguistic values for the number of cigarettes smoked per day. Crisp grouping by the CigarettesPerDay attribute returns 197 groups (card(R_C) = 197), which is many more, even though the number of cigarettes that a person is able to smoke per day (the domain) is limited.

E. Performance Tests

We have also conducted a series of tests in order to examine the performance of the presented solution. For clarity of presentation, we show results for query scenarios SQ1–SQ5 explained in Sect. IV-C and in the Supplement. They reflect a general tendency that we observed while processing data from the Data Lake in the presented cloud computing environment. In all cases, performance was assessed on the basis of execution time measurements collected when performing the whole EPS process (including all phases). We carried out at

processing in Azure Data Lake with U-SQL query scenarios SQ1–SQ5. As can be observed, with the growing number of allocation units, the time needed to complete the whole EPS process for all query scenarios SQ1–SQ5 drops significantly (the time is plotted on a log10 scale). For example, for query scenario SQ4, the execution time was reduced from 3 hours and 35 minutes (exactly 12,910 seconds) on one AU to 185 seconds on 80 AUs. Eighty allocation units were the maximum number of AUs needed to process the data set, since the size of the data set did not require engaging more computing power. It is worth noting that the execution time curves for particular query scenarios SQ1–SQ5 follow the same pattern: the dynamics of computation speed is not constant and, interestingly, changes in the same way for all query scenarios. This is caused by the allocation of compute units to particular portions of data that are processed in parallel. The largest of the data sets that we used (\(10^9\) rows) was always divided into the same number of 80 chunks (the number depends on the size of data) that were processed by various numbers of AUs. Each data chunk is processed by a single AU. The allocation of data chunks to the available compute units was then more or less optimally adjusted, which resulted in variable dynamics of the computation speed. When using 80 AUs, each data chunk is assigned to one available AU, and the processing of the whole data set is performed in one iteration of computations (80 AUs process 80 data chunks). However, processing 80 data chunks with the use of 40 AUs requires two iterations: 40 data chunks are processed by 40 AUs in each of the two iterations of computations. Processing the same data set with the use of 60 AUs still requires two iterations: 60 data chunks are processed by 60 AUs in the first iteration, and 20 data chunks by 20 AUs in the second iteration, leaving 40 AUs idle. Therefore, the execution time remains the same for processing scenarios performed with 40 to 70 AUs, as shown in Fig. 6, since they all need two iterations for data processing. This dependency is less visible if the number of AUs is much smaller than the number of data chunks that must be processed.

In order to better illustrate the results, we show the acceleration of fuzzy data processing when scaling the EPS process in Azure Data Lake from 1 to 80 AUs (Fig. 7). The acceleration is calculated according to Eq. (18) and obtained by comparing execution times to the case in which a particular query scenario SQ1–SQ5 was performed by using only one AU.

\[ S_d = \frac{T_1}{T_d}, \tag{18} \]

where d is the number of AUs in use, T_1 is the execution time obtained while performing the U-SQL query scenario with the use of one AU, and T_d is the execution time obtained while
least three replicas for each measurement. Then, the obtained performing a particular query scenario with the use of d AUs.
results were averaged, and the averaged values are presented As can be observed in Fig. 7, the speedup is sublinear,
in this section. Averaged values of measurements were also which results from the fact that not all component steps of
used to determine n-fold speedups when scaling computations the ADL job can be equally parallelized — in some steps
in Azure Data Lake. some AUs were idle. All speedup curves follow the same
Fig. 6 shows how the execution time depends on the trend showing that adjusting the degree of parallelism to the
executed query and the number of compute nodes, called data size (and the number of data chunks) by assigning an
allocation units (AUs), when performing parallel, fuzzy data appropriate number of AUs for EPS job execution allows the
Fig. 6. Execution time for varying number of allocation units (AUs) for parallel, fuzzy data processing in Azure Data Lake with U-SQL query scenarios SQ1–SQ5. Time plotted on log10 scale. Tests performed with a biomedical data set (extracted rowset) containing a billion ($10^9$) rows. (Plot: x-axis "# AUs (compute nodes)", one curve per scenario SQ1–SQ5.)

Fig. 8. Execution times for varying volume of data (millions of rows) in the extracted data set for parallel, fuzzy data processing in Azure Data Lake with U-SQL query scenarios SQ1–SQ5. Tests performed with nine allocation units (#AUs=9). (Plot: x-axis "# rows (millions)", one curve per scenario SQ1–SQ5.)
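The plateaus visible in Fig. 6 can be reproduced with a small idealized model of the chunk-to-AU allocation. The sketch below assumes that every chunk takes the same time, that one AU processes exactly one chunk per iteration, and that there is no scheduling or merging overhead; the function names are hypothetical and are not part of U-SQL or FSL4DLA. Under these assumptions the speedup of eq. 18 reduces to a ratio of iteration counts, which is an upper bound and not the measured, sublinear speedup.

```python
def iterations(chunks: int, aus: int) -> int:
    """Rounds needed when each AU processes one chunk per round (ceiling division)."""
    if chunks < 0 or aus < 1:
        raise ValueError("chunks must be >= 0 and aus must be >= 1")
    return -(-chunks // aus)


def idle_in_last_iteration(chunks: int, aus: int) -> int:
    """AUs left without a chunk in the final round."""
    rounds = iterations(chunks, aus)
    if rounds == 0:
        return aus
    busy_last = chunks - (rounds - 1) * aus
    return aus - busy_last


def speedup(t1: float, td: float) -> float:
    """Eq. 18: S_d = T_1 / T_d for measured execution times."""
    return t1 / td


def ideal_speedup(chunks: int, d: int) -> float:
    """Eq. 18 under the idealized model: time is proportional to the round count."""
    return speedup(iterations(chunks, 1), iterations(chunks, d))


# 80 chunks, as reported for the 10^9-row data set
for d in (20, 40, 60, 70, 80):
    print(d, iterations(80, d), idle_in_last_iteration(80, d), ideal_speedup(80, d))
```

In this model, any number of AUs from 40 to 79 needs two iterations for 80 chunks, so adding AUs within that range only increases the number of idle units in the second iteration.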
The solution that we have developed and presented in the paper satisfies these requirements. It allows performing the full EPS process over the Big Data Lake, in which data are extracted, processed, and transformed, and finally stored in the selected place and format, suitable for decision makers. The Fuzzy Search Library for Azure Data Lake makes it possible to formulate imprecise search criteria, find similar cases, incorporate domain-specific expert knowledge into the data analysis, benefit from linguistic terms frequently used in real life, and change the granularity of data by fuzzy data grouping. Extensions developed for the U-SQL language enable declarative processing of the data through querying rowsets produced during data extraction. Finally, implementation on the Azure Data Lake located on the cloud guarantees wide scaling capabilities, limited only by the Azure subscription in use.

This work complements the collection of existing solutions, presented in Sect. I-A, by delivering a new functionality for big data analytics. The novelty of the presented solution lies mainly in the application of techniques of fuzzy querying in big data environments, which meets the requirements of modern information processing and big data challenges. Although several fuzzy extensions were previously developed for the declarative SQL language and relational database management systems, they were not dedicated to processing big data and, like some NoSQL solutions, rely on a strictly defined schema or format. Our approach allows defining schema on read during the data extraction, and querying produced rowsets in parallel on a highly scalable cloud infrastructure. On the other hand, those solutions presented in Sect. I-A that were dedicated to processing Big Data were mainly focused on data clustering and classification with the use of MapReduce-based procedures executed on Hadoop or Spark clusters. Both Hadoop and Spark are also available as the HDInsight service in Data Lake Analytics. However, we decided to provide the scalability of our solution by extending the functionality of U-SQL code, which enables a declarative way of data processing and can also be highly parallelized on many allocation units (compute nodes). The functionality of our Fuzzy Search Library for Data Lake Analytics also considerably extends the spectrum of methods used for processing big data. Our solution does not directly extend the set of fuzzy techniques that can be used in general, but it extends the set of operations that can be performed on large, schema-free (variable) data sets, elevating the value of the outcome through enrichment provided by the fuzzy techniques that are applied.

The presented Fuzzy Search Library for Data Lake Analytics has many functional advantages over the existing solutions, which are inherited from the novel technologies that it is based on. One of the most important advantages is the theoretically unlimited scalability that allows quick adaptation to the growing volume of data, which is impossible to achieve with relational databases that provide similar capabilities of fuzzy querying. The second one is the simplicity of scaling the distributed fuzzy querying process without the necessity to reconfigure the whole execution (hardware and software) environment. In contrast to Hadoop/Spark-based solutions and NoSQL databases, which in case of scaling would require reconfiguration of the whole cluster (e.g., adding new cluster nodes), the number of AUs in Azure Data Lake, i.e., the degree of parallelism, is flexibly controlled by a single runtime parameter. To a U-SQL developer, it looks as if they had a huge computer cluster with many nodes and could decide how many of them to use when executing the EPS query job. Unlike other solutions, the presented one provides the capability to process not only structured, but also unstructured and schema-free data stored in the native format in the big data lake through a unified interface of extractors that represent the data as a rowset. Our solution also simplifies fuzzy processing and transformations of data by using declarative U-SQL SELECT expressions, in contrast to Hadoop/Spark-based processing, which requires adaptation to a particular processing model, e.g., MapReduce, and implementation of dedicated processing functions. Fuzzy techniques provided in the form of functions (UDFs in modules of FSL4DLA) that are invoked directly from U-SQL queries allow not only to process data in a declarative way, but also to optimize U-SQL execution plans, to use fewer allocation (compute) units, and to decrease the cost of data processing. Modularization of the FSL4DLA and separation from the domain knowledge facilitate the extension of the functionality of the presented solution with the use of new libraries of user-defined functions. Finally, there is no need to maintain any (relational/NoSQL) database cluster or (Hadoop/Spark) computational cluster, either on premises or on the cloud, since Azure Data Lake is maintained by a cloud provider. Moreover, U-SQL developers pay only for the computation time related to U-SQL query execution on the cloud and the number of used allocation units, not just for having the cluster alive, even if nothing is processed on it.

It is also worth mentioning how our DLA-based solution solves the problem of Big Data and addresses the Vs characteristics. By massive parallelization of computations related to data extraction, processing, and storing on the Cloud, we can largely reduce the time of data processing without decreasing the workload. This task parallelism addresses the volume characteristic of Big Data. A part of the Big Data Lake, which is currently being processed, is divided into smaller pieces, then each such piece is processed separately in parallel, and results are merged together and stored again in the Data Lake. In terms of the variety, our solution allows for the structuralization of data stored in Data Lake by building table-like rowsets over the unstructured data kept in various formats. The Data Lake does not have to be fully described, both in terms of attribute names and data types, as the rowset schema is defined dynamically, on data read. Moreover, for complex data, like images coming from medical imaging or genetic data from DNA next-generation sequencing experiments, custom extractors for building rowsets provide the way for representing data in a structured form. The presented solution is also highly scalable on the Azure cloud, which allows accommodating the growth of data in environments or domains in which data are generated fast and decisions must be made quickly on the basis of the performed analyses. This feature supports the velocity characteristic of Big Data. We also strongly believe that the presented fuzzy extensions to the U-SQL language will allow making better decisions on the
basis of more valuable information mined from the Data Lake. The value that is hidden in the big data can be revealed by finding and associating similar cases with the use of flexible filtering and fuzzy grouping methods delivered by the Fuzzy Search Library. These operations, performed as a part of the data processing phase, allow incorporating the knowledge of experts not only in the analysis of biomedical data, as presented here, but also in other domains.

Among limitations of the presented solution, we have to mention the necessity to operate on the specific cloud platform and to adapt to the specificity of the Azure Data Lake environment. FSL4DLA is a fully cloud-dedicated solution. There is no possibility to use the FSL4DLA on a large scale on hardware kept on premises, although local deployments are also possible for testing purposes. Moreover, U-SQL developers have to pay for computations related to U-SQL query execution. On the other hand, they do not have to cover the costs of maintenance of the whole hardware infrastructure kept on premises.

Our intention was to develop the Fuzzy Search Library for Big Data Lake as a universal tool that delivers methods for scalable, fuzzy data processing on the Azure cloud for various domains of analyzed data. In the presented solution, experts’ knowledge is modularized in domain-specific libraries that can be provided separately for various domains by various experts. These domain-specific libraries are independent from the Fuzzy Search Library. However, the Fuzzy Search Library provides common methods for translating the experts’ knowledge into efficient data analysis of Big Data. This comes at the cost of fitting into our implementation model for domain-specific libraries, but we believe it can be beneficial for many domains and for the future reuse of once created libraries of experts’ knowledge. With this paper, we hope to attract experts and specialists who use fuzzy techniques in processing data in their areas of interest to create domain-specific libraries and make them freely available for the community of users.

ACKNOWLEDGMENT

This work was supported by Microsoft Research within Microsoft Azure for Research Award grant, and Statutory Research funds of Institute of Informatics, Silesian University of Technology, Gliwice, Poland (grant No BK/213/RAU2/2018).

Fuzzy Search Library for Data Lake Analytics (FSL4DLA) is available at (http://www.zti.aei.polsl.pl/w3/dmrozek/science/fsl4dla.htm). Users must use their own Microsoft Azure subscription in order to run the FSL4DLA on the cloud.