This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2018.2812157, IEEE Transactions on Fuzzy Systems.

Soft and Declarative Fishing of Information in Big Data Lake
Bożena Małysiak-Mrozek, Member, IEEE CI, Marek Stabla, and Dariusz Mrozek, Senior Member, IEEE SMC, EMBS

Abstract—In recent years, many fields have experienced a sudden proliferation of data, which increases the volume of data that must be processed and the variety of formats in which the data are stored. This puts pressure on existing compute infrastructures and data analysis methods, as more and more data are considered a useful source of information for making critical decisions in particular fields. Among these fields are several areas related to human life, e.g., various branches of medicine, where the uncertainty of data complicates the data analysis, and where the inclusion of fuzzy expert knowledge in data processing brings many advantages.

In this paper, we show how fuzzy techniques can be incorporated in Big Data analytics carried out with the declarative U-SQL language over a Big Data Lake located on the Cloud. We define the concept of Big Data Lake together with the Extract, Process, and Store (EPS) process performed while schematizing and processing data from the Data Lake, and while storing results of the processing. Our solution, developed as a Fuzzy Search Library for Data Lake, introduces the possibility of (1) massively parallel, declarative querying of Big Data Lake with simple and complex fuzzy search criteria, (2) using fuzzy linguistic terms in various data transformations, and (3) fuzzy grouping. The presented ideas are exemplified by a distributed analysis of large volumes of biomedical data on the Microsoft Azure cloud.

Results of the performed tests confirm that the presented solution is highly scalable on the Cloud and is a successful step toward soft and declarative processing of data on a large scale. The solution presented in this paper directly addresses three characteristics of Big Data, i.e., volume, variety, and velocity, and indirectly addresses veracity and value.

Index Terms—Big Data, fuzzy logic, querying, Cloud computing, biomedical data analysis, declarative languages.

B. Małysiak-Mrozek and D. Mrozek are with the Institute of Informatics, Silesian University of Technology, Gliwice, 44-100 Poland (e-mail: bozena.malysiak@polsl.pl). M. Stabla was with the Institute of Informatics, Silesian University of Technology, Gliwice, 44-100 Poland. Manuscript received April 19, 2017; revised February 26, 2018.

I. INTRODUCTION

THE pace of changes in human life, especially in big cities, is constantly increasing. Such a fast pace of life does not leave human health unaffected and has a negative effect on the state of health. Intense and often stressful work that frequently exceeds the allowed number of hours, poor diet, smoking, and the lack of adequate sleep, sufficient physical exercise, and time to rest and relax cause a sharp increase in the incidence of cardiovascular diseases, including high blood pressure, high cholesterol, cardiac arrhythmias, and finally heart attacks or strokes. Currently, many cardiology centers have been performing research on the relationship between the quality of lifestyle, proper diet, smoking, and the occurrence of cardiovascular diseases. The collected data describing patients, their daily habits, including diet, physical activity, or smoking cigarettes, and finally identified diseases, are stored in special repositories. These repositories grow in size due to the growing number of patients, cases, and collected features that can be explored in order to draw interesting conclusions. The amount of data that may constitute a value for making important decisions is constantly growing due to modern techniques of data harvesting, newly identified data sources, and growing capacities of storage systems that can accommodate the growth of incoming large data volumes. The era of Big Data that we entered several years ago has changed our imagination about the type and the volume of data that can be processed, as well as the value of data. This is now visible in many fields which are experiencing an explosion of data that are considered relevant, including social networks [1], [2], [3], [4], [5], [6], [7], [8], multimedia processing [9], the Internet of Things (IoT) [10], intelligent transport [11], [12], [13], [14], medicine and bioinformatics [15], finance [16], and many others [17], [18], that face the problem of big data. The big data problem (or opportunity) usually arises when data sets are so large that conventional database management and data analysis tools are insufficient to process them [19]. However, the large amount of data that must be processed (large volume) is not the only characteristic of big data. Apart from volume, big data also have other characteristics: velocity, variety, veracity, and value, which are together known as the 5V model. Big data solutions, including the one presented in this paper, usually address more than one of the Vs.

Biomedical data are an example of the type of data that have extensively proliferated in recent years. This proliferation has resulted from the ongoing technological progress in monitoring patients and collecting appropriate information on their state of health directly by doctors or, remotely, through telemedicine systems. Biomedical data are also the type of data for which the use of fuzzy techniques in data processing is justified and brings many advantages. This is motivated by two factors. (1) Fuzzy logic allows efficient representation of expert knowledge, especially in the case of imprecise data, and allows incorporating it into the data analysis process [20]. (2) Biomedical data tend to miss many attribute values in a diagnosis made by a doctor in clinics and health facilities, which increases the uncertainty of the information [21]. Fuzzy sets [22] can be used to represent information with uncertainty and inaccuracy, or to formulate imprecise search criteria in the information retrieval process. For example, similar values of health markers and results of laboratory tests, like blood pressure, cholesterol, glycemic index, BMI, together with other data, like age and gender, may lead to preparation of

1063-6706 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
similar treatment scenarios for patients with similar symptoms. On the other hand, incorporating routines for fuzzy processing in the data analysis pipeline allows generalizing the data, grouping and aggregating it, and thus changing the granularity of information that we have to deal with. As a consequence, this provides a way to quickly reduce the volume of data from big to small, which is highly required in the era of Big Data. Biomedical data analysis, performed especially for big biomedical data sets, calls for mechanisms that would allow approximate information retrieval. This constitutes the first aim of our works presented in this paper.

Apart from large data sets that are generated as a result of technology-driven data proliferation, biomedical data are usually delivered in a variety of formats and may have a complex structure, e.g., numerical data of biochemical markers from laboratory testing, image data from X-ray pictures or computed tomography, time series from EKG or EEG, or DNA microarray or next-generation sequencing data from molecular profiling. This makes the integration and structuralization of the data difficult or expensive, and leads to solutions that allow storing and processing the data in its native form. A big data lake is a central location in which users can store all their data in its native form, regardless of its source or format. A big data lake can be used as an environment for the development of in-depth analytics oriented toward fast decision making on the basis of raw data. Data analysis can be performed dynamically and in an ad hoc manner. The data need not be prepared and structuralized long before the beginning of the analysis. Usually, only a fraction of the data gets structuralized dynamically during the analysis and for the purpose of the analysis or possible secondary analyses performed later. This schema-on-read approach allows skipping expensive schematization of data, applying a schema as data are pulled out of a storage location, focusing on analyses driven by current needs, and keeping all the data even if just a part of the data constitutes a value for the analysis performed at the moment. As a consequence, data analysts are able to ask more powerful queries that translate data into more actionable insights. Such an approach speeds up the phase of data preparation before the data are analyzed and visualized, but it usually comes at the cost of losing the possibility of declarative retrieval of information, which is typically performed in relational databases through SQL-based querying. Moreover, it usually requires dedicated programmatic procedures, utilization of specific computational frameworks, like Apache Hadoop [23] or Apache Spark [24], for efficient processing of terabytes of data, and extensive computing resources that can be provisioned on demand from, e.g., the Cloud [25]. Declarative data manipulation in Big Data Lakes constitutes the second motivation for our works.

A. Related Works

World literature provides various solutions in terms of approximate processing of Big Data and declarative data manipulation. The first group of works is focused on declarative processing and transformation of data stored in relational databases and NoSQL databases by querying the databases with query languages, like SQL, containing fuzzy extensions. The declarative character of the SQL language encouraged scientists working with uncertainty and soft knowledge of experts to extend relational database management systems (RDBMSs) toward fuzzy data processing by implementing appropriate procedures and functions in the programming language that is native for the particular database management system (DBMS). Examples of such implementations are SQLf [26], FQUERY [27], Soft-SQL [28], the fuzzy Generalised Logical Condition [29], FuzzyQ [30], and fuzzy SQL extensions for relational databases [31], [32], [33], possibilistic databases [34], and data warehouses [35], [36]. The mentioned extensions to the SQL query language are noteworthy, as they deliver various fuzzy techniques for data exploration, like fuzzy filtering, fuzzy inference, generalization with fuzzy linguistic variables, and fuzzy grouping, which are also implemented in the solution presented in this paper. However, they do not address the problem of Big Data or any of the V characteristics, since they are mainly devoted to relational databases.

Recent trends in developing and using NoSQL databases naturally led to the transition of fuzzy techniques used in relational databases to the NoSQL model. Although NoSQL database systems are more specialized and domain-oriented, they are also more scalable to cope with large volumes of data. Works presented by Castelltort and Laurent [37], [38], [39] show fuzzy extensions, including fuzzy filtering and linguistic summaries, to the declarative Cypher query language used in querying Neo4j graph databases. Similar attempts can be found in [40], where the authors proposed a fuzzy NoSQL model to deal with large fuzzy databases hosted on the Neo4j graph database. In [41], Kacem and Touzi show how they extended the document-oriented MongoDB database and the Mongo Query Language toward flexible querying with linguistic labels. Those solutions focus mainly on the volume characteristic of Big Data, based on the assumption that NoSQL databases are highly scalable. However, no results of performance tests were provided in any of these works, and not all of the solutions support declarative querying.

The second group of works describes approaches devoted to the processing and analysis of Big Data with the use of fuzzy techniques. Most of the works concentrate on clustering and classification of Big Data with the use of computational frameworks, like Apache Hadoop and Apache Spark. In [42], [43], the authors showed the classification of big data using a Fuzzy K-Nearest Neighbor classifier, speeding up the classification process on Apache Hadoop. MapReduce-based procedures for the classification task were also developed for the fuzzy rule-based associative classifier presented in [44], the Fuzzy Rule Based Classification System (ChiFRBCS-BigData) proposed in [45], and its extension [46]. In [47], Segatori et al. show an efficient and scalable MapReduce-based implementation of Fuzzy Decision Trees used for fuzzy classification performed on Apache Spark. In [48], the authors propose a parallel implementation of the fuzzy minimals clustering algorithm (PFM), reporting a linear increase in performance. Reduction of the volume of the processed data is proposed in [49], where Li et al. apply two techniques, i.e., discretization of conditional attributes and fuzzification of class labels, to transform the original data set into a smaller one. There is

also a group of papers, like [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], which adopt fuzzy clustering methods, including various versions of the Fuzzy C-Means algorithm, to Big Data collections, very large databases, and data streams. Finally, several works, including [60], [61], [62], [63], show that Fuzzy Cognitive Maps and Linguistic Fuzzy Cognitive Maps are able to handle large-scale data in pattern classification applications. Elements of fuzzy sets theory are also used in (big) data preprocessing [64], information searching [65], managing data privacy [66], and data access [67]. Querying Big Data has been raised in [68], where Novikov et al. discuss an algebraic layer for complex query processing by using adaptive abstract operations based on the concept of fuzzy set, which are needed to support uniform handling of different kinds of similarity processing. However, no implementation was made.

It is also worth noting a few works that are devoted to the application of fuzzy sets theory in medical data analysis, since various branches of medicine deliver large volumes of data [15]. In [69], Sundharakumar et al. propose a cloud-based system containing a fuzzy inference module for monitoring the health of patients on the basis of the analysis of streams of body sensor data with the use of Storm (the real-time computation system). The system is hosted on a private cloud, which thereby ensures security and scalability, and the approach addresses mainly the velocity characteristic of Big Data. However, no performance results were presented. An interesting approach for the reduction of the volume of data is proposed in [70], where Azar and Hassanien define and show a new neuro-fuzzy classifier for dimensionality reduction in medical big data. The authors suggest that the proposed method simplifies the classification tasks by reducing the dimensionality of large data sets and by speeding up the learning process. In [71], Behadada et al. present a novel method to semi-automatically define fuzzy partition rules to provide a powerful and accurate insight into cardiac arrhythmia. The approach is based on text mining techniques applied to big data sets of freely available scientific papers provided by PubMed [72].

A broader classification of fuzzy set techniques used in Big Data processing is presented by Wang et al. in [73]. The wide spectrum of published works proves that the application of fuzzy techniques for data processing brings many advantages. Those of the presented solutions that are dedicated to Big Data usually assume reduction of the volume of processed data by using various fuzzy techniques while generalizing, grouping, and aggregating the data and classifying it or assigning it to clusters. However, they do not allow performing these operations in a declarative way. On the other hand, those solutions that enable declarative and approximate information retrieval are not dedicated to Big Data (SQL-based solutions), or need a dedicated storage model (NoSQL attempts).

B. Scope of this Work

In this paper, we present a novel, scalable, and universal solution for processing the data stored in a Big Data Lake with the use of fuzzy techniques and the declarative U-SQL query language. The presented solution allows exploring data coming from various domains by querying large data sets with the use of the declarative U-SQL language. We extended the U-SQL language for parallel querying by incorporating a variety of techniques that enable fuzzy data processing, including fuzzy searching, fuzzy transformation, linguistic terms assignment, and fuzzy grouping. These operations can now be performed on a large scale to transform the data, and thereby allow searching and grouping similar data on the basis of expert knowledge, reduce the volume of the data, and prepare the data for further analysis and visualization. Data processing is highly parallelized in the Extract, Process, and Store (EPS) process, which can be scaled out in order to adjust to the volume of data stored in the Data Lake. Storing and processing of the data is performed on the Microsoft Azure public cloud, which ensures large scalability of the presented approach. Performance and scalability of the proposed solution are tested with the use of several data sets containing up to one billion rows of medical data related to cardiac disease. The presented work directly addresses at least 3Vs that characterize Big Data, i.e., volume, velocity, and variety, and indirectly, veracity and value. Moreover, the paper extends the spectrum of existing works by: (1) Showing that declarative, fuzzy querying can be performed efficiently and effectively on big data sets when implemented in highly scalable environments, which is proved by the performed experiments (Sect. IV). (2) Enabling fuzzy techniques that allow processing and analyzing raw and schema-free data coming from various data sources (Sect. III). (3) Providing a formal definition of the Big Data Lake concept and the EPS process (Sect. II-A). (4) Providing formal definitions of the fuzzy operations performed on big data (Sect. II-B).

II. METHODS

Big Data Lake changes the way data are stored and managed within the IT infrastructure of an institution, and enables responding to the changing requirements for data analysis in various domains. While this shift often enables immediate access to the data, without waiting for them to be cleaned, modeled, structuralized, and loaded, it also requires changing the way data are prepared for the analysis.

A. Data Lake and EPS Process

A data lake allows quickly consolidating various types of data in one place. The data may come from various data sources, can be logically related, and can be stored in raw form — structured, semistructured, or unstructured.

Definition 1: Formally, we define a Data Lake as a pair:

DL = {V, M}, (1)

where V is a set of values in the Data Lake, and M is a set of metadata describing values in the Data Lake DL.

If:

∀v ∈ V ∃md ∈ M : fd(v) = md, (2)

where md represents metadata that describe the name of the attribute for the value v, and fd is a function assigning the name of the attribute to the value, then we call the data lake DL fully described in terms of attribute names.

If:

∀v ∈ V ∃mt ∈ M : ft(v) = mt, (3)

where mt represents metadata that describe the type of the value v, and ft is a function assigning the type to the value, then we call the data lake DL fully described in terms of data types.

If the data lake DL is fully described in terms of attribute names and in terms of data types, then we call the data lake DL completely described. If the data lake is completely described, it can usually be processed automatically without additional knowledge of experts. Otherwise, this knowledge is needed in various phases of processing the data stored in the data lake.

Definition 2: Extract, Process, Store (EPS) refers to a process of using data stored in a data lake by extracting appropriate information, structuralizing it, processing and querying it, and storing the information in a specific form (Fig. 1). Particular steps of the EPS process are defined in the following parts of this section.

[Fig. 1. Extract, Process, Store (EPS) over a Data Lake. Data extracted to rowsets R1 and R2 are processed and queried, and results are stored in output files.]

Definition 3: Extract: Data extraction is the first phase of the EPS process, in which data are read from the Data Lake and represented as one or several rowsets. A single rowset is a set of tuples and is equivalent to a relation in relational databases. The schema of the relation/rowset is defined dynamically, on read, i.e., it is neither predefined nor known a priori (schema on read). Although we do not define any additional constraints between any two rowsets, like foreign keys in relational databases, logical relationships may exist between rowsets created in the Extract phase.

A single extraction Ei is an operation that produces a rowset Ri(Ai1, Ai2, ..., Ain) with n attributes, based on values and metadata coming from the data lake DL, and optional transformation rules provided by an expert during the extraction step:

Ei(Vi, Mi, TEi) −→ Ri(Ai1, Ai2, ..., Ain), (4)

where Vi ⊂ V represents the subset of values extracted from the data lake DL in the extraction Ei; Mi ⊂ M represents the subset of metadata describing values extracted in the extraction Ei; and TEi is a set of transformation rules.

Transformations tE ∈ TEi can be applied to the extracted values or to the metadata, e.g., they can be used to supplement a missing name of an attribute or to define the type for data extracted from a partially described data lake, or to derive some values on the basis of other values in the data lake. They are used to properly read complex unstructured data, to clean and standardize the data, and finally to format the data as a rowset (e.g., see Sect. IV-B).

Definition 4: Process: Data processing covers all operations applied to the produced rowsets that allow converting the data from one form to another in order to meet the requirements of the final analysis or the visualization performed by a customer (consumer of the results). The set of operations may include:
• Selecting only certain attributes of the produced rowsets to be presented to the final user;
• Joining data from multiple rowsets and producing new rowsets;
• Filtering rowsets based on simple and complex search conditions;
• Deriving a new calculated attribute based on other attributes of processed rowsets;
• Sorting or ordering data based on selected attributes;
• Grouping and aggregating data;
• Labeling data;
• Looking up relevant data in other rowsets;
• Validating and cleaning data with custom procedures and functions.

Examples of various operations performed in the Process phase are shown in query scenarios SQ1–SQ4 in Sect. IV-C, and in Appendix B (SQ5) and Appendix C in the Supplement to the paper.

Definition 5: Store: Data storing is the last phase of the EPS process, in which data are read from the selected rowsets and sent to a place of destination. The place of destination can be a relational database, a simple delimited flat file, or a formatted file with a more complex structure.

If the destination place is a relational database, a single output Oi is an operation that, on the basis of the rowset Ri(Ai1, Ai2, ..., Ain) with n attributes, produces a relation RiDB with the same schema in the relational database:

Oi(Ri) −→ RiDB(Ai1, Ai2, ..., Ain). (5)

If the destination place is a file, a single output Oi is an operation that, on the basis of the rowset Ri(Ai1, Ai2, ..., Ain) with n attributes and optional transformation rules provided by an expert during the Store step, produces a data output file Fi with the data from the rowset:

Oi(Ri, TOi) −→ Fi, (6)

where TOi is the set of transformation rules applied to the stored data. It may comprise operations such as proper, custom formatting of the output file, or compression of the output file. An example of the Store phase is shown in query scenario SQ5 located in the Supplement to the paper.
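The EPS process and the completeness predicates of Definition 1 can be illustrated with a short sketch. This is our own language-agnostic illustration in Python, not the paper's U-SQL implementation; the toy lake layout, column metadata, filter predicate, and derived attribute below are all assumptions made for the example.

```python
# Illustrative sketch of Definitions 1-5: a toy data lake DL = {V, M},
# a completeness check per Eqs. (2)-(3), and a minimal
# Extract-Process-Store pass per Eqs. (4)-(6).
import csv
import io

# V: raw delimited text; M: metadata assigning an attribute name (f_d)
# and a data type (f_t) to each column, so this toy lake is
# 'completely described' in the sense of Definition 1.
V = "1;45;197\n2;67;248\n3;25;165\n"
M = [{"name": "id", "type": int},
     {"name": "age", "type": int},
     {"name": "cholesterol", "type": int}]

def completely_described(metadata):
    """Eqs. (2)-(3): every column has both a name and a type."""
    return all("name" in md and "type" in md for md in metadata)

# Extract (Eq. 4): schema on read - M supplies names and types only
# at the moment the raw values are read into a rowset.
def extract(values, metadata):
    rows = csv.reader(io.StringIO(values), delimiter=";")
    return [{md["name"]: md["type"](x) for md, x in zip(metadata, row)}
            for row in rows]

rowset = extract(V, M)

# Process (Def. 4): filter on a search condition and derive a new
# calculated attribute from an existing one.
processed = [dict(t, over_200=(t["cholesterol"] > 200))
             for t in rowset if t["age"] >= 40]

# Store (Eq. 6): output the rowset to a delimited file-like destination;
# the output transformation T_O here is just the choice of format.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(processed[0].keys()))
writer.writeheader()
writer.writerows(processed)
```

A partially described lake would fail the `completely_described` check, signaling that expert-provided transformation rules in TEi must supply the missing names or types before extraction.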

[Fig. 2. Definition of the high cholesterol fuzzy set and interpretation of fuzzy selection with the fuzzy search condition Cholesterol is high, showing two values of the Cholesterol attribute: 197, which satisfies the condition, and 248, which does not satisfy the condition for λ = 0.5.]

B. Fuzzy Extensions to the Process Phase

For fuzzy exploration of data stored in a data lake, we have extended a collection of standard operations performed on the generated rowsets. We will use the following sample rowset to exemplify the presented ideas throughout this section.

Patients =
Id Age Cholesterol
-- --- -----------
1  45  197
2  67  248
3  25  165
4  38  192
5  72  210

Definition 6: Fuzzy selection: Given a rowset R with n attributes:

R = {A1 A2 A3 ... An}, (7)

a fuzzy selection σ̃ is a unary operation that denotes a subset of a rowset R on the basis of a fuzzy search condition:

σ̃^λ_{Ai ≈ v}(R) = {t : t ∈ R, t(Ai) ≈^λ v}, (8)

where Ai ≈^λ v is a fuzzy search condition, Ai is one of the attributes of the rowset R for i = 1..n, n is the number of attributes of the rowset R, v is a fuzzy set represented by a membership function (e.g., young person, age near 30, normal blood pressure), ≈ is a comparison operator used to compare the crisp value of attribute Ai for each tuple t from the rowset R with the fuzzy set v, and λ is a minimum membership degree (threshold) for which the search condition is satisfied (Fig. 2).

The selection σ̃^λ_{Ai ≈ v}(R) denotes all tuples in R for which ≈ holds between the attribute Ai and the fuzzy set v with a membership degree greater than or equal to λ. Therefore,

σ̃^λ_{Ai ≈ v}(R) = {t : t ∈ R, µv(t(Ai)) ≥ λ}, (9)

where µv is the membership function of the fuzzy set v.

The fuzzy set v can be defined by various types of membership functions, including triangular, trapezoidal, and Gaussian. For example, the fuzzy selection with the search condition Cholesterol is high, for the fuzzy set high cholesterol defined as presented in Fig. 2 and λ = 0.5, produces the following rowset (σ̃^0.5_{Cholesterol ≈ high}(Patients)):

Id Age Cholesterol
-- --- -----------
1  45  197
5  72  210

Fuzzy selection can also be performed on the basis of compound fuzzy search conditions (or conditions mixed with crisp ones), e.g.:

σ̃^λ_{Ai ≈ vi θ...θ Aj ≈ vj}(R) = {t : t ∈ R, (t(Ai) ≈ vi θ...θ t(Aj) ≈ vj) ≥ λ}, (10)

where Ai, Aj are attributes of the rowset R, i, j = 1...n, i ≠ j; vi, vj are fuzzy sets; λ is the minimum membership degree for which the whole complex fuzzy condition is fulfilled; and θ can be any of the fuzzy operators of conjunction and disjunction (we implemented the Zadeh T-norm and S-norm [22], [74]). Therefore,

σ̃^λ_{Ai ≈ vi θ...θ Aj ≈ vj}(R) = {t : t ∈ R, (µvi(t(Ai)) θ...θ µvj(t(Aj))) ≥ λ}, (11)

where µvi, µvj are membership functions of the fuzzy sets vi, vj.

Definition 7: Extended projection with fuzzy transformation: Given a rowset R with n attributes:

R = {A1 A2 A3 ... An}, (12)

a projection with fuzzy transformation π̃ is a unary operation in which the attributes of the rowset R are restricted to the set {A1, A2, ..., T̃(Ai), ..., Ak}, where {A1, A2, ..., Ai, ..., Ak} ⊆ {A1, A2, A3, ..., An}, k ≤ n, and one (or more) of the attributes is subjected to a transformation T̃:

π̃_{A1 A2 ... T̃(Ai) ... Ak}(R) = {t[A1 A2 ... T̃(Ai) ... Ak] : t ∈ R, k ≤ n}, (13)

where t[A1 A2 ... T̃(Ai) ... Ak] is the restriction of the tuple t to the set {A1, A2, ..., T̃(Ai), ..., Ak}, so that:

t[A1 A2 ... T̃(Ai) ... Ak] = {(A, x) : x = t(A), A ∈ {A1, A2, ..., A′i, ..., Ak}, A′i = T̃(Ai)}. (14)

The transformation T̃ can be one of two specific types of transformations, T̃ ∈ {T̃µ, T̃L}:

• T̃µ(Ai) calculates the membership degree of the value t(Ai) of attribute Ai for each tuple of the rowset R to the fuzzy set v:

∀t ∈ R : T̃µ(t(Ai)) = µv(t(Ai)). (15)

• T̃L(Ai) assigns and returns a linguistic value l of the defined linguistic variable L = {l1, l2, ..., lm}, m ∈ N+, for the value of the attribute Ai from each tuple of the rowset R:

∀t ∈ R ∃l ∈ L : T̃L(t(Ai)) = l ∧ µl(t(Ai)) = max{µl1(t(Ai)), µl2(t(Ai)), ..., µlm(t(Ai))}. (16)

For example, the extended projection with fuzzy transformation (both types) of the Cholesterol attribute on the basis of the Cholesterol linguistic variable (ChV), defined as presented in Fig. 3, produces the following rowset.
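Definitions 6 and 7 can be sketched in a few lines of Python on the sample Patients rowset. This is an illustrative sketch, not the paper's Fuzzy Search Library: the trapezoidal membership function is our reading of the high cholesterol set in Fig. 2, and the elderly fuzzy set in the compound-condition example is a hypothetical set introduced only for illustration.

```python
def trapezoid(a, b, c, d):
    """Piecewise-linear membership: rises on [a, b], 1 on [b, c], falls on [c, d]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# 'high cholesterol' as read from Fig. 2: mu(197) = 0.7, mu(248) = 0.2.
high = trapezoid(190, 200, 240, 250)

# The sample Patients rowset as (Id, Age, Cholesterol) tuples.
patients = [(1, 45, 197), (2, 67, 248), (3, 25, 165), (4, 38, 192), (5, 72, 210)]

def fuzzy_select(rowset, attr, mu, lam):
    """Fuzzy selection (Eq. 9): keep tuples with mu_v(t(Ai)) >= lambda."""
    return [t for t in rowset if mu(t[attr]) >= lam]

selected = fuzzy_select(patients, 2, high, 0.5)
print([t[0] for t in selected])   # -> [1, 5], as in the worked example

# Extended projection with T_mu (Eq. 15): attach the membership degree.
degrees = [(t[0], round(high(t[2]), 1)) for t in patients]

# Compound condition (Eq. 11) with the Zadeh T-norm (min) as conjunction;
# 'elderly' is a hypothetical fuzzy set, not taken from the paper.
elderly = trapezoid(50, 65, 120, 121)
both = [t[0] for t in patients if min(high(t[2]), elderly(t[1])) >= 0.5]
print(both)                       # -> [5]
```

Swapping `min` for `max` in the last comprehension gives the Zadeh S-norm, i.e., disjunction of the two fuzzy conditions.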

π̃Id,Age,Cholesterol,T̃ChV(Cholesterol),T̃µChV(Cholesterol)(Patients) =

Id Age Cholesterol ChV       u
-- --- ----------- --------- ---
1  45  197         high      0.7
2  67  248         very high 0.8
3  25  165         normal    1.0
4  38  192         normal    0.8
5  72  210         high      1.0

[Figure 3: plots of the membership functions µCh(x) of the linguistic values normal, high, and very high over the Cholesterol axis, with breakpoints 190, 200, 240, and 250 and sample values 165, 192, 197, 210, and 248.]
Fig. 3. Assignment of sample values of the Cholesterol attribute to linguistic values (normal, high, and very high cholesterol) of a sample Cholesterol linguistic variable (ChV).

Definition 8: Fuzzy grouping: Fuzzy grouping performs classification of crisp values of a selected attribute Ak of the rowset R to particular groups represented by linguistic values of a predefined linguistic variable L = {l1, l2, ..., lm}, m ∈ N+. Each value from the rowset is then transformed by the fuzzy transformation T̃L(Ak) and represented by a fuzzy set, described by a membership function. The grouping algorithm compares each value from the rowset with particular values of the linguistic variable and produces values of the membership degree. A value (tuple) from the rowset is then assigned to the group, represented by a fuzzy set, for which the value of the membership degree is the highest among the membership degrees calculated for all linguistic values of the variable. Therefore, each value of the rowset is assigned to the linguistic value that it fits best.

Grouping algorithm: Given is a linguistic variable with a set of linguistic values L = {l1, l2, ..., lm}, m ∈ N+. Each linguistic value li, i = 1, ..., m is described by an arbitrarily defined membership function µi(x). For each value t(Ak) : t ∈ R, k < n, the algorithm takes the following steps:
1) ∀li, i = 1, ..., m, calculates a membership degree µi(t(Ak)),
2) Determines the maximum membership degree for the value t(Ak) on the basis of the membership degrees calculated in step 1:

   µmax(t(Ak)) = max_{i=1,...,m} {µi(t(Ak))}.   (17)

3) Assigns the value t(Ak) to the group with the maximum membership degree.

For example, fuzzy grouping by the Cholesterol attribute from the sample Patients rowset on the basis of the Cholesterol linguistic variable (ChV), defined as presented in Fig. 3, would produce three groups: normal cholesterol containing two tuples, high cholesterol with two tuples, and very high cholesterol with one tuple of the sample rowset (as shown in the example presented in Definition 7). Having the data grouped, users can aggregate the data with the use of standard or custom aggregation functions (e.g., see SQ4 in Sect. IV-C4).

III. IMPLEMENTATION

Particular mechanisms presented in Sect. II were implemented in the Azure Data Lake environment, a hyperscale repository for big data analytic workloads in the Microsoft Azure cloud.

A. Azure Data Lake

Azure Data Lake (ADL) is a scalable cloud environment for storage and analytics. It allows for interactive batch analysis of various types of data, including structured, semi-structured, and unstructured data, in real time [75]. The general architecture of Azure Data Lake is presented in Fig. 4. It is comprised of two main parts:
1) Data Lake Store (DLS) provides petabyte-scale, unlimited storage for the data lake. It distributes large data sets located in files across many storage servers, which allows read operations to be performed in parallel and improves read performance.
2) Data Lake Analytics (DLA) allows for efficient and scalable analysis of data stored in the Big Data Lake by parallelizing the analysis on a distributed infrastructure in the Azure cloud. It provides HDInsight services, including Hadoop and Spark, to analyze Big Data, and the U-SQL distributed execution environment for declarative processing and analysis of data.

[Figure 4: block diagram of Azure Data Lake: Data Lake Analytics (U-SQL, HDInsight, Fuzzy Search Library, Domain Specific Libraries) running on YARN over the Data Lake Store, which is accessed through WebHDFS and holds unstructured, semi-structured, and structured data.]
Fig. 4. General architecture of the Azure Data Lake service with fuzzy extensions (Fuzzy Search Library, marked in yellow) and Domain Specific libraries (marked in green).

U-SQL, which is used in our solution, is a big data query language that combines the declarative capabilities of the SQL query language with the expressive power of C# code. Data processing and transformation (the Process phase of EPS) is performed with the use of U-SQL query expressions, including the SELECT expression, and the PROCESS, REDUCE, and COMBINE expressions that apply custom or user-defined operators (UDOs). These query expressions produce rowsets


that can be assigned to rowset variables. Rowset variables are populated in the Extract phase, implemented by the EXTRACT U-SQL phrase with the use of an appropriate data extractor (which can be built-in, e.g., CSV or TSV, or custom, dedicated). Data output (the Store phase of the EPS) is implemented in the U-SQL OUTPUT phrase that uses an appropriate data outputter, which, as in the Extract phase, can be built-in or custom. Sample U-SQL scripts are presented in Sect. IV and in the Supplement to the paper. We have enabled fuzzy processing of the big data stored in Data Lake by developing fuzzy extensions to the U-SQL language available in DLA, and by providing a new set of operations in the Process phase of the EPS. These extensions are marked in yellow and green in the general architecture of Azure Data Lake presented in Fig. 4.

[Figure 5: diagram of the four modules of the Fuzzy Search Library: MembershipFunctions, Linguistics, Udfs, and Udl.]
Fig. 5. Main modules of the Fuzzy Search Library that allows fuzzy data processing in Data Lake.

B. Modules and Methods for Fuzzy Data Processing

Fuzzy transformations performed in the Process phase of the EPS process can be implemented in two ways: (1) as custom rowset processors in C# invoked in the U-SQL PROCESS expression, optionally accompanied by custom REDUCER and COMBINER expressions, or (2) as user-defined functions (UDFs) nested in U-SQL SELECT expressions. The second approach usually allows building more optimal U-SQL execution plans and was adopted in the presented solution. Common methods for fuzzy data processing, including fuzzy selection, extended projection with fuzzy transformation, fuzzy grouping, and defining linguistic variables and assigning data to linguistic variables, are assembled in the Fuzzy Search Library (FuzzySearchLib, Fig. 5) that we have developed for the Process phase. These methods are exposed to users as user-defined functions. Additional processing methods, custom extractors, and outputters specific for the domain of the processed data are implemented as Domain Specific Libraries.

The Fuzzy Search Library contains four main component modules that provide methods for fuzzy data processing:
• MembershipFunctions allows creating trapezoidal, triangular, and Gaussian membership functions to represent particular fuzzy sets or fuzzy values. The module also provides a dedicated Result data type that is usually used to represent a similarity degree when comparing a crisp value of a processed attribute of a rowset to a fuzzy set represented by an appropriate membership function. This comparison involves calculation of a value of the membership function for the crisp value of the attribute. In such cases, the Result data type is used as an intermediate data type, and it was implemented in order to overload the & and | operators to work according to the Zadeh t-norm and s-norm. Both operators are provided by this module for fuzzy selections containing complex search conditions joined according to the Zadeh t-norm and s-norm. They are non-overloadable for the standard double data type used to represent membership degrees, and for this reason, we had to use the intermediate Result data type to overload these logical operators.
• Linguistics allows creating linguistic variables and linguistic values. For any linguistic variable, this module also enables two important functions:
  – Values, which returns linguistic values of the linguistic variable,
  – Group, which allows for linguistic grouping.
• Udfs provides functions that are used while creating U-SQL queries, e.g., performing fuzzy selection, including:
  – Is, which assesses how much a given crisp value matches a specified linguistic value or a defined fuzzy set (overloaded function),
  – Around, which assesses how much a given crisp value matches a fuzzy value represented by a Gaussian membership function,
  – Value, which converts a value of a membership function (the Result object) to a numerical value.
• Udl (User dynamic linguistics) provides functions to operate on linguistic variables created dynamically in the U-SQL code, e.g.:
  – Get, which creates and returns a linguistic variable and its values on the basis of a formatted character string passed as an argument,
  – GetLV (Get Linguistic Value), which returns a specified linguistic value for a specified linguistic variable.

C. Defining Fuzzy Linguistic Variables

Fuzzy linguistic variables are objects that allow adding semantics to the processed data by assigning linguistic labels to elements of a universe and by providing a confidence of the assignment. Linguistic variables can be defined in two ways:
• they can be implemented as classes in the C# programming language and delivered by a domain-specific library extending U-SQL,
• they can be implemented dynamically in the U-SQL code.

Linguistic variables, together with linguistic values (terms), delivered by specific libraries are usually dedicated to particular domains of processed data. They can be predefined on the basis of expert knowledge. Programmatically, they are represented as classes in the C# programming language and are created through implementation of interfaces provided by the Fuzzy Search Library (in the Linguistics module). The capability of defining linguistic variables in this way is very beneficial: other users (developers) may independently deliver external libraries for U-SQL-based fuzzy data processing


with linguistic variables defined based on the knowledge of experts specific for particular data domains. This allows a collection of libraries for processing various types of data to be created independently. Sample C# code for defining linguistic variables is presented in Appendix D in the Supplement.

We have also made it possible to define linguistic variables directly in the U-SQL code. This allows for dynamically defining variables that are not known before the implementation of the Process phase. In the U-SQL code, they are represented as string variables formatted according to the grammar presented in Appendix E (Supplement). Dynamically defined linguistic variables are managed by methods available through the Udl module of the Fuzzy Search Library. The capability of defining linguistic variables in this way has two advantages:
• Dynamically defined linguistic variables can be created ad hoc in the U-SQL scripts.
• They do not need to be compiled into external libraries.

A sample definition of a linguistic variable created in this way is presented in Listing 1.

1 DECLARE @cholesterolVar string = "Cholesterol:Normal,T,0,0,190,200/High,T,190,200,240,250/VeryHigh,T,240,250,1000,1000";
Listing 1. Sample U-SQL code for ad hoc defining of the linguistic variable Cholesterol with its linguistic values.

IV. EXPERIMENTAL RESULTS

Our extensions for fuzzy data processing in Big Data Lake have been tested extensively in order to verify their functionality, effectiveness, and performance. In this section, we show sample U-SQL queries that demonstrate capabilities of the proposed extensions and the implementation of the EPS process. We also present results of performance tests that reveal the scalability of our solution for Big Data.

A. Data Set and Processing Environment

All tests of the presented solution were carried out with the use of the biomedical Heart Disease data set [76] from the UCI Machine Learning Repository. This data set contains 75 attributes, but we present our experiments by using a subset of them (mainly based on the Cleveland database). In order to test the performance, records were proliferated up to 100 million, 500 million, and 1 billion for various experiments. To examine the flexibility of fuzzy querying, during the data proliferation, existing values of particular attributes were multiplied by a randomly generated factor from the range [0;1] in order to differentiate the data and obtain similar, i.e., not identical, cases. Data proliferation and performance tests were conducted in the Azure Data Lake environment on the Microsoft Azure public cloud. Proliferated data were stored as CSV files in the Data Lake Store on the Cloud. U-SQL scripts were executed as distributed jobs in DLA, thus performing parallel data processing for all phases of the EPS.

B. Data Extraction

Data extraction (the Extract phase) was performed with the use of the code presented in Listing 2. This part of the U-SQL script extracts data by using the EXTRACT phrase (line 10) from the big data file (lines 8 and 13) stored in the CSV format. The built-in CSV extractor is used for this purpose (line 14). The extraction from the Big Data Lake produces the rowset @data (line 9), which is used in the Process phase of the EPS. The schema of the rowset is defined on read (lines 10–12) by specifying a column name and a C# type name per each extracted attribute of the Data Lake. Data extraction is preceded by loading the Fuzzy Search Lib and Heart Disease Data libraries into the script context (lines 1–2). The former exposes all fuzzy extensions to U-SQL (fuzzy modules described in Sect. III-B). The latter delivers domain-specific fuzzy linguistic variables used in data processing. Short names for modules exposed by both libraries are mapped in lines 3–5 to simplify the syntax of U-SQL queries used in the Process phase.

1  REFERENCE ASSEMBLY HeartDisease.HeartDiseaseData;
2  REFERENCE ASSEMBLY HeartDisease.FuzzySearchLib;
3  USING Udfs = FuzzySearchLib.Udfs;
4  USING Variables = HeartDiseaseData.Variables;
5  USING Udl = FuzzySearchLib.Udl;
6
7  // Extract data
8  DECLARE @dataPath string = "/in/HeartDisBig.csv";
9  @data =
10     EXTRACT Id string, Age int, Sex int, ...,
11         Cholesterol int, CigarettesPerDay int, ...,
12         RestHr int, ..., Disease bool
13     FROM @dataPath
14     USING Extractors.Csv();
Listing 2. A part of a U-SQL script extracting data and producing the @data rowset.

A part of the rowset produced by the presented U-SQL script has the following form (column headers were formatted manually).

Id          Age Sex    Chol-l Cig-PerDay    RestHR    Disease
----------- --- ---    ------ ----------    ------    -------
474a1bce-.. 57  M   .. 192    75         .. 86     .. True
d3258689-.. 53  M   .. 203    20         .. 86     .. False
3ee478ee-.. 48  M   .. 229    0          .. 75     .. False
e5724521-.. 54  M   .. 239    20         .. 86     .. True
33be448a-.. 49  M   .. 188    30         .. 78     .. False
93e06584-.. 64  M   .. 211    30         .. 58     .. True
8e4d191d-.. 66  F   .. 226    0          .. 49     .. True
...

C. Examples of U-SQL Queries Extended with Fuzzy Processing Capability

In this section, we show the use of various fuzzy techniques in data transformations performed in the Process phase. The presented sample query scenarios (SQ1–SQ4) allow understanding how the Fuzzy Search Library can be used to process data in Data Lake.

1) Fuzzy selection with the use of fuzzy values (SQ1): This simple example shows filtering with the use of a fuzzy condition built on the basis of a fuzzy value. In this example, we perform the following operation: Retrieve patients whose age is around 50 years old. The U-SQL query implementing this operation is shown in Listing 3.

1 @age =
2 SELECT Id, Age,
3     Udfs.Around(Age, 50, 45).Value AS MemberDg
4 FROM @data
5 WHERE Udfs.Around(Age, 50, 45).Value >= 0.5;
Listing 3. Sample U-SQL query retrieving patients whose age is around 50 years old.
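The semantics of the Around function in Listing 3 can be sketched in a few lines of Python. This is an illustrative analogue, not the library's C# implementation; it assumes only what is stated for SQ1: a Gaussian membership function centred at the midpoint, with the spread chosen so that the degree at the left crossover point equals 0.5.

```python
import math

def around(x, mid, left_crossover):
    """Illustrative analogue of Udfs.Around (assumed Gaussian parametrization).

    The spread is chosen so that the degree at left_crossover is 0.5;
    by symmetry, the right crossover lies at mid + (mid - left_crossover).
    """
    w = mid - left_crossover                # crossover distance, 5 for (50, 45)
    two_sigma_sq = w * w / math.log(2.0)    # solves exp(-w**2 / (2*sigma**2)) = 0.5
    return math.exp(-((x - mid) ** 2) / two_sigma_sq)

# Degrees for some of the ages appearing in the SQ1 result rowset
degrees = {age: round(around(age, 50, 45), 3) for age in (53, 48, 50, 45)}
# e.g. degrees[53] = 0.779 and degrees[48] = 0.895, as in the rowset shown below
```

With these assumptions, the sketch matches the degrees reported for SQ1, and the right crossover point, where the degree again equals 0.5, falls at 55.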


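The grouping algorithm of Definition 8 and the formatted-string variables of Listing 1, on which the later query scenarios rely, can be illustrated with a short Python sketch. It is an illustrative analogue of the Udl.Get and Group functions, not the library's C# code, and it assumes that the label 'T' in the string format denotes a trapezoidal term with four breakpoints.

```python
from collections import Counter

def trapezoid(a, b, c, d):
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

def parse_variable(spec):
    """Parse a 'Name:Label,T,a,b,c,d/...' string (the format of Listing 1)
    into the variable name and a dict of label -> membership function.
    Illustrative analogue of Udl.Get (assumed semantics), not library code."""
    name, body = spec.split(":", 1)
    values = {}
    for term in body.split("/"):
        parts = [p.strip() for p in term.split(",")]
        label, kind = parts[0], parts[1]
        assert kind == "T"  # this sketch supports trapezoidal terms only
        values[label] = trapezoid(*map(float, parts[2:6]))
    return name.strip(), values

def group(x, values):
    """Definition 8: assign a crisp value to the linguistic value
    with the highest membership degree."""
    return max(values, key=lambda label: values[label](x))

name, chv = parse_variable(
    "Cholesterol:Normal,T,0,0,190,200"
    "/High,T,190,200,240,250/VeryHigh,T,240,250,1000,1000")

# Fuzzy grouping of the sample cholesterol values from Definition 8
counts = Counter(group(x, chv) for x in (197, 248, 165, 192, 210))
```

Applied to the sample cholesterol values 197, 248, 165, 192, and 210, the sketch reproduces the three groups of the Definition 8 example: two normal, two high, and one very high tuple.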
This U-SQL query processes and transforms the rowset @data to the rowset @age containing two attributes of the source data (Id and Age) and the computed membership degree (MemberDg). Selection in the WHERE clause (line 5) is based on the fuzzy condition that uses the Around function from the Udfs module of the Fuzzy Search Library to represent the fuzzy set age around 50 with the use of a Gaussian membership function. The third argument of the Around function (45) is the left crossover point, i.e., the element of the universe, left of the midpoint (50), whose membership value is equal to 0.5. The right crossover point is calculated automatically; e.g., for the given example the right crossover point is 55. The query retrieves only those rows that satisfy the fuzzy search condition with a minimal membership degree greater than or equal to 0.5 (λ = 0.5). Results produced by the query have the following form.

Id           Age MemberDg
------------ --- --------
d3258689-... 53  0.779
3ee478ee-... 48  0.895
e5724521-... 54  0.641
8fba1e44-... 50  1.000
1e0340c4-... 45  0.500
...

2) Fuzzy selection with the use of linguistic variables (SQ2): In this example, filtering is performed on the basis of a fuzzy search condition that makes use of a linguistic variable. We perform the following operation: Retrieve data for patients whose cholesterol level is normal. The U-SQL query implementing this operation is shown in Listing 4.

1 @normal =
2 SELECT Id, Age, Sex, RestBP, Cholesterol, Disease
3 FROM @data
4 WHERE Udfs.Value(Udfs.Is(Cholesterol,
5     Variables.Cholesterol.Values.Normal)) >= 0.5;
Listing 4. Sample U-SQL query retrieving patients whose cholesterol level is normal.

This U-SQL query processes and transforms the rowset @data (line 3) and produces the rowset @normal (line 1) containing selected attributes of the rowset @data (line 2). Selection in the WHERE clause (lines 4–5) is based on the fuzzy condition that uses the linguistic variable Cholesterol (invocation: Variables.Cholesterol, line 5) and its Normal linguistic value (invocation: Variables.Cholesterol.Values.Normal, line 5), all defined in an external, domain-specific library called HeartDiseaseData, implemented in the C# programming language. The Udfs.Is function (line 4) is used to assess how much a crisp value of the Cholesterol attribute (line 4) of the @data rowset matches the Normal linguistic value of the linguistic variable Cholesterol (line 5) for each row of the processed rowset. The Udfs.Value function (line 4) from the Udfs module of the Fuzzy Search Library is invoked in order to convert the Result object returned by the Udfs.Is function to a numerical value that can be compared with the minimal membership degree λ, which qualifies rows for retrieval. The produced rowset has the following form.

Id           Age Sex RestBP Chlrl Disease
------------ --- --- ------ ----- -------
474a1bce-... 57  M   140    192   True
33be448a-... 49  M   120    188   False
ca67a6dc-... 40  M   110    167   False
f5e38c59-... 46  F   142    177   True
8b0b7450-... 62  F   160    164   False
9d252762-... 51  M   110    175   True
...

3) Extended projection with fuzzy transformation and fuzzy selection (SQ3): This example shows extended projection with the T̃µ fuzzy transformation, filtering on the basis of a complex search condition, and the use of the overloaded disjunction operator | (working according to the Zadeh s-norm). In this example, we perform the following operation: Retrieve data for patients whose resting heart rate is good or competitive (athlete's). The U-SQL query implementing this operation is shown in Listing 5.

1 @hr =
2 SELECT Id, Age, RestHr,
3     (Udfs.Is(RestHr, Variables.RestHrCondition.Values.Good) |
4      Udfs.Is(RestHr, Variables.RestHrCondition.Values.Competitive)).Value AS MemberDg
5 FROM @data
6 WHERE Udfs.Value(Udfs.Is(RestHr,
7     Variables.RestHrCondition.Values.Good) |
8     Udfs.Is(RestHr, Variables.RestHrCondition.Values.Competitive)) >= 0.2;
Listing 5. Sample U-SQL query retrieving data for patients whose resting heart rate is good or competitive (athlete's).

This U-SQL query processes and transforms the rowset @data (line 5) and produces the rowset @hr (line 1) containing selected attributes of the rowset @data (Id of the patient, age, and resting heart rate RestHr, line 2), and the degree of membership of the crisp value of the resting heart rate (RestHr) for a particular row to the fuzzy set Good or Competitive, defined as linguistic values (lines 3–4). Selection in the WHERE clause (line 6) is based on the complex fuzzy search condition that uses the overloaded disjunction operator |. Moreover, it makes use of the linguistic variable RestHrCondition (lines 7–8) and its Good and Competitive linguistic values, defined in an external, domain-specific library HeartDiseaseData, implemented in the C# programming language. The rowset generated by the query has the following form.

Id           Age RestHR MemberDg
------------ --- ------ --------
93e06584-... 64  58     1.0
8e4d191d-... 66  49     1.0
9a249302-... 43  67     0.6
0b72a9cf-... 55  69     0.2
0a8a261d-... 65  68     0.4
...

4) Fuzzy grouping with the use of a dynamically defined linguistic variable (SQ4): This example focuses on fuzzy grouping, which is performed with the use of a linguistic variable. Unlike in SQ3, here the linguistic variable and its linguistic values are defined dynamically in the U-SQL code. We also show extended projection with the T̃L fuzzy transformation. In this example, we perform the following operation: Show me the report on the number of patients grouped by the number of cigarettes smoked per day. The U-SQL query implementing this operation is shown in Listing 6.

1 DECLARE @cigarettes string = "cigs:None,T,0,0,0,0/Little,T,0,1,3,5/Average,T,4,5,9,10/Much,T,9,10,20,22/VeryMuch,T,18,20,100,200";
2 @groups = SELECT
3     Udl.Get(@cigarettes).Group(CigarettesPerDay)
4     AS CigsPerDay, COUNT(Id) AS NoOfPatients
5 FROM @data


6 GROUP BY Udl.Get(@cigarettes).Group(
7     CigarettesPerDay);
Listing 6. Sample U-SQL code showing the statistics on the number of patients grouped by the number of cigarettes smoked per day according to a linguistic variable defined dynamically.

This U-SQL query processes and transforms the rowset @data (line 5) and produces the rowset @groups (line 2) containing the name of the group, which is one of the linguistic values returned by the T̃L fuzzy transformation (line 3), and the number of patients in the particular group (line 4). Grouping (line 6) is performed on the basis of the cigs linguistic variable defined ad hoc in the U-SQL script (@cigarettes, line 1). The Udl.Get function (invoked twice, in lines 3 and 6), from the Udl module of the Fuzzy Search Library, is used to extract the linguistic variable and its values from the U-SQL string variable @cigarettes. Then, the Group function (from the Linguistics module of the Fuzzy Search Library) returns the best matching linguistic value for the crisp value of the CigarettesPerDay attribute of each row of the processed rowset. The query generates the following rowset.

CigsPerDay NoOfPatients
---------- ------------
None        390,970,972
Little       37,623,539
Average      43,628,778
Much        134,500,127
VeryMuch    291,459,274

D. Effectiveness of Fuzzy Querying

Fuzzy querying intends to provide more flexibility while retrieving relevant data from a data source, like a database or a data lake. The effectiveness of using the presented fuzzy techniques in querying big data has been examined and compared to traditional query methods. In order to examine the effectiveness, we used several measures, including: selectivity, relative cardinality, flexibility, fuzzy precision, confidence, and uncertainty. All used measures are defined in Appendix A in the Supplement to the paper.

TABLE I
EFFECTIVENESS OF FUZZY QUERYING IN SQ1 FOR VARIOUS λ

λ     Selc  RCard Flex Conf  Uncr  FPrec
1.0   0.016 0.016 0.00 1.000 0.000 1.000
0.9   0.048 0.047 0.67 0.340 0.660 0.982
0.8   0.080 0.076 0.80 0.212 0.788 0.947
0.7   0.112 0.100 0.86 0.160 0.840 0.900
0.6   0.143 0.120 0.89 0.133 0.867 0.843
0.5   0.174 0.136 0.91 0.118 0.882 0.782
0.4   0.174 0.136 0.91 0.118 0.882 0.782
0.3   0.204 0.147 0.92 0.109 0.891 0.720
0.2   0.235 0.155 0.93 0.104 0.896 0.660
0.1   0.294 0.163 0.95 0.098 0.902 0.555
0.0   1.000 0.167 0.98 0.096 0.904 0.167
Crisp 0.016 0.016 0.00 1.000 0.000 1.000

TABLE II
EFFECTIVENESS OF FUZZY QUERYING IN SQ2 FOR VARIOUS λ

λ     Selc  RCard Flex Conf Uncr FPrec
1.0   0.305 0.305 0.00 1.00 0.00 1.000
0.9   0.308 0.308 0.01 0.99 0.01 0.999
0.8   0.312 0.311 0.02 0.98 0.02 0.996
0.7   0.316 0.313 0.03 0.97 0.03 0.993
0.6   0.319 0.316 0.05 0.97 0.03 0.988
0.5   0.323 0.318 0.06 0.96 0.04 0.983
0.4   0.327 0.319 0.07 0.95 0.05 0.976
0.3   0.331 0.320 0.08 0.95 0.05 0.968
0.2   0.334 0.321 0.09 0.95 0.05 0.960
0.1   0.338 0.321 0.10 0.95 0.05 0.950
0.0   0.338 0.321 0.10 0.95 0.05 0.950
Crisp 0.305 0.305 0.00 1.00 0.00 1.000

The effectiveness of using fuzzy techniques in querying big data in the data lake with the use of queries presented in query scenarios SQ1–SQ3 for various λ thresholds is presented in Tables I–III. Additionally, the last row in each of the Tables I–III contains values of the same measures for crisp counterparts of the fuzzy queries (e.g., crisp age = 50 vs. fuzzy age around 50). As can be observed, with the decreasing value of the λ-cut threshold, the queries become more flexible, which is visible in the growing selectivity (Selc) and relative cardinality (RCard). For λ = 1.0 the selectivity and relative cardinality are identical, since only rows with µRO(t) = 1.0 qualify for the output rowset RO. Then, their values become increasingly divergent. Large differences suggest that rows with low membership degree are qualified for the output rowset. The flexibility of fuzzy queries grows together with the decreasing λ values, and the degree of flexibility depends on the definition of the fuzzy set v used in the fuzzy search condition, the λ itself, and the input rowset (source data). For example, the flexibility of the fuzzy query presented in SQ1 reaches a high level of Flex = 0.67 (Table I) already for λ = 0.9, since the age around 50 fuzzy set is defined with the use of a Gaussian membership function and there are just several rows that satisfy the search condition for λ = 1.0 (equivalent to age = 50). For the queries presented in SQ2 and SQ3, the flexibility degree hardly reaches 6% (Flex = 0.06) for λ = 0.5, and therefore, these queries provide less flexibility with respect to their crisp counterparts. Uncertainty of the output rowset increases, and the confidence of the rowset decreases, proportionally with the growing flexibility and decreasing λ. Low values of the uncertainty and high values of the confidence (SQ2) suggest that the output rowset contains mainly rows with the highest possible membership degree. High values of precision in SQ2 and SQ3 mean that each row of the output rowset is highly relevant (contributes with a high membership degree, regardless of whether it is the highest or just close to the highest membership degree). The precision may drop significantly for queries with very relaxed search conditions, like the query presented in SQ1 for λ = 0.1, which indicates that the output rowset contains many rows with low membership degrees.

This shows that fuzziness may extend the rowset, and we can control the degree of flexibility by appropriately choosing the fuzzy sets in the fuzzy search conditions and by using a properly selected λ similarity threshold. By default, the λ is usually set to 0.5. However, in real-world applications, setting the value of the parameter depends on our knowledge of the data that are


processed (e.g., the distribution of values of an attribute) and the purpose of processing. If the user is rather precise in terms of the set of results and given search conditions, but still wants to obtain a few similar cases, he should start with high values of the λ similarity threshold, e.g., λ = 0.9 or λ = 0.95. Then, the value of the parameter can be decreased to relax the search conditions and include more rows with similar values of an attribute. If the purpose of the fuzzy querying is a preselection, a user may start with lower values of the λ parameter (e.g., λ = 0.5 or lower), and then gradually increase the value, and decide whether the outcome is properly narrowed. On the other hand, grouping with the use of linguistic variables may narrow the rowset to a representative form that is user-friendly and suitable for presentation, visualization, or further analysis. For example, the query presented in SQ4 returns only five rows (five groups of records, card(RF) = 5) on the basis of an arbitrarily defined linguistic variable with five linguistic values for the number of cigarettes smoked per day. Crisp grouping by the CigarettesPerDay attribute returns 197 groups (card(RC) = 197), which is much more, even if the number of cigarettes that a person is able to smoke per day (a domain) is limited.

TABLE III
EFFECTIVENESS OF FUZZY QUERYING IN SQ3 FOR VARIOUS λ

λ     Selc  RCard Flex  Conf  Uncr  FPrec
1.0   0.412 0.412 0.000 1.000 0.000 1.000
0.9   0.412 0.412 0.000 1.000 0.000 1.000
0.8   0.426 0.423 0.031 0.975 0.025 0.994
0.7   0.426 0.423 0.031 0.975 0.025 0.994
0.6   0.439 0.431 0.061 0.957 0.043 0.982
0.5   0.439 0.431 0.061 0.957 0.043 0.982
0.4   0.453 0.437 0.089 0.945 0.055 0.964
0.3   0.453 0.437 0.089 0.945 0.055 0.964
0.2   0.466 0.439 0.115 0.939 0.061 0.942
0.1   0.466 0.439 0.115 0.939 0.061 0.942
0.0   0.466 0.439 0.115 0.939 0.061 0.942
Crisp 0.412 0.412 0.000 1.000 0.000 1.000

E. Performance Tests

We have also conducted a series of tests in order to examine the performance of the presented solution. For the clarity of

processing in Azure Data Lake with U-SQL query scenarios SQ1–SQ5. As can be observed, with the growing number of allocation units, the time needed to complete the whole EPS process for all query scenarios SQ1–SQ5 drops significantly (the time is plotted on a log10 scale). For example, for query scenario SQ4, the execution time was reduced from 3 hours and 35 minutes, exactly 12,910 seconds (on one AU), to 185 seconds (on 80 AUs). The eighty allocation units that were in use were the maximum number of AUs needed to process the data set, since the size of the data set did not require engaging more computing power. It is worth noting that the execution time curves for particular query scenarios SQ1–SQ5 follow the same pattern: the dynamics of the computation speed is not constant and, interestingly, changes in the same way for all query scenarios. This is caused by the allocation of compute units to particular portions of data that are processed in parallel. The largest of the data sets that we used (10^9 rows) was always divided into the same number of 80 chunks (the number depends on the size of data) that were processed by various numbers of AUs. Each data chunk is processed by a single AU. The allocation of data chunks to the available compute units was then more or less optimally adjusted, which resulted in variable dynamics of the computation speed. When using 80 AUs, each data chunk is assigned to one available AU, and the processing of the whole data set is performed in one iteration of computations (80 AUs process 80 data chunks). However, processing 80 data chunks with the use of 40 AUs requires two iterations: 40 data chunks are processed by 40 AUs in each of the two iterations of computations. Processing the same data set with the use of 60 AUs still requires two iterations: 60 data chunks are processed by 60 AUs in the first iteration, and 20 data chunks by 20 AUs in the second iteration, leaving 40 AUs idle. Therefore, the execution time remains the same for processing scenarios performed with 40 to 70 AUs, as shown in Fig. 6, since they all need two iterations for data processing. This dependency is less visible if the number of AUs is much smaller than the number of data chunks that must be processed.

In order to better illustrate the results, we show the acceleration of fuzzy data processing when scaling the EPS process in Azure Data Lake from 1 to 80 AUs (Fig. 7). The acceleration is calculated according to eq. 18 and obtained by comparing the execution times to the case in which a particular query scenario
presentation, we show results for query scenarios SQ1–SQ5 SQ1–SQ5 was performed by using only one AU.
explained in Sect. IV-C and in the Supplement. They reflect
T1
a general tendency that we observed, while processing data Sd = , (18)
from Data Lake in the presented environment of the computing Td
cloud. In all cases, performance was assessed on the basis of where d is the number of AUs in use, T1 is the execution time
execution time measurements collected when performing the obtained while performing the U-SQL query scenario with the
whole EPS process (including all phases). We carried out at use of one AU, and Td is the execution time obtained while
least three replicas for each measurement. Then, the obtained performing a particular query scenario with the use of d AUs.
results were averaged, and the averaged values are presented As can be observed in Fig. 7, the speedup is sublinear,
in this section. Averaged values of measurements were also which results from the fact that not all component steps of
used to determine n-fold speedups when scaling computations the ADL job can be equally parallelized — in some steps
in Azure Data Lake. some AUs were idle. All speedup curves follow the same
Fig. 6 shows how the execution time depends on the trend showing that adjusting the degree of parallelism to the
executed query and the number of compute nodes, called data size (and the number of data chunks) by assigning an
allocation units (AUs), when performing parallel, fuzzy data appropriate number of AUs for EPS job execution allows the

1063-6706 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2018.2812157, IEEE
Transactions on Fuzzy Systems
12

100000 1600
SQ1 SQ1
1400
SQ2 SQ2
1200
SQ3 SQ3

Execution time (s)


Execution time (s)

10000
SQ4 1000 SQ4
SQ5 800 SQ5

600
1000
400

200

100 0
0 10 20 30 40 50 60 70 80 90 0 200 400 600 800 1000 1200
# AUs (compute nodes) # rows (millions)

Fig. 6. Execution time for varying number of allocation units (AUs) for Fig. 8. Execution times for varying volume of data (millions rows of data)
parallel, fuzzy data processing in Azure Data Lake with U-SQL query in the extracted data set for parallel, fuzzy data processing in Azure Data
scenarios SQ1–SQ5. Time plotted on log10 scale. Tests performed with Lake with U-SQL query scenarios SQ1–SQ5. Tests performed with nine
a biomedical data set (extracted rowset) containing a billion (109 ) rows. allocation units (#AUs=9).

90.00 that should be processed in the EPS process, the execution


SQ1 time increases linearly for all query scenarios, and the slope
80.00 SQ2 of the lines depends on the query scenario.
SQ3
70.00
SQ4 F. Coping with Various Data Formats
SQ5 Variety refers to many different formats in which data can
60.00
ideal be stored. Access to data stored in a variety of formats is
n-fold speedup

50.00 performed in the Extract phase of the EPS process. By default,


Data Lake Analytics provides three types of built-in extractors
40.00 to generate rowsets from the input files (as well as three types
of outputters to store data after processing):
30.00
• Text — allows extraction from delimited text files of

20.00 different encodings,


• CSV — allows extraction from comma-separated value
10.00 (CSV) files of different encodings,
• TSV — allows extraction from tab-separated value (TSV)
0.00 files of different encodings.
0 10 20 30 40 50 60 70 80 90
# AUs (compute nodes) Sample usage of the CSV extractor in the USING clause of
the EXTRACT U-SQL expression was presented in Listing 2
Fig. 7. n-fold speedup for varying number of allocation units (AUs) for (Sect. IV-B). Processing data stored in more complex formats
parallel, fuzzy data processing in Azure Data Lake with U-SQL query requires custom extractors that must be specific for a particular
scenarios SQ1–SQ5. Tests performed with a biomedical data set (extracted
rowset) containing a billion (109 ) rows.
format. Custom extractors can be delivered in the Domain
Specific Libraries dedicated for particular data domains. In
Appendix C in the Supplement to this paper we show how
best use of available computational resources. The best n-fold the Variety feature of Big Data is supported by domain-
speedup of 69.78 was achieved for the query scenario SQ4 specific user-defined extractors in two scenarios related to the
when scaling the execution from 1 to 80 AUs. The worst n- processing biological data.
fold speedup of 49.29 was achieved for query scenario SQ5
for the same scaling scheme, but it is still much better to scale V. D ISCUSSION AND C ONCLUDING R EMARKS
data processing than perform it on a single compute unit. Extending big data analytics with the possibility of fuzzy,
Fig. 8 shows how the solution and the entire compute declarative, and scalable querying is very important in the face
environment behave in terms of execution time increase, under of growing volumes of data that are collected by companies,
the growing amount of data for all query scenarios SQ1–SQ5. research centers, economic entities, and other institutions.
Tests were performed with the use of nine AUs for all data sets, This can be especially beneficial for those domains, in which
as this was the maximum number of AUs that were needed for decisions must be made ad hoc on the basis of large data sets
processing the smallest data set (containing 100 million rows with uncertainty, collected in Big Data Lake in native formats,
of data). As can be noticed, with the growing number of rows without any pre-processing or schematizing.
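The chunk/iteration behavior and the speedup of eq. (18) reported in the performance tests above can be condensed into a small sketch (Python, illustrative only — it models the one-chunk-per-AU-per-iteration allocation described in the text, not Azure internals; the chunk counts and timings are the ones reported in this section):

```python
import math

def iterations(n_chunks: int, n_aus: int) -> int:
    """Number of sequential iterations when n_chunks data chunks are
    processed by n_aus allocation units, one chunk per AU per iteration."""
    return math.ceil(n_chunks / n_aus)

def speedup(t1: float, td: float) -> float:
    """n-fold speedup S_d = T_1 / T_d from eq. (18)."""
    return t1 / td

# 80 chunks: every AU count from 40 to 70 still needs two iterations,
# which is why the execution times plateau in that range (Fig. 6),
# while 80 AUs finish in a single iteration.
assert [iterations(80, d) for d in (40, 60, 70, 80)] == [2, 2, 2, 1]

# The reported drop from 12,910 s (one AU) to 185 s (80 AUs) yields the
# best observed n-fold speedup.
print(round(speedup(12910, 185), 2))  # 69.78
```

The plateau follows directly from the ceiling in `iterations`: adding AUs only helps once it removes a whole iteration.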


The solution that we have developed and presented in the paper satisfies these requirements. It allows performing the full EPS process over the Big Data Lake, in which data are extracted, processed, and transformed, and finally stored in the selected place and format, suitable for decision makers. The Fuzzy Search Library for Azure Data Lake makes it possible to formulate imprecise search criteria, find similar cases, incorporate domain-specific expert knowledge into the data analysis, benefit from linguistic terms frequently used in real life, and change the granularity of data by fuzzy data grouping. Extensions developed for the U-SQL language enable declarative processing of the data through querying rowsets produced during data extraction. Finally, the implementation on the Azure Data Lake located on the cloud guarantees wide scaling capabilities, limited only by the possessed Azure subscription.

This work complements the collection of existing solutions, presented in Sect. I-A, by delivering a new functionality for big data analytics. The novelty of the presented solution lies mainly in the application of techniques of fuzzy querying in big data environments, which meets the requirements of modern information processing and big data challenges. Although several fuzzy extensions were previously developed for the declarative SQL language and relational database management systems, they were not dedicated to processing big data and, like some NoSQL solutions, rely on a strictly defined schema or format. Our approach allows defining the schema on read during the data extraction, and querying the produced rowsets in parallel on a highly scalable cloud infrastructure. On the other hand, those solutions presented in Sect. I-A that were dedicated to processing Big Data were mainly focused on data clustering and classification with the use of MapReduce-based procedures executed on Hadoop or Spark clusters. Both Hadoop and Spark are also available as the HDInsight service in Data Lake Analytics. However, we decided to provide the scalability of our solution by extending the functionality of U-SQL code, which enables a declarative way of data processing and can also be highly parallelized on many allocation units (compute nodes). The functionality of our Fuzzy Search Library for Data Lake Analytics also highly extends the spectrum of methods used for processing big data. Our solution does not directly extend the set of fuzzy techniques that can be used, in general, but it extends the set of operations that can be performed on large, schema-free (variable) data sets, elevating the value of the outcome through the enrichment provided by the fuzzy techniques that are applied.

The presented Fuzzy Search Library for Data Lake Analytics has many functional advantages over the existing solutions, which are inherited from the novel technologies that it is based on. One of the most important advantages is the theoretically unlimited scalability that allows quick adaptation to the growing volume of data, which is impossible to achieve with relational databases that provide similar capabilities of fuzzy querying. The second one is the simplicity of scaling the distributed fuzzy querying process without the necessity to reconfigure the whole execution (hardware and software) environment. In contrast to Hadoop/Spark-based solutions and NoSQL databases, which in the case of scaling would require reconfiguration of the whole cluster (e.g., adding new cluster nodes), the number of AUs in Azure Data Lake, i.e., the degree of parallelism, is flexibly controlled by a single runtime parameter. For a U-SQL developer, it looks as if he had a huge computer cluster with lots of nodes, and he decides how many of them should be used when he executes the EPS query job. Unlike other solutions, the presented one provides the capability to process not only structured, but also unstructured and schema-free data stored in the native format in the big data lake through a unified interface of extractors that represent the data as a rowset. Our solution also simplifies fuzzy processing and transformations of data by using declarative U-SQL SELECT expressions, in contrast to Hadoop/Spark-based processing, which requires adaptation to a particular processing model, e.g., MapReduce, and the implementation of dedicated processing functions. Fuzzy techniques provided in the form of functions (UDFs in modules of FSL4DLA) that are invoked directly from U-SQL queries make it possible not only to process data in a declarative way, but also to optimize U-SQL execution plans, use fewer allocation (compute) units, and decrease the cost of data processing. Modularization of the FSL4DLA and its separation from the domain knowledge facilitate the extension of the functionality of the presented solution with new libraries of user-defined functions. Finally, there is no need to maintain any (relational/NoSQL) database cluster or (Hadoop/Spark) computational cluster, either on premises or on the cloud, since Azure Data Lake is maintained by the cloud provider. Moreover, U-SQL developers pay only for the computation time related to U-SQL query execution on the cloud and the number of used allocation units, not for merely keeping the cluster alive even if nothing is processed on it.

It is also worth mentioning how our DLA-based solution solves the problem of Big Data and addresses the Vs characteristics. By massive parallelization of the computations related to data extraction, processing, and storing on the Cloud, we can largely reduce the time of data processing without decreasing the workload. This task parallelism addresses the volume characteristic of Big Data. The part of the Big Data Lake that is currently being processed is divided into smaller pieces, then each such piece is processed separately in parallel, and the results are merged together and stored again in the Data Lake. In terms of variety, our solution allows for the structuralization of data stored in the Data Lake by building table-like rowsets over the unstructured data kept in various formats. The Data Lake does not have to be fully described, both in terms of attribute names and data types, as the rowset schema is defined dynamically, on data read. Moreover, for complex data, like images coming from medical imaging or genetic data from DNA next-generation sequencing experiments, custom extractors for building rowsets provide the way for representing data in a structured form. The presented solution is also highly scalable on the Azure cloud, which allows accommodating the growth of data in environments or domains in which data are generated fast and decisions must be made quickly on the basis of the performed analyses. This feature supports the velocity characteristic of Big Data. We also strongly believe that the presented fuzzy extensions to the U-SQL language will allow taking better decisions on the basis of more valuable information mined from the Data Lake. The value that is hidden in big data can be revealed by finding and associating similar cases with the use of the flexible filtering and fuzzy grouping methods delivered by the Fuzzy Search Library. These operations, performed as a part of the data processing phase, allow incorporating the knowledge of experts not only in the analysis of biomedical data, as presented here, but also in other domains.

Among the limitations of the presented solution, we have to mention the necessity to operate on a specific cloud platform and to adapt to the specificity of the Azure Data Lake environment. FSL4DLA is a fully cloud-dedicated solution. There is no possibility to use the FSL4DLA at large scale on hardware kept on premises, although local deployments are possible for testing purposes. Moreover, U-SQL developers have to pay for the computations related to U-SQL query execution. On the other hand, they do not have to cover the costs of maintenance of a whole hardware infrastructure kept on premises.

Our intention was to develop the Fuzzy Search Library for Big Data Lake as a universal tool that delivers methods for scalable, fuzzy data processing on the Azure cloud for various domains of analyzed data. In the presented solution, experts' knowledge is modularized in domain-specific libraries that can be provided separately for various domains by various experts. These domain-specific libraries are independent from the Fuzzy Search Library. However, the Fuzzy Search Library provides common methods for translating the experts' knowledge into efficient analysis of Big Data. This comes at the cost of fitting into our implementation model for domain-specific libraries, but we believe it can be beneficial for many domains and for the future reuse of once-created libraries of experts' knowledge. With this paper, we hope to attract experts and specialists who use fuzzy techniques in processing data in their areas of interest to create domain-specific libraries and make them freely available for the community of users.

ACKNOWLEDGMENT

This work was supported by Microsoft Research within a Microsoft Azure for Research Award grant, and by the Statutory Research funds of the Institute of Informatics, Silesian University of Technology, Gliwice, Poland (grant No BK/213/RAU2/2018). The Fuzzy Search Library for Data Lake Analytics (FSL4DLA) is available at http://www.zti.aei.polsl.pl/w3/dmrozek/science/fsl4dla.htm. Users must use their own Microsoft Azure subscription in order to run the FSL4DLA on the cloud.
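As a closing illustration of the two operations named above — flexible filtering with a λ similarity threshold and fuzzy grouping by a linguistic variable — the following sketch shows the idea in Python. It is illustrative only, not the FSL4DLA API: the attribute values, membership breakpoints, and the five linguistic labels for CigarettesPerDay are hypothetical, and a full fuzzy grouping would assign membership degrees rather than this simplified crisp mapping.

```python
def membership_high(cholesterol: float) -> float:
    """Rising membership in a fuzzy set 'high cholesterol' (mg/dL);
    the breakpoints 200 and 240 are assumed for illustration."""
    return max(0.0, min(1.0, (cholesterol - 200) / 40.0))

def fuzzy_filter(rows, mu, lam):
    """Keep rows whose membership degree reaches the λ threshold."""
    return [r for r in rows if mu(r) >= lam]

def linguistic_value(cigs: int) -> str:
    """Crisp simplification of a five-valued linguistic variable
    for CigarettesPerDay (hypothetical labels and thresholds)."""
    if cigs == 0:
        return "none"
    if cigs <= 5:
        return "few"
    if cigs <= 15:
        return "moderate"
    if cigs <= 25:
        return "many"
    return "very many"

cholesterol = [180, 210, 235, 250, 300]
print(fuzzy_filter(cholesterol, membership_high, 0.9))  # strict:  [250, 300]
print(fuzzy_filter(cholesterol, membership_high, 0.5))  # relaxed: [235, 250, 300]

cigarettes = [0, 2, 3, 8, 12, 20, 40]
groups = {}
for c in cigarettes:
    groups.setdefault(linguistic_value(c), []).append(c)
print(len(set(cigarettes)), "crisp groups vs", len(groups), "linguistic groups")
```

Lowering λ admits more similar cases, while grouping by linguistic labels collapses many distinct crisp values into a handful of representative groups, in the spirit of the card(R_F) = 5 versus card(R_C) = 197 comparison for query scenario SQ4.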
