
Tiered Data Integration for Mobile Health Systems

Mervat Abu-Elkheir
Faculty of Computer and Information Sciences
Mansoura University
Mansoura, Egypt
mfahmy78@mans.edu.eg

Najah A. Abu Ali
College of Information Technology
United Arab Emirates University
Al-Ain, UAE
najah@uaeu.ac.ae

Abstract—Mobile health (mHealth) systems are among the most promising instantiations of the Internet of Things (IoT). They aim to deliver intelligent health monitoring and assisted living as well as advanced and integrated health services. To realize the full potential of these services, the fragmented and heterogeneous data generated by different segments of the system needs to be consolidated in order to support high-quality processes. This paper proposes a tiered data integration scheme for mHealth systems that works on the schema, entity, and event levels. The proposed scheme incorporates an algorithm that merges and ranks sensor streams for schema integration and event identification, and performs contextual record registration and deduplication for entity resolution. We tested the proposed integration scheme on two sets of sensor-based mHealth data related to human activity recognition. Preliminary results show that the proposed integration scheme improves event identification precision compared to the classification performance of separate datasets produced within the same mHealth system.

Keywords—data integration; mHealth; sensor networks; schema integration

I. INTRODUCTION

Data integration is the process of turning data created by independent data sources into a unified dataset that can be used for analysis. This process is an integral element of data preprocessing, which is crucial for any data management system since it ensures that data is useful for smart monitoring and actuation [1]. Data integration is becoming an essential requirement for IoT-enabled services and applications, since millions of sensing and identification devices will be connected to the Internet in order to make their data available [2]. To start generating profit from such services and applications, stakeholders will need the data used by those services and applications to be timely and accurate. Therefore, consolidation of data to produce high-quality processes that can support integrated services is crucial. This is especially evident in the field of health systems, where the requirement to integrate large and heterogeneous data sources from multiple providers in order to enable sophisticated analysis is on the rise [3].

Data integration involves two primary operations: entity resolution and schema integration. Embedded in the entity resolution process are record linkage and deduplication, which involve finding entities that may be considered similar and consolidating their records. Attribute redundancy reduction and attribute mappings are handled within schema integration. Schema integration also aims to provide a seamless and unified view of the underlying data to outside users.

Deterministic and probabilistic record linkage schemes for entity resolution in medical records have been researched extensively [4]. Most of these schemes require manual validation of linkage to be performed by medical experts in order to guarantee the integrity of the resulting consolidated entities. Little work has been done to fully automate and parallelize this process. This is an essential requirement in mHealth systems, especially with the growing number of users, the variety of contexts in which they can exist, and the streaming nature of data they can generate using smart wearable devices. Automating the process of data integration from multiple, heterogeneous sensors is a vital requirement, especially as a user (i.e., a patient) can exist in varying IoT-enabled contexts and be linked to different connected devices over time. Delaying the integration process until experts are available to look at the data may result in poor response to critical and urgent situations.

Two strategies can be used to map the individual sources to a unified virtual view: the global-as-view and the local-as-view [5]. In the global-as-view (GAV) mapping strategy, the virtual view is a unique, predefined, and global mapping from the local sources to the global schema. The integration complexity is confined to a mediator module, which maps a global query into a set of local query plans that retrieve data from the local sources. In the local-as-view (LAV) mapping strategy, the data from the individual sources are transformed into local views that are migrated to upper-layer repositories. The query processor has to figure out how to map a query placed by the user into an execution plan over the sources. Due to the nature of sensor-based mHealth systems, a hybrid of both strategies is needed. Today's layered and heterogeneous mHealth systems need the stability of GAV, which guarantees consistent performance, as well as the elasticity of LAV, which allows for a flexible data federation with little need for schema redesign.

In this paper, we propose an integration scheme for mHealth systems that can take multiple data sources and consolidate their data for better assessment of health-related conditions. The data integration scheme spans multiple tiers that correspond to different levels in the mHealth system: the sources level, the entity level, the events level, and the schema level. The sources level corresponds to the integration of sensor data streams. The entity level corresponds to the integration of sensor data associated with one person. The

978-1-4799-5952-5/15/$31.00 ©2015 IEEE


event level corresponds to associating heterogeneous sensor data with health-related events and conditions. The schema level corresponds to the integration of mHealth data across multiple persons associated with a health institute. Sources integration involves statistics-based clustering of sensor streams that uses both sensor modality and the continuously changing context to assess which sources are most similar. Entity integration involves epoch-based clustering of data records based on context and behavior similarity. Through the intersection of both vertical sources integration and horizontal records integration, a more accurate identification of health conditions and associated activities can be achieved. Schema is therefore defined as dynamically changing views that correspond to vertical and horizontal clusters.

This paper is organized as follows. Section II summarizes the current literature related to IoT data integration. In Section III, we outline our proposed tiered integration approach for mHealth applications. Section IV describes the prototype setup and the datasets used, and provides a discussion of preliminary results. We conclude the paper in Section V.

II. RELATED WORK

There is little work in the literature addressing data integration in the context of IoT and mHealth systems. The only two complete systems that address healthcare-related integration are the data federation approach proposed and deployed by the Global Alzheimer's Association Interactive Network (GAAIN) [6] and the Data Tamer system [7]. The GAAIN system performs schema matching manually; attributes in each data source are mapped manually to attributes that are defined in a common data model, but data dictionaries and statistical distributions are being explored as potential automated attribute mapping tools [8]. The Data Tamer system is a data curation system that is built with the purpose of discovering data sources and integrating them to construct a composite updatable unit. The integration scheme recruits the assistance of human experts in order to verify attribute identification and grouping as well as data transformation and deduplication.

Analysis of the structural similarities between multiple datasets in order to identify groups of data values was explored in [9]. However, the proposed scheme supports integration only of homogeneous gene datasets and is therefore too tailored to be generalized to an all-purpose mHealth system.

Semantic description of data sources and their associated data flows has been proposed to enable ontology-based mediation for data integration for logistics processes [10] and for transportation [11]. Ontologies enable powerful processing capabilities, but are usually expensive to build. Context-aware sensor discovery and ranking, which uses semantic querying over an array of user-defined sensor characteristics, was proposed as part of an IoT middleware [12].

III. TIERED INTEGRATION

In this section, we define the data integration problem for mHealth systems and introduce the basic principles behind our tiered integration scheme.

A. Problem Definition

In a typical mHealth system, a person (or a patient) is expected to wear a number of health monitoring sensors. We denote the set of wearable sensors that are attached to a person $p$ as $W_p = \{w_1, \ldots, w_n\}$. The same person can have a number of smart devices and embedded sensors in their household, which may be shared with other people. We denote the set of sensors that are installed at person $p$'s smart home as $H_p = \{h_1, \ldots, h_m\}$. Therefore, each person $p$ is associated with the sensor set $S_p = \{W_p, H_p\}$. However, home sensors may be associated with more than one person.

We assume that each sensor generates a data stream, with each record in the stream representing a set of time-stamped values corresponding to $k$ attributes. More formally, the data generated by a sensor $s_i$ is defined as a time series $D_i = \{ r_t \mid r_t : \langle t, a_1, \ldots, a_k \rangle \}$, where the order of records cannot be controlled. We assume that the sensors can be mobile, so each record $r_t$ in the stream $D_i$ generated by each $s_i$ is geo-stamped to reflect mobility, and the record representation becomes $r_t : \langle t, l_t, a_1, \ldots, a_k \rangle$, where $l_t = (x_t, y_t, z_t)$ is the sensor's location at time $t$. Therefore, the set of streams generated by sensors in $W_p$ will be denoted as $D_{W_p}$ and the set of streams generated by sensors in $H_p$ will be denoted as $D_{H_p}$. The system's data stream hierarchy is illustrated in Fig. 1.

Each sensor is itself described by a set of meta attributes: modality $m$, identification $id$, and position $(x, y, z)$. Home sensors are further identified by the home space in which they are located or embedded. Therefore, the set of home-sensor data streams $D_{H_p}$ can be divided into sub-clusters, one per home space.
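The stream model described above can be sketched in a few lines of Python. This is only an illustrative data model, not the authors' implementation: the class and field names (`Sensor`, `Record`, `Stream`, `home_space`, `group_by_home_space`) are hypothetical, and the window is assumed to be a half-open time interval.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Position = Tuple[float, float, float]


@dataclass(frozen=True)
class Sensor:
    """Meta attributes of one sensor (names are illustrative)."""
    sensor_id: str
    modality: str
    position: Position
    home_space: Optional[str] = None  # set only for home sensors


@dataclass
class Record:
    """One time-stamped, geo-stamped set of attribute values."""
    timestamp: float
    location: Position
    values: Dict[str, float]


@dataclass
class Stream:
    """All records produced by a single sensor."""
    sensor: Sensor
    records: List[Record] = field(default_factory=list)

    def window(self, start: float, end: float) -> List[Record]:
        """Records with start <= timestamp < end, sorted by timestamp,
        because the arrival order of records is not guaranteed."""
        hits = [r for r in self.records if start <= r.timestamp < end]
        return sorted(hits, key=lambda r: r.timestamp)


def group_by_home_space(streams: List[Stream]) -> Dict[str, List[Stream]]:
    """Divide home-sensor streams into sub-clusters, one per home space.
    Wearable streams (home_space is None) are skipped."""
    groups: Dict[str, List[Stream]] = {}
    for s in streams:
        if s.sensor.home_space is not None:
            groups.setdefault(s.sensor.home_space, []).append(s)
    return groups
```

Sorting inside `window` is a design choice that keeps every consumer of a stream independent of the order in which records happened to arrive.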
Fig. 1. mHealth system data streams hierarchy.

Formally, data integration can be defined as a triple $I = \langle G, S, M \rangle$, where $G$ is a virtual mediated view, $S$ is the set of heterogeneous sources, and $M$ is a mapping. When queries are posed to the integration system $I$, mapping $M$ asserts
connections between $G$ and the corresponding elements in $S$. In the context of mHealth systems, $S$ will be the set of streams coming from individual sensors and stored into individual data tables/repositories, and $G$ will be the global database that consolidates those streams in one database/repository. The global view can be further extended to higher abstractions, corresponding to higher levels of integration in the mHealth system hierarchy. These three levels are the sources level, the entities level, and the silos level. In the sources level, multiple data streams from multiple sensors, all associated with a single person, are integrated together into a single repository corresponding to that person. In the entities level, the repositories of individual persons are integrated together into a single repository corresponding to a health provider with whom those persons are associated. In the silos level, the repositories from multiple health providers are integrated to form a consolidated overall view.

Let $S_1, \ldots, S_n$ be the set of source schemas for $n$ data sources, and $G = \{G_1, \ldots, G_m\}$ be the global schema that is defined as a set of tables¹. In the global-as-view, the mapping between the local schemas and the global tables can be one-to-one, where $G_i = S_i$ for $i = 1, \ldots, n$; or more complicated, where $G_i = q(S_1, \ldots, S_n)$ and $q$ is a query expression or a view built over local schemas. This predefinition of a global schema comes at the expense of system flexibility; as new data sources become available, it is necessary to rewrite the mediator module to incorporate changes to the global schema. On the other hand, local-as-view integration defines each of the source schemas as $S_i = v_i(G_1, \ldots, G_m)$, where each $v_i$ is a view built over the global schema. This means that the contribution of each source can be specified by its owner independently of other sources in the system. Furthermore, this mapping strategy makes the system more flexible as new data sources decide to join/leave the system, since no global schema will need to be redefined. Such autonomy and elasticity features make the local-as-view the more appealing integration option for mHealth systems.

B. Tiered mHealth Integration

In an mHealth system, each wearable sensor is considered a source that is formed by a set of records, each representing a set of key-value pairs.

Integration for mobile health systems can be done through three scopes: a personal scope, where data integration is performed over multiple sources that are linked to a single person; a local scope, where a health institute needs to integrate data generated by multiple associated persons with its internal data; and a global scope, where multiple health institutes need to integrate their data to provide large-scale health services.

In each of these scopes, four main levels of integration can be carried out, with varying integration requirements and outcomes. These levels can be outlined as follows:

1) The source level, where data from multiple sources (sensors) is consolidated. The main requirement for this tier is to provide assessment and rating functionalities that enable the exclusion of sources that do not contribute to solid analysis outcomes and would otherwise strain the storage and processing capacity.
2) The entity level, where multiple and possibly duplicate data sources are linked to a single entity: the person. The main requirement for this tier is to provide accurate and continuous record linkage as a person changes context.
3) The event level, where one health condition or event can be identified from multiple heterogeneous data sources or repositories. The main requirement for this tier is to link records that advertise or identify the same event for event registration and verification.
4) The provider level, where data from multiple persons (patients) needs to be joined in a single repository linked to a single healthcare provider. The main requirement for this tier is to provide a global and unified view of the data that can be accessed to obtain insights that will be used for decision making.

The integration process is not isolated at each level, but is rather defined over intersections within each level and across levels. We will not tackle integration at the provider level and will leave it for future work. In the following sections, we introduce two epoch-based clustering approaches: a statistical hierarchical clustering approach and a mixed-features clustering approach. The statistical hierarchical clustering performs vertical clustering among sensor streams by measuring the statistical similarities between sensor streams within a time window. The clustering structure is then adjusted to reflect changes in the streams' statistical features. This will help define clusters that reflect localized activity and thus can be associated with a single person or entity. The mixed-features clustering approach defines clusters of stream records for each of the localized clusters produced by vertical clustering. This enables the integration scheme to capture similar records that are associated with a specific health condition or event. By representing such clusters with their centroids, redundancy of records can be controlled in a meaningful way.

C. Epoch-based Vertical Clustering of Sensor Streams

In order to capture the changing and evolving correlation among health sensors as a person changes context, we consider each sensor to be represented by its stream, and cluster sensor streams at predefined time epochs. Algorithm 1 illustrates this process. The algorithm starts with an initial set of clusters that are prebuilt using a simple geo-context matching scheme based on sensor locations. Sensor streams are not needed at this step, and sensor metadata is used to perform contextual matching to define the initial clustering structure.

At each time epoch, the clustering structure may change as sensors start generating data streams. Therefore, the statistical properties of sensor streams are extracted and used to assess the similarities between pairs of streams (line 7). A new clustering structure is then built using hierarchical clustering that first partitions the initial sensor clusters according to observations of strong localized statistical similarities (lines 10-15), then merges clusters that have a strong interconnectivity and relative closeness (lines 16-21). The outcome of the algorithm is a set of clusters that do not
¹ We assume a relational model for simplicity.
necessarily consist of collocated sensors, with each cluster representing a virtual and dynamic view of statistically relevant sensor streams.

By doing bidirectional hierarchical clustering, the algorithm will capture localized statistical similarities within a single cluster, indicating association of the corresponding sensor streams with a single person. The agglomerative merger of those localized clusters discards the old collocated clustering in favor of new clusters that share the same statistical properties and thus have a strong association with a certain health condition or activity.

Algorithm 1: Epoch-based Vertical Sensors Clustering
Input:
− An initial set of clusters $C$ where each cluster $c \in C$ corresponds to a context domain that includes collocated sensors.
− A set of data streams $D$ generated by sensors within a time window $T$.
Output:
− Updated cluster set $C'$.
Procedure:
1. begin at each time window $T$
2. for each cluster $c \in C$ do
3. Identify the set of streams $D_c$ generated by sensors $s \in c$.
4. for each data stream $d \in D_c$ do
5. Identify statistical profile $P_d$ over window $T$.
6. end for
7. Measure similarity $sim(P_i, P_j)$ for all pairs of statistical profiles $P_i, P_j$ and assign as edge weight between streams $d_i, d_j$.
8. end for
9. .
10. for each cluster $c$ with defined intra-cluster edge weights
11. Apply a hierarchical clustering algorithm such as Chameleon to partition cluster $c$ into two clusters.
12. if $c$ is partitioned do
13. , .
14. end if
15. end for
16. for each pair of clusters $c_i, c_j$
17. if relative interconnectivity $RI(c_i, c_j)$² $> \theta$ do
18. Merge clusters $c_i, c_j$ into one cluster $c_{ij}$.
19. .
20. end if
21. end for
22. .
23. return $C'$
24. end

D. Mixed-features Clustering

The mixed-features clustering step aims at grouping stream records whose collective features possess a strong similarity that indicates the representation of a certain health condition or activity. It does this by implementing a simple clustering scheme of records within each cluster in the updated cluster set for the current time epoch and measuring the similarity using a classical mixed-features distance measure. At each time epoch, the algorithm will compute the distance between two records $r_i$ and $r_j$ within a cluster as:

$$d(r_i, r_j) = \frac{\sum_{k} \delta_{ijk} \, d_{ijk}}{\sum_{k} \delta_{ijk}}, \qquad (1)$$

where $\delta_{ijk}$ is an indicator that is equal to 0 if either $x_{ik}$ or $x_{jk}$ is missing (which means that there is no measurement from sensor $k$ for either record $r_i$ or $r_j$); otherwise $\delta_{ijk} = 1$. The value $d_{ijk}$ represents the contribution of each sensor to the similarity between records $r_i$ and $r_j$, and is computed differently for different sensor reading types. If sensor readings are numeric, then $d_{ijk}$ is the normalized difference between the two readings and is computed as $d_{ijk} = |x_{ik} - x_{jk}| / (\max_h x_{hk} - \min_h x_{hk})$, where $h$ runs over all sensor values within the epoch. If sensor readings are binary or categorical, $d_{ijk} = 0$ if $x_{ik} = x_{jk}$; otherwise $d_{ijk} = 1$.

IV. PROTOTYPE SETUP AND EXPERIMENTAL VALIDATION

The prototype setup that was used for the proposed integration scheme is composed of a 4-node Vagrant virtual cluster: one namenode and 3 datanodes. Datanodes each run a 64-bit CentOS 6.5 on 2048MB of RAM. The master namenode uses 12MB of RAM. Each node runs HDFS, Zookeeper, Kafka, Spark, and HBase. The prototype was fed mobile health data that was generated from two separate experiments, both conducted for the purpose of recognizing daily human activity, one using wearable sensors and the other using home sensors. A classification model was built on top of this platform to assess how integration of sources and consolidation of entity records affects the accuracy of identifying the different activities. The classification performance was assessed for a single dataset and then for the integrated datasets.

A. Datasets Description and Preparation

The two datasets to be integrated are both public datasets that are related to human activity recognition. Human activity recognition is an important health service that can provide mHealth applications with the functionalities necessary for elderly care and emergency response, among other uses. A summary of the activities identified by each dataset, the types of sensors used and their numbers, as well as the number of participants, is outlined in Table I. We provide a brief description of both datasets below:
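The mixed-features distance of Eq. (1), as described in Section III-D (an indicator that drops sensors with a missing reading, a range-normalized difference for numeric readings, and a 0/1 mismatch for binary or categorical readings), can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `mixed_distance`, the dictionary-based record layout, and the convention that a sensor is numeric exactly when it appears in `ranges` are all illustrative choices, and averaging over the comparable sensors is assumed from the description of the indicator.

```python
from typing import Dict, Hashable, Optional, Tuple

Reading = Optional[Hashable]  # float for numeric sensors, str/bool otherwise


def mixed_distance(
    r1: Dict[str, Reading],
    r2: Dict[str, Reading],
    ranges: Dict[str, Tuple[float, float]],
) -> Optional[float]:
    """Distance between two records keyed by sensor name.

    ranges maps each numeric sensor to the (min, max) of its readings
    within the current epoch; sensors absent from ranges are treated as
    binary or categorical. Returns None when no sensor is comparable.
    """
    total = 0.0
    comparable = 0
    for sensor in r1.keys() | r2.keys():
        x, y = r1.get(sensor), r2.get(sensor)
        if x is None or y is None:
            continue  # indicator is 0: the sensor contributes nothing
        if sensor in ranges:
            lo, hi = ranges[sensor]
            # Normalized absolute difference; a constant sensor adds 0.
            d = abs(x - y) / (hi - lo) if hi > lo else 0.0
        else:
            d = 0.0 if x == y else 1.0
        total += d
        comparable += 1
    return total / comparable if comparable else None
```

Returning `None` for a pair with no shared sensors, rather than 0, avoids treating two incomparable records as identical; a caller would have to decide how to cluster such pairs.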

² Relative interconnectivity is defined in the Chameleon algorithm as the normalized minimum sum of cut edges that partition a cluster into two roughly equal parts.
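The overall shape of Algorithm 1 (profile each stream, weight stream pairs by profile similarity, split the location-based clusters, then merge statistically similar parts) can be illustrated with a deliberately simplified sketch. It is not the authors' implementation and it does not use Chameleon: the statistical profile is assumed to be (mean, standard deviation), the similarity is an assumed inverse-distance score, the split step is replaced by connected components over edges above a threshold, and the merge test uses average cross-cluster similarity in place of relative interconnectivity. All function names and both thresholds are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def profile(values):
    """Assumed statistical profile of a stream over one window."""
    return (mean(values), pstdev(values))


def similarity(p, q):
    """Assumed similarity in (0, 1]; 1.0 means identical profiles."""
    dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return 1.0 / (1.0 + dist)


def components(nodes, weights, threshold):
    """Connected components using only edges with weight >= threshold."""
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        if weights[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def recluster(initial_clusters, streams, split_at=0.5, merge_at=0.5):
    """One epoch: streams maps stream name -> list of numeric readings."""
    profiles = {name: profile(vals) for name, vals in streams.items()}
    weights = {
        frozenset((a, b)): similarity(profiles[a], profiles[b])
        for a, b in combinations(streams, 2)
    }
    # Divisive step: split each location-based cluster.
    parts = []
    for cluster in initial_clusters:
        parts.extend(components(cluster, weights, split_at))
    # Agglomerative step: merge parts until no pair is similar enough.
    merged = True
    while merged:
        merged = False
        for a, b in combinations(parts, 2):
            cross = [weights[frozenset((x, y))] for x in a for y in b]
            if mean(cross) > merge_at:
                parts.remove(a)
                parts.remove(b)
                parts.append(a | b)
                merged = True
                break
    return parts
```

With two location-based clusters whose members are statistically dissimilar, the sketch regroups streams by statistical similarity rather than by collocation, which is the behavior the text attributes to the bidirectional clustering.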
TABLE I. SUMMARY OF THE TWO DATASETS

Dataset          Attribute       Value
mHealth Dataset  Sensor Type     Accelerometer (chest, wrist, ankle), ECG, Gyroscope (wrist, ankle), Magnetometer (wrist, ankle)
mHealth Dataset  Activity Class  Standing still, Sitting and relaxing, Lying down, Walking, Climbing stairs, Waist bends forward, Front arm elevation, Knees bending, Cycling, Jogging, Running, Jump front & back
mHealth Dataset  #Sensors        8
mHealth Dataset  #Subjects       10
ADL Dataset      Sensor Type     PIR, Magnetic, Flush, Pressure, Electric
ADL Dataset      Activity Class  Leaving, Toileting, Showering, Sleeping, Breakfast, Lunch, Dinner, Snack, Spare_Time/TV, Grooming
ADL Dataset      #Sensors        12
ADL Dataset      #Subjects       2

1) The mHealth dataset: This dataset consists of records pertaining to body motion and vital signs for ten volunteers of diverse profiles. Sensors placed on each subject's chest, right wrist and left ankle are used to measure the motion experienced by diverse body parts. The sensor positioned on the chest also provides 2-lead ECG measurements, which can potentially be used for basic heart monitoring, checking for various arrhythmias or looking at the effects of exercise on the ECG [13]. The mHealth dataset includes fine-grained real-valued sensor readings of activities over a short time interval. No explicit timestamps or locations were included in the dataset.

2) The Activities of Daily Living dataset: This dataset consists of records about the daily activities performed by two users in their own homes. Sensor events were recorded using a wireless sensor network installed at different locations around the house [14]. The ADL dataset includes binary data that identifies what a person is doing based on the presence of a trigger for a specific sensor. Timestamps are defined as a time interval specified for each activity, and intervals are not necessarily contiguous. The location of each sensor is included, and activity location is implicitly identified when a sensor is triggered at a specific location.

Realizing that both datasets were generated from separate experiments, we did not perform entity resolution to consolidate duplicate records related to the same person. Instead, we only performed event (activity) resolution via the integration of the different sources and the resolution of records in both datasets. In addition, we needed to run a semantic matching script in order to map the activity labels that were used in the ADL dataset to their semantically comparable coded labels in the mHealth dataset. Based on this mapping, we synthesized timestamps for the mHealth dataset that correspond to its defined activity intervals.

B. Performance Analysis and Discussion

In order to evaluate the effect of the integration scheme's incorporation of similar sensor streams, the MLP classification algorithm [15] was used to classify the human activity events within the integrated datasets, and the classification performance was compared against a baseline classifier model built using only the mHealth dataset. We chose the mHealth dataset for the baseline classifier because of its comprehensive array of wearable sensors, which can closely identify personal activities. The classifiers were each run with a 10-fold cross-validation setting. As can be seen in Fig. 2, precision values are improved by the integration scheme over all activity classes. The performance of the integration scheme in terms of class recall was slightly higher than that of the baseline classifier, as can be seen in Fig. 3. Activities that involve sharp fluctuations in sensor readings over a short time interval were harder to identify, as can be seen in the figures.

To evaluate how the size of clusters in the integration scheme affected the overall accuracy of the classifier, we used the clustering structure from a single time epoch to investigate the classification accuracy. Accuracy values of the classifiers

[Figs. 2 and 3: percentage (0% to 100%) per activity class (1 to 12), comparing classification using the mHealth dataset with classification with tiered integration.]

Fig. 2. Classifier precision comparison.
Fig. 3. Classifier recall comparison.


using sensor clusters with the same number of sensors were averaged for different cluster sizes. As Fig. 4 shows, performance is not linear with the size of clusters, indicating the existence of smaller subsets of sensors that can provide more accurate identification of health conditions and activities.

[Fig. 4: percentage (0% to 100%) versus number of sensors per cluster (1 to 20).]

Fig. 4. Average accuracy of classifier based on the change in the size of integrated clusters of sensors.

V. CONCLUSIONS

In this paper, we proposed and evaluated an integration scheme for mHealth systems. The integration scheme deployed a tiered approach to integrate the different wearable and home sensors attached to a person, and a contextual record registration and deduplication to associate data generated by home sensors to a single person in order to achieve entity resolution. We applied the tiered data integration scheme to two activity recognition datasets produced by wearable sensors and home sensors. The preliminary performance evaluation showed promising results in terms of precision improvement, with comparable recall values for most of the activity classes.

We are currently building a real-life testbed consisting of two body area networks whose sensor data are to be collected in real time and integrated with the data produced by a set of contextual sensors. Our next step is to investigate the expansion of the proposed approach by enhancing the memoryless sensor clustering approach to incrementally incorporate minor clustering structure changes in successive time epochs.

ACKNOWLEDGMENT

This work was made possible by NRF-UAEU grant #31T005 from United Arab Emirates University, NRF.

REFERENCES

[1] J. Gubbi, R. Buyya, S. Marusic and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Future Generation Computer Systems, vol. 29, no. 7, pp. 1645-1660, 2013.
[2] M. Chen, S. Mao and Y. Liu, "Big Data: A Survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.
[3] J. Dipnall, M. Berk, F. Jacka, L. Williams, S. Dodd and J. Pasco, "Data Integration Protocol In Ten-steps (DIPIT): A new standard for medical researchers," Methods, vol. 69, no. 3, pp. 237-246, 2014.
[4] M. Tromp, A. Ravelli, G. Bonsel, A. Hasman and J. Reitsma, "Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage," Journal of Clinical Epidemiology, vol. 64, no. 5, pp. 565-572, 2011.
[5] M. Friedman, A. Levy and T. Millstein, "Navigational Plans for Data Integration," in Proceedings of the 16th National Conference on Artificial Intelligence and the 11th Innovative Applications of Artificial Intelligence Conference, Orlando, Florida, USA, 1999.
[6] N. Ashish and A. Toga, "Medical data transformation using rewriting," Frontiers in Neuroinformatics, vol. 9, no. 1, pp. 1-8, 2015.
[7] M. Stonebraker, G. Beskales, A. Pagan, D. Bruckner, M. Cherniack, S. Xu, I. Ilyas and S. Zdonik, "Data Curation at Scale: The Data Tamer System," in The 6th Biennial Conference on Innovative Data Systems Research (CIDR'13), 2013.
[8] P. Dewan, N. Ashish and A. Toga, "A schema-matching tool for Alzheimer's disease data integration," in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Newport Beach, California, 2014.
[9] P. Kirk, J. Griffin, R. Savage, Z. Ghahramani and D. Wild, "Bayesian correlated clustering to integrate multiple datasets," Bioinformatics, vol. 28, no. 24, pp. 3290-3297, 2012.
[10] K. Hribernik, C. Hans, C. Kramer and K.-D. Thoben, "A Service-oriented, Semantic Approach to Data Integration for an Internet of Things Supporting Autonomous Cooperating Logistics Processes," in Architecting the Internet of Things, D. Uckelmann, M. Harrison and F. Michahelles, Eds., Springer Berlin Heidelberg, 2011, pp. 131-158.
[11] C. Fan, J. Song, Z. Wen, X. Zhang, Y. Wu and J. Zou, "A scalable Internet of Things Lean Data provision architecture based on ontology," in IEEE GCC Conference and Exhibition (GCC), Dubai, 2011.
[12] C. Perera, A. Zaslavsky, C. Liu, M. Compton, P. Christen and D. Georgakopoulos, "Sensor Search Techniques for Sensing as a Service Architecture for the Internet of Things," IEEE Sensors Journal, vol. 14, no. 2, pp. 406-420, 2014.
[13] O. Banos, R. Garcia, J. A. Holgado, M. Damas, H. Pomares, I. Rojas, A. Saez and C. Villalonga, "mHealthDroid: a novel framework for agile development of mobile health applications," in Proceedings of the 6th International Work-conference on Ambient Assisted Living and Active Ageing (IWAAL 2014), Belfast, Northern Ireland, 2014.
[14] F. Ordóñez, P. de Toledo and A. Sanchis, "Activity Recognition Using Hybrid Generative/Discriminative Models on Home Environments Using Binary Sensors," Sensors, vol. 13, no. 5, pp. 5460-5477, 2013.
[15] T. Breuel and F. Shafait, "AutoMLP: Simple, Effective, Fully Automated Learning Rate and Size Adjustment," The Learning Workshop, Cliff Lodge, Snowbird, Utah, United States, 2010.
