Abstract—One of the most promising instantiations of the Internet of Things (IoT) is mobile health (mHealth) systems, which promise to deliver intelligent health monitoring and assisted living as well as advanced and integrated health services. To realize the full potential of these services, fragmented and heterogeneous data generated by different segments of the system needs to be consolidated in order to support high-quality processes. This paper proposes a tiered data integration scheme for mHealth systems that works on the schema, entity, and event levels. The proposed scheme incorporates an algorithm that merges and ranks sensor streams for schema integration and event identification, and performs contextual record registration and deduplication for entity resolution. We tested the proposed integration scheme on two sets of sensor-based mHealth data related to human activity recognition. Preliminary results show that the proposed integration scheme improves event identification precision compared to the classification performance of separate datasets produced within the same mHealth system.

Keywords—data integration; mHealth; sensor networks; schema integration

I. INTRODUCTION

Data integration is the process of turning data created by independent data sources into a unified dataset that can be used for analysis. This process is an integral element of data preprocessing, which is crucial for any data management system since it ensures that data is useful for smart monitoring and actuation [1]. Data integration is becoming an essential requirement for IoT-enabled services and applications, since millions of sensing and identification devices will be connected to the Internet in order to make their data available [2]. To profit from such services and applications, stakeholders will need the data they use to be timely and accurate. Therefore, consolidation of data to produce high-quality processes that can support integrated services is crucial. This is especially evident in the field of health systems, where the requirement to integrate large and heterogeneous data sources from multiple providers in order to enable sophisticated analysis is on the rise [3].

Data integration involves two primary operations: entity resolution and schema integration. Entity resolution embeds record linkage and deduplication, which involve finding entities that may be considered similar and consolidating their records. Attribute redundancy reduction and attribute mappings are handled within schema integration. Schema integration also aims to provide a seamless and unified view of the underlying data to outside users.

Deterministic and probabilistic record linkage schemes for entity resolution in medical records have been researched extensively [4]. Most of these schemes require manual validation of linkage to be performed by medical experts in order to guarantee the integrity of the resulting consolidated entities. Little work has been done to fully automate and parallelize this process. This is an essential requirement in mHealth systems, especially with the growing number of users, the variety of contexts in which they can exist, and the streaming nature of data they can generate using smart wearable devices. Automating the process of data integration from multiple, heterogeneous sensors is a vital requirement, especially as a user (i.e., a patient) can exist in varying IoT-enabled contexts and be linked to different connected devices over time. Delaying the integration process until experts are available to look at the data may result in poor response to critical and urgent situations.

Two strategies can be used to map the individual sources to a unified virtual view: the global-as-view and the local-as-view [5]. In the global-as-view (GAV) mapping strategy, the virtual view is a unique, predefined, and global mapping from the local sources to the global schema. The integration complexity is confined to a mediator module, which maps a global query into a set of local query plans that retrieve data from the local sources. In the local-as-view (LAV) mapping strategy, the data from the individual sources are transformed into local views that are migrated to upper-layer repositories. The query processor has to figure out how to map a query placed by the user into an execution plan over the sources. Due to the nature of sensor-based mHealth systems, a hybrid of both strategies is needed. Today's layered and heterogeneous mHealth systems need the stability of GAV, which guarantees consistent performance, as well as the elasticity of LAV, which allows for a flexible data federation with little need for schema redesign.

In this paper, we propose an integration scheme for mHealth systems that can take multiple data sources and consolidate their data for better assessment of health-related conditions. The data integration scheme spans multiple tiers that correspond to different levels in the mHealth system: the sources level, the entity level, the events level, and the schema level. The sources level corresponds to the integration of sensor data streams. The entity level corresponds to the integration of sensor data associated with one person.
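As a small illustration of the GAV strategy described in the introduction, the sketch below shows a mediator that unfolds a query on one global attribute into lookups against two local sources. Everything in it is hypothetical and not taken from the paper: the source names (`wearable`, `home_hub`), their field names, the `GLOBAL_VIEW` table, and the `mediate` function are assumptions chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple

# Hypothetical local sources; each one keeps its own field names.
WEARABLE = [{"uid": "p1", "hr_bpm": 72}, {"uid": "p2", "hr_bpm": 88}]
HOME_HUB = [{"person": "p1", "pulse": 75}]

SOURCES: Dict[str, List[dict]] = {"wearable": WEARABLE, "home_hub": HOME_HUB}

# GAV: each global attribute is defined once, as a view over the local sources.
# Each entry is (source name, local id field, local value field).
GLOBAL_VIEW: Dict[str, List[Tuple[str, str, str]]] = {
    "heart_rate": [
        ("wearable", "uid", "hr_bpm"),
        ("home_hub", "person", "pulse"),
    ],
}


def mediate(attribute: str, patient: str) -> List[Tuple[str, int]]:
    """Unfold a global query into one local lookup per mapped source."""
    results = []
    for source, id_field, value_field in GLOBAL_VIEW[attribute]:
        for row in SOURCES[source]:
            if row[id_field] == patient:
                results.append((source, row[value_field]))
    return results


readings = mediate("heart_rate", "p1")
print(readings)  # [('wearable', 72), ('home_hub', 75)]
```

In this sketch, adding a third source means editing `GLOBAL_VIEW` itself, which hints at why a purely GAV design can be rigid when sensors join and leave over time.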
D. Mixed-features Clustering

where $\delta_{ij}^{(k)}$ is an indicator that is equal to 0 if either $x_{ik}$ or $x_{jk}$ is missing (which means that there is no measurement from sensor $k$ for either record $i$ or record $j$); otherwise $\delta_{ij}^{(k)} = 1$. The value $d_{ij}^{(k)}$ represents the contribution of each sensor to the similarity between records $i$ and $j$, and is computed differently for different sensor reading types. If sensor readings are numeric, then $d_{ij}^{(k)}$ is the normalized difference between the two readings and is computed as $d_{ij}^{(k)} = |x_{ik} - x_{jk}| / (\max_h x_{hk} - \min_h x_{hk})$, where $h$ runs over all sensor values within the epoch. If sensor readings are binary or categorical, $d_{ij}^{(k)} = 0$ if $x_{ik} = x_{jk}$; otherwise $d_{ij}^{(k)} = 1$.

Algorithm 1: Epoch-based Vertical Sensors Clustering
Input:
− An initial set of clusters, where each cluster corresponds to a context domain that includes collocated sensors.
− A set of data streams generated by sensors within a time window.
Output:
− Updated cluster set.
Procedure:
1. begin at each time window
2.   for each cluster do
3.     Identify the set of streams generated by its sensors.
4.     for each data stream do
5.       Identify statistical profile over the window.
6.     end for
7.     Measure similarity for all pairs of statistical profiles, and assign it as the edge weight between the corresponding streams.
8.   end for
9.   .
10.  for each cluster with defined intra-cluster edge weights
11.    Apply a hierarchical clustering algorithm such as Chameleon to partition the cluster into two clusters.
12.    if the cluster is partitioned do
13.      .
14.    end if
15.  end for
16.  for each pair of clusters
17.    if the relative interconnectivity² of the pair exceeds a threshold do
18.      Merge the two clusters into one cluster.
19.      .
20.    end if
21.  end for
22.  .
23.  return the updated cluster set
24. end

² Relative interconnectivity is defined in the Chameleon algorithm as the normalized minimum sum of cut edges that partition a cluster into two roughly equal parts.

IV. PROTOTYPE SETUP AND EXPERIMENTAL VALIDATION

The prototype setup that was used for the proposed integration scheme is composed of a 4-node Vagrant virtual cluster: one namenode and 3 datanodes. Each datanode runs 64-bit CentOS 6.5 on 2048 MB of RAM. The master namenode uses 12 MB of RAM. Each node runs HDFS, Zookeeper, Kafka, Spark, and HBase. The prototype was fed mobile health data generated from two separate experiments, both conducted for the purpose of recognizing daily human activity, one using wearable sensors and the other using home sensors. A classification model was built on top of this platform to assess how integration of sources and consolidation of entity records affect the accuracy of identifying the different activities. The classification performance was assessed for a single dataset and then for the integrated datasets.

A. Datasets Description and Preparation

The two datasets to be integrated are both public datasets related to human activity recognition. Human activity recognition is an important health service that can provide mHealth applications with the functionalities necessary for elderly care and emergency response, among other uses. A summary of the activities identified by each dataset, the types of sensors used and their numbers, as well as the number of participants, is outlined in Table I. We provide a brief description of both datasets below:
TABLE I. SUMMARY OF THE TWO DATASETS

mHealth Dataset
  Sensor Type: Accelerometer (chest, wrist, ankle), ECG, Gyroscope (wrist, ankle), Magnetometer (wrist, ankle)
  Activity Class: Standing still, Sitting and relaxing, Lying down, Walking, Climbing stairs, Waist bends forward, Front arm elevation, Knees bending, Cycling, Jogging, Running, Jump front & back
  #Sensors: 8
  #Subjects: 10

ADL Dataset
  Sensor Type: PIR, Magnetic, Flush, Pressure, Electric
  Activity Class: Leaving, Toileting, Showering, Sleeping, Breakfast, Lunch, Dinner, Snack, Spare_Time/TV, Grooming
  #Sensors: 12
  #Subjects: 2

1) The mHealth dataset: This dataset consists of records pertaining to body motion and vital signs for ten volunteers of diverse profiles. Sensors placed on each subject's chest, right wrist and left ankle are used to measure the motion experienced by diverse body parts. The sensor positioned on the chest also provides 2-lead ECG measurements, which can potentially be used for basic heart monitoring, checking for various arrhythmias or looking at the effects of exercise on the ECG [13]. The mHealth dataset includes fine-grained real-valued sensor readings of activities over a short time interval. No explicit timestamps or locations were included in the dataset.

2) The Activities of Daily Living dataset: This dataset consists of records about the daily activities performed by two users in their own homes. Sensor events were recorded using a wireless sensor network installed at different locations around the house [14]. The ADL dataset includes binary data that identifies what a person is doing based on the presence of a trigger for a specific sensor. Timestamps are defined as a time interval specified for each activity, and intervals are not necessarily contiguous. The location of each sensor is included, and activity location is implicitly identified when a sensor is triggered at a specific location.

Because the two datasets were generated from separate experiments, we did not perform entity resolution to consolidate duplicate records related to the same person. Instead, we only performed event (activity) resolution via the integration of the different sources and the resolution of records in both datasets. In addition, we needed to run a semantic matching script to map the activity labels used in the ADL dataset to their semantically comparable coded labels in the mHealth dataset. Based on this mapping, we synthesized timestamps for the mHealth dataset that correspond to its defined activity intervals.

B. Performance Analysis and Discussion

To evaluate the effect of the integration scheme's incorporation of similar sensor streams, the MLP classification algorithm [15] was used to classify the human activity events within the integrated datasets, and the classification performance was compared against a baseline classifier model built using only the mHealth dataset. We chose the mHealth dataset for the baseline classifier because of its comprehensive array of wearable sensors, which can closely identify personal activities. The classifiers were each run with a 10-fold cross-validation setting. As can be seen in Fig. 2, precision values are improved by the integration scheme over all activity classes. The performance of the integration scheme in terms of class recall was slightly higher than that of the baseline classifier, as can be seen in Fig. 3. Activities that involve sharp fluctuations in sensor readings over a short time interval were harder to identify, as can be seen in the figures.

To evaluate how the size of clusters in the integration scheme affected the overall accuracy of the classifier, we used the clustering structure from a single time epoch to investigate the classification accuracy.
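The clusters examined here depend on the per-sensor contributions defined under Mixed-features Clustering. The sketch below shows one way those contributions might be combined into a single value for a pair of records. It assumes a Gower-style average over the sensors for which both records have a reading; that aggregation rule, the function names (`sensor_ranges`, `dissimilarity`), and the sample readings are assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List, Optional, Union

Reading = Optional[Union[float, str, bool]]
Record = Dict[str, Reading]


def sensor_ranges(records: List[Record], numeric_sensors: List[str]) -> Dict[str, float]:
    """Range (max - min) of each numeric sensor over all records in the epoch."""
    ranges = {}
    for k in numeric_sensors:
        values = [r[k] for r in records if r.get(k) is not None]
        ranges[k] = (max(values) - min(values)) if values else 0.0
    return ranges


def dissimilarity(a: Record, b: Record, ranges: Dict[str, float]) -> Optional[float]:
    """Average per-sensor contribution over sensors present in both records.

    Assumption: contributions are averaged Gower-style. Returns None when the
    two records share no sensor reading at all.
    """
    total, count = 0.0, 0
    for k in set(a) | set(b):
        x, y = a.get(k), b.get(k)
        if x is None or y is None:
            continue  # indicator is 0: the sensor is missing for one record
        if k in ranges:
            # Numeric reading: difference normalized by the sensor's range.
            d = abs(x - y) / ranges[k] if ranges[k] > 0 else 0.0
        else:
            # Binary or categorical reading: 0 if equal, otherwise 1.
            d = 0.0 if x == y else 1.0
        total += d
        count += 1
    return total / count if count else None


# Made-up readings for three records within one epoch.
epoch = [
    {"accel": 1.0, "pir": True, "ecg": 60.0},
    {"accel": 2.0, "pir": False, "ecg": None},
    {"accel": 3.0, "pir": True, "ecg": 80.0},
]
ranges = sensor_ranges(epoch, ["accel", "ecg"])
d01 = dissimilarity(epoch[0], epoch[1], ranges)
d02 = dissimilarity(epoch[0], epoch[2], ranges)
print(d01, d02)
```

For the first pair, the ECG reading is missing in one record and is skipped, so the result averages only the accelerometer and PIR contributions.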
[Figure: two bar-chart panels plotting Percentage (0% to 100%) against Activity Classes (1 to 12). Legend in each panel: "Classification using mHealth Dataset" and "Classification with tiered integration".]
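The comparison in Figs. 2 and 3 is expressed as precision and recall per activity class. As a reminder of how these per-class values are derived from true and predicted labels, here is a minimal sketch. It is independent of the MLP classifier and the cross-validation setup used in the paper; the function name `per_class_precision_recall` and the toy labels are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[int, Tuple[float, float]]:
    """Return {class label: (precision, recall)} for every label seen."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_positive[t] += 1

    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        # Precision: correct predictions of this class / all predictions of it.
        precision = true_positive[label] / predicted[label] if predicted[label] else 0.0
        # Recall: correct predictions of this class / all true instances of it.
        recall = true_positive[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels for three activity classes (made up for this example).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
scores = per_class_precision_recall(y_true, y_pred)
print(scores)
```

In this toy run, class 2 has perfect recall but lower precision because one class 1 sample was predicted as class 2, which is the kind of per-class trade-off the two figures let a reader inspect.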