
An Extensible Conceptual Model for Tabular Scientific Datasets
Javad Chamanara, Michael Owonibi, Alsayed Algergawy, Roman Gerlach
Friedrich Schiller University of Jena
Germany
Email : firstname.lastname@uni-jena.de
DATASETS Symposium, June 2015

funded by :

Research Data Management (RDM)


Increasingly important because of
Researchers' increased awareness of the benefits of data management
Data proliferation
Funding agency requirements
Proliferation of RDM systems
Primary data and metadata management

Research Data Management (RDM) Systems


Examples: BExIS, Pangaea, Dryad, ...
Data heterogeneity challenge

Data model heterogeneity


Structural heterogeneity
Syntactic heterogeneity
Semantic heterogeneity

Research Data Management (RDM) Systems


Focus on making
Data discoverable by humans
Data downloadable as data files

Storage mechanism
Metadata + primary data storage/archiving as files
Metadata + data schema definition (in metadata) + primary data storage as files / DBMS

Research Data Management (RDM) Systems


Typical requirements
Heterogeneous data support
Data discovery beyond metadata
e.g. Which datasets have a temperature higher than 35 °C?

Data harmonization/integration
Flexible access pattern
Provenance management
Flexible security and access management
Machine and human interpretability of data
Semantic enablement
etc.

Current data management practices in many RDM systems cannot support all of these requirements

Aim
Datasets are predominantly tabular
Therefore, to manage tabular data effectively in a data repository, the composition of tabular datasets needs to be modelled such that it satisfies the manifold data management requirements outlined above

Application in biodiversity research domain


Examples of data
Applicability in other domains

Related Work : W3C Tabular Data Model


Simple table
set of rows where each row contains information about an object

Annotation table
simple table + additional metadata

Group of tables

Related Work : INSPIRE Observation and Measurement (O&M) Model
Representation of records of scientific measurement
Observation as an event whose result is an estimation of the value
of some property(ies)
of a feature-of-interest
obtained using a specified procedure (the instrument, algorithm or process used)
at a specific time
under some conditions (event-specific parameters, e.g. instrument settings)
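
The definition above can be read as a record with one field per clause. The following is only a minimal sketch of that reading; the class and field names are illustrative assumptions and are not the attribute names used by the O&M standard, and the probe name is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass
class Observation:
    """One observation event, following the informal definition on the slide."""
    observed_property: str      # the property whose value is estimated
    feature_of_interest: str    # the thing the property belongs to
    procedure: str              # instrument, algorithm or process used
    result: Any                 # the estimated value
    time: datetime              # the specific time of the observation
    parameters: Dict[str, Any] = field(default_factory=dict)  # event-specific conditions


# Illustrative values only; "probe T-100" is a hypothetical instrument name.
obs = Observation(
    observed_property="soil temperature",
    feature_of_interest="plot A",
    procedure="probe T-100",
    result=22.0,
    time=datetime(2012, 1, 1, 12, 0),
    parameters={"depth": -10},
)
```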

Related Work : Statistical Data and Metadata Exchange (SDMX) Model
Standards for describing statistical data and metadata
Data structure definition as a set of columns
Columns

Function - dimension, measure or attribute


Roles - identity, time format, frequency
Based on a pre-defined concept
Other properties - , data type, domain

Earlier Work
High level concepts of research data repository conceptual model
[Figure: component model diagram (cmp Component Model) showing the components Metadata Structure, Semantics, Data Structure, Geo, Metadata, Data, and Administration, connected by "use" dependencies]

J. Chamanara and B. König-Ries, "A conceptual model for data management in the field of ecology," Ecological Informatics, vol. 24, 2014


Core Model : Dataset


Dataset
Set of tuples
Data container for observations, measurements, simulations, and other supported forms of data
Has one Data Structure


Core Model : Data Structure & Data Descriptor

Data Structure

Defines the organization & meaning of the data
Comprises several Data Descriptors

Data Descriptor
Contains information about the columns of datasets, such as the name, data type, unit of measurement, procedure of obtaining data, methodology, scale, etc.
Is either a Variable or a Parameter (an auxiliary variable)
Semantic annotation capability
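
The relationship between Dataset, Data Structure, and Data Descriptor described on these two slides can be sketched as three small classes. This is an illustrative sketch only, not the BExIS 2 implementation; all class and field names, the `cm` unit for depth, and the length check are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class DataDescriptor:
    name: str
    data_type: type
    unit: Optional[str] = None
    is_parameter: bool = False         # False: Variable, True: Parameter
    annotation: Optional[str] = None   # semantic annotation, e.g. a concept identifier


@dataclass(frozen=True)
class DataStructure:
    descriptors: Tuple[DataDescriptor, ...]   # a structure comprises several descriptors


@dataclass
class Dataset:
    structure: DataStructure                  # a dataset has exactly one Data Structure
    tuples: List[Tuple[Any, ...]] = field(default_factory=list)

    def add_tuple(self, *values: Any) -> None:
        # Assumed rule: every tuple supplies one value per descriptor.
        if len(values) != len(self.structure.descriptors):
            raise ValueError("tuple length does not match the data structure")
        self.tuples.append(values)


structure = DataStructure((
    DataDescriptor("Tmp", float, unit="°C"),
    DataDescriptor("Depth", int, unit="cm", is_parameter=True),  # unit is an assumption
))
dataset = Dataset(structure)
dataset.add_tuple(22.0, -10)
```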


Core Model - Data Descriptor Reusability


Factor out reusable elements of variables & parameters
Reuse across the different data structures used in different datasets
Automatic unit conversion functionality

Benefits
Cross-dataset query
Easier data integration
Enhanced data discovery

[Figure: a single Data Descriptor "Temperature" reused by the Data Structures of Dataset 1 and Dataset 2, where it appears as the columns Temp(C) and T(F); the other columns shown are Plot, Depth, and Time]
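
A shared descriptor plus unit conversion is what makes a cross-dataset query such as "which datasets have a temperature higher than 35 °C?" answerable. The sketch below illustrates the idea under stated assumptions: the `Column` class, the `TO_CELSIUS` table, the function name, and the sample values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Conversions into an assumed canonical unit (degrees Celsius).
TO_CELSIUS: Dict[str, Callable[[float], float]] = {
    "C": lambda v: v,
    "F": lambda v: (v - 32.0) * 5.0 / 9.0,
}


@dataclass
class Column:
    descriptor: str       # name of the shared Data Descriptor
    unit: str             # unit in which this dataset stores the values
    values: List[float]


def datasets_with_temperature_above(datasets: Dict[str, Column], limit_c: float) -> List[str]:
    """Return the names of datasets holding at least one temperature above limit_c (in °C)."""
    hits = []
    for name, column in datasets.items():
        if column.descriptor != "Temperature":
            continue
        convert = TO_CELSIUS[column.unit]
        if any(convert(v) > limit_c for v in column.values):
            hits.append(name)
    return hits


# Illustrative values: one dataset stores Celsius, the other Fahrenheit.
datasets = {
    "Dataset 1": Column("Temperature", "C", [22.0, 25.0, 21.0]),
    "Dataset 2": Column("Temperature", "F", [95.0, 103.0]),
}
matches = datasets_with_temperature_above(datasets, 35.0)
```

Because both columns point at the same descriptor, the query does not need to know that one dataset labels the column Temp(C) and the other T(F).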


Core Model - Data Cell


Data Tuple as a collection of Data Cells containing some values

Linked to Data Descriptor

Single vs Multiple Value Cell
Data Cell Auxiliary Information
Sampling time
Result time
Descriptions of the values
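
One way to picture a Data Cell is as a small record that carries its values together with the auxiliary information listed above. This is a sketch under assumptions: the class and field names are illustrative, and representing a single value cell as a one-element list is a design choice made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Union

Value = Union[int, float, str]


@dataclass
class DataCell:
    descriptor: str                          # name of the linked Data Descriptor
    values: List[Value]                      # one entry: single value cell; several: multiple value cell
    sampling_time: Optional[datetime] = None
    result_time: Optional[datetime] = None
    description: Optional[str] = None        # free-text description of the values

    @property
    def is_multi_valued(self) -> bool:
        return len(self.values) > 1


@dataclass
class DataTuple:
    cells: List[DataCell] = field(default_factory=list)

    def cell(self, descriptor: str) -> DataCell:
        """Look up the cell that belongs to the given descriptor."""
        for c in self.cells:
            if c.descriptor == descriptor:
                return c
        raise KeyError(descriptor)


row = DataTuple([
    DataCell("Tmp", [21, 24], sampling_time=datetime(2012, 1, 1)),
    DataCell("Depth", [-11]),
])
```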


Core Model : Sample Table


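S.N.  Tmp     Time    Depth  Pos.  Hu.
14    22, 22  1/1/12  -10          46
13    23.22   1/1/12  -10          45
16    21, 24  1/1/12  -11          30
16    18, 18  2/1/12  -10          25
18    14, 15  2/1/12  -9           25

Slide annotations: the header row is the Data Structure; columns are labelled Variable or Parameter; each row is an Observation (Tuple); Tmp cells holding two values (e.g. "22, 22") are Multiple Value Cells, the remaining cells are Single Value Cells.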


Model Extensions - Amendment


(Special) data cells
Attached to specific tuples
Example usage
capturing exceptional observations

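Different tuples can carry different Amendments

Observation (Tuple) rows under the Data Structure:

Soil_Moi.  Depth  Pos.  Hu.  Soil_N.  Tmp
12         -10    A     46   14       22
10         -10    B     45   13       23
12         -11    C     30   16       21
15         -10    A     25   16       18
17         -9     D     25   18       14

[Figure: the table also has a Time column, and Amendment cells A1 to A4 are attached to individual tuples; values shown for these: 78, 1, 2, 3, 5, 6, Yes, No, 100, 0.11, red]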
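
Since amendments belong to individual tuples rather than to the Data Structure, one simple way to picture them is as a per-tuple dictionary of extra cells. The sketch below assumes that representation; the class name, the helper function, and the assignment of amendment values to particular tuples are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataTuple:
    values: Dict[str, Any]                                     # cells defined by the Data Structure
    amendments: Dict[str, Any] = field(default_factory=dict)   # extra cells for this tuple only


def amendment_names(tuples: List[DataTuple]) -> List[str]:
    """Collect every amendment name used anywhere in the dataset."""
    names = set()
    for t in tuples:
        names.update(t.amendments)
    return sorted(names)


# Which tuple carries which amendment is an assumption made for illustration.
tuples = [
    DataTuple({"Soil_Moi": 12, "Depth": -10}, {"A1": 78}),
    DataTuple({"Soil_Moi": 10, "Depth": -10}),
    DataTuple({"Soil_Moi": 12, "Depth": -11}, {"A3": "Yes", "A4": 0.11}),
]
```

Tuples without exceptional observations simply carry an empty amendment dictionary, so the regular columns stay uniform across the dataset.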


Model Extensions - Extended Property


User-defined, dataset-specific attribute whose value applies to a single column
Sample usage
Storing the error margin of the instrument used to measure the values in a variable
Extended Properties (as shown on the slide)
Error: 0.10%
Rounded: Yes
Interval: 1 Sec.

Observation (Tuple) rows under the Data Structure:

Soil_Moi.  Depth  Pos.  Hu.  Soil_N.  Tmp  Time
12         -10          46   14       22
10         -10          45   13       23
12         -11          30   16       21
15         -10          25   16       18
17         -9           25   18       14
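
An Extended Property can be pictured as a key-value pair stored once per column instead of once per cell. The sketch below assumes that layout; the class and function names are illustrative, and the choice of which column each property is attached to is an assumption, since the extracted slide does not preserve it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Column:
    name: str
    values: List[Any]
    extended_properties: Dict[str, Any] = field(default_factory=dict)  # apply to the whole column


# Property-to-column assignment below is assumed for illustration.
columns = [
    Column("Tmp", [22, 23, 21, 18, 14], {"Error": "0.10%", "Rounded": "Yes"}),
    Column("Time", [], {"Interval": "1 Sec."}),
    Column("Depth", [-10, -10, -11, -10, -9]),
]


def properties_of(cols: List[Column], name: str) -> Dict[str, Any]:
    """Return the extended properties of the named column."""
    for c in cols:
        if c.name == name:
            return c.extended_properties
    raise KeyError(name)
```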


Model Extensions - View


Subset of a table obtained by selection or projection
Purpose
Further processing, sharing or sampling
Security / digital rights management

Source Dataset

Soil_N.  Tmp  Time     Soil_Moi  Depth  Pos  Hu.
16       18   2/11/01  15        -10    A    25
14       22   3/11/01  12        -10    A    46
13       23   4/11/01  10        -10    B    45
16       21   5/11/01  12        -11    C    30

View

Tmp  Time     Soil_Moi
18   2/11/01  15
22   3/11/01  12
23   4/11/01  10
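
The view on this slide can be reproduced with a selection followed by a projection. The sketch below is illustrative: `make_view` is a hypothetical helper, and the selection criterion `Depth == -10` is an assumption that happens to yield the three rows shown; the slide does not state which criterion was used.

```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]

COLUMNS = ["Soil_N.", "Tmp", "Time", "Soil_Moi", "Depth", "Pos", "Hu."]
source = [dict(zip(COLUMNS, values)) for values in [
    (16, 18, "2/11/01", 15, -10, "A", 25),
    (14, 22, "3/11/01", 12, -10, "A", 46),
    (13, 23, "4/11/01", 10, -10, "B", 45),
    (16, 21, "5/11/01", 12, -11, "C", 30),
]]


def make_view(rows: Sequence[Row],
              columns: Sequence[str],
              predicate: Callable[[Row], bool] = lambda r: True) -> List[Row]:
    """Selection (predicate on rows) followed by projection (subset of columns)."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]


# Assumed criterion: keep only the tuples measured at depth -10.
view = make_view(source, ["Tmp", "Time", "Soil_Moi"], lambda r: r["Depth"] == -10)
```

The view copies nothing from the hidden columns, which is what makes it usable for sharing a restricted part of a dataset.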


Model Extensions - Spanning View


View across multiple datasets that use the same Data Structure
Sample Data Structure: Soil_N., Tmp, Time, Soil_Moi, Depth, Pos, Hu.

Source Dataset 1

Soil_N.  Tmp  Time     Soil_Moi  Depth  Pos  Hu.
16       18   2/11/01  15        -10    A    25
14       22   3/11/01  12        -10    A    46
13       23   4/11/01  10        -10    B    45
16       21   5/11/01  12        -11    C    30

Source Dataset 2

Soil_N.  Tmp  Time     Soil_Moi  Depth  Pos.  Hu.
26       33   1/1/11   30        -10    B     15
14       28   1/2/11   23        -10    C     32
13       29   1/3/11   28        -10    D     21

Spanning View (based on Source Dataset 1 & Source Dataset 2)

Tmp  Time     Soil_Moi
18   2/11/01  15
22   3/11/01  12
23   4/11/01  10
33   1/1/11   30
28   1/2/11   23
29   1/3/11   28
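
A spanning view applies the same selection and projection to several datasets that share one Data Structure and concatenates the results. The following sketch reproduces the slide's spanning view under two assumptions: `spanning_view` is a hypothetical helper, and `Depth == -10` is an assumed selection criterion that matches the rows shown.

```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]

COLUMNS = ["Soil_N.", "Tmp", "Time", "Soil_Moi", "Depth", "Pos", "Hu."]


def rows(*records: tuple) -> List[Row]:
    return [dict(zip(COLUMNS, r)) for r in records]


dataset1 = rows(
    (16, 18, "2/11/01", 15, -10, "A", 25),
    (14, 22, "3/11/01", 12, -10, "A", 46),
    (13, 23, "4/11/01", 10, -10, "B", 45),
    (16, 21, "5/11/01", 12, -11, "C", 30),
)
dataset2 = rows(
    (26, 33, "1/1/11", 30, -10, "B", 15),
    (14, 28, "1/2/11", 23, -10, "C", 32),
    (13, 29, "1/3/11", 28, -10, "D", 21),
)


def spanning_view(datasets: Sequence[Sequence[Row]],
                  columns: Sequence[str],
                  predicate: Callable[[Row], bool] = lambda r: True) -> List[Row]:
    """Select and project over several datasets that share one data structure."""
    structures = {tuple(ds[0].keys()) for ds in datasets if ds}
    if len(structures) > 1:
        raise ValueError("all source datasets must share one data structure")
    return [{c: r[c] for c in columns} for ds in datasets for r in ds if predicate(r)]


view = spanning_view([dataset1, dataset2], ["Tmp", "Time", "Soil_Moi"],
                     lambda r: r["Depth"] == -10)
```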


Model Extensions - Dataset Version


Permanent, change-resistant, citeable copy of a dataset
Independent of subsequent changes
Composed of Data Tuples
A Dataset can have multiple Dataset Versions

Based on
Checkout / check-in mechanism
Version difference computation and storage
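
The checkout / check-in cycle and the version difference can be sketched as follows. This is a simplified illustration, not the mechanism used in BExIS 2: it stores every version as a full immutable snapshot and computes differences on demand, whereas the slide mentions storing the differences. All names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Row = Tuple  # a Data Tuple, kept immutable


@dataclass
class Dataset:
    versions: List[Tuple[Row, ...]] = field(default_factory=list)

    def checkout(self) -> List[Row]:
        """Return an editable working copy of the latest version."""
        return list(self.versions[-1]) if self.versions else []

    def checkin(self, working_copy: List[Row]) -> int:
        """Freeze the working copy as a new version and return its 1-based number."""
        self.versions.append(tuple(working_copy))
        return len(self.versions)

    def diff(self, old: int, new: int) -> Dict[str, List[Row]]:
        """Tuples added and removed between two versions (1-based numbers)."""
        a, b = self.versions[old - 1], self.versions[new - 1]
        return {
            "added": [r for r in b if r not in a],
            "removed": [r for r in a if r not in b],
        }


ds = Dataset()
work = ds.checkout()
work.append((14, 22))
v1 = ds.checkin(work)

work = ds.checkout()
work.append((13, 23))
v2 = ds.checkin(work)
```

Editing the working copy after check-in does not alter version 1, which is what makes a version citeable.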


Conclusion
The tabular data model presented
can be used to enforce the structure and type of information to be collected
as well as a base for data validation

The model assists scientists in
dataset discovery, integration, quality management, provenance, citation, and interpretability

Used in BExIS 2 software


Projects using the RDM application include AquaDiva (https://aquadivapub1.inf-bb.uni-jena.de/) and iDiv (http://idata.idiv.de/about-bdu)
About 5 more projects are planning to migrate to or start using BExIS 2
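
The conclusion's point that a data structure can serve as a base for data validation can be sketched as a check of each incoming row against the expected columns and types. This is a minimal illustration under assumptions: the `Structure` alias, the `validate` function, and the error messages are hypothetical and do not describe the BExIS 2 validation API.

```python
from typing import Any, Dict, List, Sequence, Tuple

# (column name, expected Python type): a simplified stand-in for a Data Structure
Structure = Sequence[Tuple[str, type]]


def validate(structure: Structure, row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row conforms."""
    errors = []
    known = {name for name, _ in structure}
    for name, expected in structure:
        if name not in row:
            errors.append(f"missing column: {name}")
        elif not isinstance(row[name], expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in known:
            errors.append(f"unexpected column: {name}")
    return errors


structure = [("Tmp", float), ("Depth", int)]
ok = validate(structure, {"Tmp": 22.0, "Depth": -10})
bad = validate(structure, {"Tmp": "warm", "Pos": "A"})
```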


Thanks For Your Attention

Any Questions?
