Linked Big Data: Towards a Manifold Increase in Big Data Value and Veracity
Abstract-The Web of Data is an increasingly rich source of information, which makes it useful for Big Data analysis. However, there is no guarantee that this Web of Data will provide the consumer with truthful and valuable information. Most research has focused on Big Data's Volume, Velocity, and Variety dimensions. Unfortunately, Veracity and Value, often regarded as the fourth and fifth dimensions, have been largely overlooked. In this paper we discuss the potential of Linked Data methods to tackle all five Vs, and particularly propose methods for addressing the last two dimensions. We draw parallels between Linked and Big Data methods, and propose the application of existing methods to improve and maintain quality and address Big Data's veracity challenge.

Keywords-Linked Data, Web of Data, Veracity, Value, Big Data Dimensions

I. INTRODUCTION

Advancements in Web applications (e.g. social networks) and hardware compliant with the Internet of Things concept brought a change to data processing and analytics methods. This is due to the vast amount of data being generated continuously, called Big Data, where data is characterised not only by the enormous volume or the velocity of its generation but also by the heterogeneity, diversity and complexity of the data [18]. Originally, Big Data was described as a three-dimensional model, citing Volume, Variety and Velocity as the three major characteristics and challenges for effectively generating value. Later, Veracity was introduced as a fourth, frequently overlooked but equally pressing dimension. The Veracity dimension deals with the uncertainty of data due to various factors such as data inconsistencies, incompleteness, and deliberate deception.

Value is often regarded as a fifth dimension, addressing the need to enrich raw and unprocessed data by extracting higher-level knowledge for use across different scenarios. However, the term remains confusing, since the generation of actual value from Big Data is achievable only after all five described Vs have been successfully addressed. Therefore, we distinguish between Value as big data's fifth dimension, which requires specific attention, and big data value as the ultimate goal after carefully addressing all five dimensions. In this paper we discuss the potential of Linked Data methods to tackle all five Vs, and propose methods for addressing the value and veracity dimensions, which have remained largely unaddressed by previous research.

With around 5 billion documents¹, the Web is the largest crowdsourcing effort known to mankind. Taking Wikipedia for example, the open encyclopedia has almost 5 million articles written in English, and a total of 290 official languages. Such an effort requires human intervention to add, modify, and create translations to a big knowledge base. Having said that, it is a known fact that wiki articles may have inconsistencies over the same article in different languages, un-cited claims (often marked on Wikipedia), and completely false information, as pages can be easily modified by untrusted sources.

Moving towards a more structured web of data (Linked Data), the Schema.org² effort was initiated in order to create a common set of schemas to encourage web publishers to use structured data, using formats such as Microdata, RDFa or JSON-LD. With structured data, documents become semantically enriched resources and can be easily related to other resources. The return on investment of such practices is high, in the sense that the web content is shifted from (raw) data to information resources, thus drastically increasing the value. These resources can then easily be transformed into machine-comprehensible knowledge with the right tools. In 2013, Facebook introduced the Facebook Graph Search, a semantic search engine based on natural language meaning rather than just keywords. Similarly, Google

¹ Source: http://worldwidewebsize.com, Thursday 25th June 2015
² https://schema.org
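Structured data of the kind Schema.org promotes can be produced with ordinary tooling. The sketch below builds a minimal JSON-LD description of an event with Python's standard library; the event details and chosen properties are illustrative assumptions, not taken from any real page.

```python
import json

# Hypothetical event description using Schema.org terms (illustrative only).
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Working with Wikidata",
    "startDate": "2014-08-27T10:15:00",
    "performer": {"@type": "Person", "name": "Markus Krötzsch"},
}

# Serialised as JSON-LD, ready to be embedded in a web page.
print(json.dumps(event, indent=2, ensure_ascii=False))
```

Embedded in an HTML page, such a snippet turns an otherwise opaque document into a semantically enriched resource that crawlers and search engines can relate to other resources.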
A report by IBM [24] discusses the real-world use of big data and how value can be extracted from uncertain data. Schroeck et al. describe that advancements in techniques (such as analytic techniques) enabled entities to make use of the huge amount of data to extract new insights quickly and accurately. On the other hand, similarly to Buhl et al. [8], the authors state that these entities should consider veracity as an important dimension, by adding meaning to the data. According to the authors, this can be achieved through data fusion tools or fuzzy logic approaches. Participants in the IBM survey [24] also stated that the integration component on big data platforms is a core component to enable good big data analytic efforts, which in return creates value.

III. LINKED DATA ENHANCING THE VALUE OF RAW DATA

In May 2007 the first version of the Linked Open Data cloud (LOD Cloud) [15] was published as a set of 12 datasets, following the Linked Data principles⁴. During the last eight years the LOD Cloud has grown at a fast pace, with around 1014 datasets from 570 providers (as of August 2014) having been added. All of these data providers use Linked Data as a common standard representation for interoperability.

The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web through URIs, HTTP and RDF [5]. Big Data modelled according to the Linked Data principles can be considered as structured Big Data: the same five Vs apply; in particular, the overall LOD Cloud amounts to a voluminous dataset (Volume), and the rise of streaming RDF data and sensor

steps, namely Semantic Lifting, Metadata Enrichment, Data Integration and Reasoning. These are discussed further in the remainder of this section.

A. Semantic Lifting

Semantic Lifting is the process of transforming unstructured, semi-structured or even structured data into semantically enriched information using RDF schemas as vocabularies. Nowadays a lot of open source conversion tools offer Linked Data as one possible output. A key advantage of the RDF data model is that the structure of the schema and the data is the same, i.e. both are represented as subgraphs of one graph. Therefore, for example, querying the data together with the schema can easily be done in the same query language (SPARQL).

Schemas can either be reused (which is encouraged) or created. The Linked Open Vocabularies initiative⁵ has listed around 500 vocabularies for reuse. This is the first step towards making data meaningful and valuable. In the Linked Data context, valuable means enabling machines to extract meaning, and reusing resources by linking them.

Consider the following two snippets in CSV⁶:

Time,Event,Type,Presenter,Location
...
27 Aug 2014 09:00,Wikidata,Keynote,Markus Krötzsch,
27 Aug 2014 10:15,Working with Wikidata: A Hands-on Guide for Researchers and Developers,Tutorial,Markus Krötzsch,
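A semantic-lifting step over such a CSV snippet can be sketched in a few lines of Python. This is a minimal, hypothetical converter: the namespace mirrors the Turtle example used in this paper, but the predicate names and subject URIs are assumptions, and a real pipeline would use a proper RDF library rather than string-built triples.

```python
import csv
import io

# Namespace from the paper's Turtle example; predicate names are assumptions.
EX = "http://purl.org/net/wiss2014/"

SCHEDULE_CSV = """\
Time,Event,Type,Presenter,Location
27 Aug 2014 09:00,Wikidata,Keynote,Markus Krötzsch,
27 Aug 2014 10:15,Working with Wikidata: A Hands-on Guide for Researchers and Developers,Tutorial,Markus Krötzsch,
"""

def lift(csv_text):
    """Lift each CSV row into (subject, predicate, object) triples."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        subject = f"<{EX}schedule/{i}>"
        triples.append((subject, "rdf:type", f"ex:{row['Type']}"))
        triples.append((subject, "ex:title", f'"{row["Event"]}"'))
        triples.append((subject, "ex:presenter", f'"{row["Presenter"]}"'))
    return triples

# Print the lifted triples in a Turtle-like notation.
for s, p, o in lift(SCHEDULE_CSV):
    print(s, p, o, ".")
```

The point of the exercise is that once the rows become triples, the schema travels with the data: the same graph query language can ask about presenters, event types, or the vocabulary itself.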
@prefix ex: <http://purl.org/net/wiss2014/> .
# ... further prefix bindings

<ex:presenters/#MarkusKrötzsch> rdf:type foaf:Person ;
    ex:hasAffiliation dbpedia:Dresden_University_of_Technology ;
    foaf:name "Markus Krötzsch" .

<ex:schedule/#Keynote1> rdf:type ex:Keynote ;
    ex:presenter <ex:presenters/#MarkusKrötzsch> ;
    ex:title "Wikidata" ;
    dcterms:date "2014-08-27T09:00:00"^^xsd:dateTime .

<ex:schedule/#Tutorial1> rdf:type ex:Tutorial ;
    ex:presenter <ex:presenters/#MarkusKrötzsch> ;
    ex:title "Working with Wikidata: A Hands-on Guide for Researchers and Developers" ;
    dcterms:date "2014-08-27T10:15:00"^^xsd:dateTime .

Listing 1. RDF representation of the CSV snippets (Section III-A) in Turtle

With such a representation, a machine can understand that Markus Krötzsch will be presenting both sessions, and that he is affiliated with the University of Technology in Dresden. dbpedia:Dresden_University_of_Technology is a semantic resource in DBpedia, thus a machine can infer more knowledge, such as the country and the year the university was established. Therefore value is not added only through the semantic enrichment of the data itself, but also through reuse by linking to external and internal resources (such as <ex:presenters/#MarkusKrötzsch>).

Thanks to specifications such as RDFa, Linked Data can also be embedded into HTML pages to make their content useful for both human readers and machine services. The Schema.org vocabulary is being promoted explicitly with this use case in mind.

B. Promoting Metadata into Valuable Data

Data can be made more valuable (retrievable, comprehensible, reusable, ...) by providing expressive metadata using a standardised metadata vocabulary. Sophisticated metadata are often more complex than simple key/value lists. For example, descriptions of provenance, such as what activity, involving what agents, led to the creation of the artefact being described, form knowledge graphs of their own (for example the W3C PROV standard [12]). More than, e.g., relational databases or XML, the RDF data model is capable of handling this blurring distinction between metadata and data, since metadata is described in the same way as the data itself.

C. Link Discovery and Interlinking

The LOD Cloud is clustered into a variety of domains. All datasets are represented in the same standard data model; however, different datasets typically employ different identifiers for the same things (e.g. the same person) and might also use different schemas. Since RDF is represented as graphs, it is easy to create edges between similar nodes in different datasets (or graphs). Similar nodes between different datasets can be discovered through semantic matching. For example, a tourist office can link guided walking tour information with geolocation data, by matching the names of the tour's places of interest against GPS locations. This discovery would enable further analysis and possibly more innovative use of the data itself, including its exploitation for business purposes, as in finding the most interesting route taking the least time.

Tools helping this link discovery include Silk [30] and LIMES [20]. Silk⁷ is a flexible link discovery framework allowing users to define linkage heuristics using a declarative language. Similarly, LIMES⁸ is a large-scale link discovery framework based on metric spaces. Another approach for link discovery is DHR [13], where, unlike Silk and LIMES, the links are discovered by closed frequent graphs and similarity matching.

D. Data Integration

The amount of new data being generated every second is resulting in an even bigger information overload, making it hard to find the needle in the haystack. Data integration is the process of incorporating heterogeneous data from different sources into a central, canonical data source, thus establishing a single entry point with only the required information.

⁷ http://silk-framework.com
⁸ http://aksw.org/Projects/limes
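The name-based matching used in the tourist-office example above, which data integration builds on, can be sketched with a naive string-similarity heuristic. The place names and coordinates below are hypothetical, and difflib's ratio stands in for the declarative linkage rules of Silk or the metric-space measures of LIMES.

```python
from difflib import SequenceMatcher

# Two hypothetical sources: tour stops from one dataset, a gazetteer from another.
tour_stops = ["Upper Barrakka Gardens", "St. John's Co-Cathedral"]
gazetteer = {
    "Upper Barrakka Gardens": (35.8956, 14.5122),
    "Saint Johns Co Cathedral": (35.8978, 14.5125),
    "Fort St. Elmo": (35.9020, 14.5186),
}

def discover_links(stops, places, threshold=0.7):
    """Link each stop to the GPS location of the most similar place name."""
    links = {}
    for stop in stops:
        best_name, best_score = None, 0.0
        for name in places:
            score = SequenceMatcher(None, stop.lower(), name.lower()).ratio()
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= threshold:  # keep only sufficiently confident matches
            links[stop] = places[best_name]
    return links

print(discover_links(tour_stops, gazetteer))
```

The threshold is the crude analogue of a linkage heuristic: raising it trades recall for precision, which is exactly the tuning that dedicated link-discovery frameworks expose declaratively.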
Data integration is a higher-level process, which incorporates tasks such as link discovery and quality assessment. Unlike link discovery on its own, data integration does not merely create links between the datasets being integrated.

LDIF⁹ is a framework for Linked Data Integration. LDIF translates linked datasets into a clean, local target representation, keeping track of the provenance [25].

The sources to be integrated with a linked dataset can have different formats. Ontology-based data access (OBDA) can assist in combining data from multiple heterogeneous sources through a conceptual layer [9]. Therefore, using OBDA, linked datasets can be integrated together with datasets having different formats. Ontop [1] is an example of an OBDA framework, where relational databases are queried as virtual RDF graphs. Similarly, Ultrawrap [27] encodes a relational database into an RDF graph using SQL views, allowing SPARQL queries to be executed on these RDF graphs without requiring Extract-Transform-Load (ETL). Another such NoETL approach is SparqlMap [28], which is a SPARQL-to-SQL rewriter based on the W3C R2RML specification¹⁰. A survey of further OBDA approaches can be found in [31].

⁹ http://ldif.wbsg.de/
¹⁰ http://www.w3.org/TR/r2rml/

E. Reasoning

Reasoning enables consumers to derive new facts that were not previously explicitly expressed in a knowledge base, usually using the simple RDF Schema logic, description logic as implemented in the OWL Web Ontology Language, or first-order predicate logic. Thanks to the wide tool support, e.g. by the OWL reasoner HermiT¹¹, reasoning is one of the main selling points of Semantic technologies. In Linked Data, reasoning poses a number of challenges, as mentioned by Polleres et al. in their lecture notes [21]. These challenges include the existence of inconsistent data and the evolution of the data. Solutions for reasoning over big linked datasets and streaming linked data are available [21, 7, 10].

¹¹ http://hermit-reasoner.com

IV. LINKED DATA IDENTIFYING VERACITY IN BIG DATA

The fourth dimension of Big Data, Veracity, is more challenging than volume, variety, velocity and value. It is defined as "conformity with truth or facts" by the Merriam-Webster dictionary. This definition makes the veracity dimension difficult to measure, as truth is not always known objectively. In practice, a consumer might doubt data from producer A but trust data from producer B. On the other hand, another consumer might actually doubt the truthfulness of producer B, but blindly trust producer A.

Zaveri et al. present a comprehensive survey of quality metrics for linked open datasets [32]. Most of the quality metrics discussed are deterministic and computable within polynomial time. On the other hand, once these metrics are computed on large datasets, the algorithms' upper bound grows and, as a result, the computation becomes intractable time-wise. In this section we describe eight quality metrics (extracted from [32]) that we deem relevant for the veracity challenge in relation to Linked Data, and probabilistic approximation techniques that can be used to compute them efficiently while still achieving a high precision.

A. Reservoir Sampling

Reservoir sampling is a statistics-based technique that facilitates the sampling of evenly distributed items. The sampling process randomly selects k elements (k ≪ n) from a source list, possibly of an unknown size n, such that each element has a k/n probability of being chosen [29]. Therefore, the size of k affects the computation time and the accuracy of the result. The reservoir sampling technique is part of the randomised algorithms family, which offer simple and fast solutions for time-consuming counterparts by implementing a degree of randomness.

1) Dereferenceability Metric: HTTP URIs should be dereferenceable, i.e. HTTP clients should be able to retrieve the resources identified by a URI by downloading from that URI. A typical web URI resource returns a 200 OK code indicating that a request was successful, and a 4xx or 5xx code if the request was unsuccessful. In Linked Data, a successful request should return an RDF document containing triples that describe the requested resource. Unless URIs include a hash (#), they should respond with a 303 Redirect code [23]. Dereferenceable URIs ensure that the data is complete in the sense that data consumers are able to retrieve machine-comprehensible representations of all resources of interest. This metric computes the ratio of correctly dereferenceable URIs.

2) Entities as Members of Disjoint Classes: A consistent dataset should be free of any logical or formal contradictions, both at an explicit level and an implicit level. Consistency is a central challenge for the veracity dimension as it ensures correctness of data. In the Web Ontology Language (OWL), classes can be defined as disjoint, meaning that they do not have any entities as common members. In practice, however, entities in a dataset may accidentally end up as members of disjoint classes. These would cause conflicting information, and data consumers would not be able to correctly interpret the data, thus decreasing the veracity and value of the data. This metric checks if the defined types (and their super types) for a resource are explicitly declared disjoint in the schema.

B. Bloom Filters

A Bloom Filter is a fast and space-efficient bit vector data structure commonly used to query for elements in a set ("is element A in the set?"). The size of the bit vector plays an important role with regard to the precision of the result. A set of hash functions is used to map each item added or looked up to a corresponding set of bits in the filter array. The main drawback of Bloom Filters is that they can produce false positives, therefore possibly identifying an item as existing in the filter when it is not, but this happens with a very low probability. The trade-off between a fast computation and a very close estimate of the result depends on the size of the bit vector. With some modifications, Bloom Filters are useful for detecting duplicates in data streams, generating a low false positive rate during the process [3].

1) Extensional Conciseness: A linked dataset is extensionally concise if there are no redundant instances. Having duplicate records with different IDs can cause doubtfulness, and thus lower the veracity of a dataset. This metric measures the number of unique instances found in the dataset. The uniqueness of instances is determined from their properties and values. An instance is unique if no other instance (in the same dataset) exists with the same set of properties and corresponding values.

C. Traditional Techniques

Some metrics do not require any probabilistic approximation techniques since their computation is straightforward, such as pattern matching some predicate in the dataset's triples. In this section we describe a number of simple metrics for Linked Data quality with regard to various veracity aspects.

1) Endpoint and Data Dump Availability: One of the most important aspects in both Linked Data and Big Data is availability. Data should be available for consumption from both SPARQL endpoints (i.e. standardised query interfaces) and data dumps. Whilst the latter makes available a snapshot (taken at different intervals such as hourly, daily, weekly etc.) of the data, an endpoint provides data consumers with the most recent (possibly up-to-date) version of a dataset. The URLs of such endpoints should be discoverable from the dataset's metadata. The VoID¹² and DCAT¹³ schemas, both W3C recommendations, are used to describe Linked Datasets and data catalogues in order to bridge dataset publishers and consumers. Both vocabularies have properties that allow the attachment of a SPARQL endpoint URI and a data dump URI to a dataset. Having the right SPARQL endpoint described in the dataset's metadata ensures that the consumed data is provided from the intended source.

¹² http://www.w3.org/TR/void/
¹³ http://www.w3.org/TR/vocab-dcat/

2) Misused OWL Datatype and Object Properties: Similarly to entities as members of disjoint classes, this metric is also part of the consistency dimension of quality. This metric checks if the value of a resource's property is correctly defined as an object or data literal, as intended by the vocabulary maintainer. An Object Property (owl:ObjectProperty) expects a resource as its value, whilst a Datatype Property (owl:DatatypeProperty) expects data of a simple type (e.g. string, integer, date/time) as its value. Misusing a property creates inconsistencies in the data and inhibits the consumer from appropriate reuse.

3) Usage of Undefined Classes and Properties: Using classes and properties without any formal definition, i.e. without a definition in any schema, makes resources meaningless, since data consumers cannot understand the semantics of the data. This further makes the data incoherent and thus reduces the effectiveness of tools such as reasoners. This metric checks that each resource type (the object attached to the rdf:type property) and each property attached to the resource can be dereferenced to a definition in some schema.

4) Usage of Blank Nodes: Blank nodes are local identifiers given during the publishing process, which are not URIs. These identifiers cannot be externally referenced and thus are useless to the data consumer. Blank nodes furthermore make merging different sources a difficult task [6]. Therefore, high usage of blank nodes reduces the value of a dataset. This metric computes the ratio of blank nodes in a dataset.

V. FINAL REMARKS

As discussed in Section II, Value and Veracity are paramount to successful Big Data technology. This article focuses precisely on these two dimensions, and on how Linked Data methodology can help to address them better.

In Section III we argue that Linked Data is the most suitable data model for increasing the value of data over conventional Web formats such as CSV, thus contributing towards the value challenge in Big Data. Linked Data converts raw data into information based on RDF schemas, enabling data interoperability between machines. Based on this concept, we also presented our vision for a semantic pipeline that creates a semantically rich knowledge base from heterogeneous formats available on the Web using a variety of techniques.

Section IV addressed the veracity aspect of Big Data. With veracity defined as "conformity with truth or facts", we described eight Linked Data quality metrics (a subset of [32]) which can (1) improve a consumer's perspective with regard to the data, and (2) become an enabler of a data cleaning process that in return improves the dataset's value. We also described two techniques, reservoir sampling and Bloom Filters, that can be used on quality metrics whose time complexity becomes intractable when assessing big datasets.

In the future we plan to implement more quality metrics to cover more aspects of Linked Data quality. We will also implement and evaluate the proposed semantic pipeline as-a-service, in the hope that it enables consumers to easily transform their Big Data into Linked Data-ready datasets.

ACKNOWLEDGMENT

This work is supported by the European Commission under the Seventh Framework Program FP7 grant 601043
(http://diachron-fp7.eu) and the Horizon 2020 grant 644564 (http://www.big-data-europe.eu).

REFERENCES

1. Bagosi, T. et al. The Ontop Framework for Ontology Based Data Access. In: The Semantic Web and Web Science - 8th Chinese Conference, Wuhan, China, August 8-12, 2014, Revised Selected Papers. 2014.
2. Barbieri, D. F. et al. Querying RDF Streams with C-SPARQL. In: SIGMOD Rec. 39(1) (Sept. 2010), pp. 20-26. http://doi.acm.org/10.1145/1860702.1860705.
3. Bera, S. K. et al. Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams. In: (2012).
4. Berners-Lee, T. Linked Data. http://www.w3.org/DesignIssues/LinkedData.html. July 2006.
5. Bizer, C., Heath, T., Berners-Lee, T. Linked data - the story so far. In: Int. J. Semantic Web Inf. Syst. (2009).
6. Bizer, C., Cyganiak, R., Heath, T. How to publish Linked Data on the Web. Web page. Revised 2008. Accessed 22/02/2015. 2007. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.
7. Bonatti, P. A. et al. Robust and scalable Linked Data reasoning incorporating provenance and trust annotations. In: J. Web Sem. 9(2) (2011), pp. 165-201.
8. Buhl, H. U. et al. Big Data. In: Business & Information Systems Engineering 5(2) (2013), pp. 65-69. http://dx.doi.org/10.1007/s12599-013-0249-5.
9. Calvanese, D. et al. Ontologies and databases: The DL-Lite approach. In: Reasoning Web. LNCS 5689. 2009.
10. Chevalier, J. A Linked Data Reasoner in the Cloud. In: ESWC. Ed. by P. Cimiano et al. Vol. 7882. LNCS. Springer, 2013, pp. 722-726.
11. Farhan Husain, M. et al. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce. In: Proceedings of the 1st International Conference on Cloud Computing. CloudCom '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 680-686.
12. Gil, Y. et al. PROV Model Primer. W3C Working Group Note. World Wide Web Consortium (W3C), 30th Apr. 2013. http://www.w3.org/TR/prov-primer/.
13. Hau, N., Ichise, R., Le, B. Discovering Missing Links in Large-Scale Linked Data. In: Intelligent Information and Database Systems. Ed. by A. Selamat, N. Nguyen, H. Haron. LNCS. Springer Berlin Heidelberg, 2013.
14. Hitzler, P., Janowicz, K. Linked Data, Big Data, and the 4th Paradigm. In: Semantic Web 4(3) (2013).
15. Jentzsch, A., Cyganiak, R., Bizer, C. State of the LOD Cloud. Version 0.3. 19th Sept. 2011. http://lod-cloud.net/state/ (visited on 2014-08-06).
16. Lange, C. Publishing 5-star Open Data. Tutorial at the Web Intelligence Summer School "Web of Data". 25th Aug. 2014. http://clange.github.io/5stardata-tutorial/.
17. Lehmann, J. et al. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. In: Semantic Web Journal (2014).
18. Llewellyn, A. NASA Tournament Lab's Big Data Challenge. https://open.nasa.gov/blog/2012/10/03/nasa-tournament-labs-big-data-challenge/. Oct. 2012.
19. Lukoianova, T., Rubin, V. L. Veracity roadmap: Is big data objective, truthful and credible? In: Advances in Classification Research Online 24(1) (2014), pp. 4-15.
20. Ngomo, A.-C. N., Auer, S. LIMES: A Time-efficient Approach for Large-scale Link Discovery on the Web of Data. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. 2011.
21. Polleres, A. et al. RDFS and OWL Reasoning for Linked Data. In: Reasoning Web. Semantic Technologies for Intelligent Data Access. LNCS. Springer Berlin Heidelberg, 2013.
22. Sänger, J. et al. Trust and Big Data: A Roadmap for Research. In: Database and Expert Systems Applications (DEXA). Sept. 2014, pp. 278-282.
23. Sauermann, L., Cyganiak, R. Cool URIs for the Semantic Web. Interest Group Note. W3C, Dec. 2008. http://www.w3.org/TR/cooluris/.
24. Schroeck, M. et al. Analytics: The real-world use of big data. How innovative enterprises extract value from uncertain data. Tech. rep. Saïd Business School, University of Oxford / IBM Institute for Business Value.
25. Schultz, A. et al. LDIF: Linked Data Integration Framework. In: ISWC Posters & Demos. 2011.
26. Schwarte, A. et al. FedX: Optimization Techniques for Federated Query Processing on Linked Data. In: The Semantic Web - ISWC 2011, Part I. 2011.
27. Sequeda, J., Miranker, D. P. Ultrawrap: SPARQL Execution on Relational Data. In: Web Semantics: Science, Services and Agents on the World Wide Web (2013).
28. Unbehauen, J., Stadler, C., Auer, S. Accessing Relational Data on the Web with SparqlMap. In: JIST. 2012.
29. Vitter, J. S. Random Sampling with a Reservoir. In: ACM Trans. Math. Softw. 11(1) (1985).
30. Volz, J. et al. Discovering and Maintaining Links on the Web of Data. In: Proceedings of the 8th International Semantic Web Conference. Berlin, Heidelberg, 2009.
31. Wache, H. et al. Ontology-based integration of information - a survey of existing approaches. In: IJCAI-01 Workshop: Ontologies and Information. Ed. by H. Stuckenschmidt. 2001, pp. 108-117.
32. Zaveri, A. et al. Quality Assessment for Linked Data: A Survey. In: Semantic Web Journal (2015).
33. Zeng, K. et al. A distributed graph engine for web scale RDF data. In: Proceedings of the 39th international conference on Very Large Data Bases. PVLDB'13. VLDB Endowment, 2013, pp. 265-276.